A primer on model-guided exploration of fitness landscapes for biological sequence design
Sam Sinai and Eric D. Kelsic
Abstract
Machine learning methods are increasingly employed to address challenges faced by biologists. One area that will greatly benefit from this cross-pollination is the problem of biological sequence design, which has massive potential for therapeutic applications. However, significant inefficiencies remain in communication between these fields, which result in biologists finding the progress in machine learning inaccessible, and hinder machine learning scientists from contributing to impactful problems in bioengineering. Sequence design can be seen as a search process on a discrete, high-dimensional space, where each sequence is associated with a function. This sequence-to-function map is known as a "Fitness Landscape" [1, 2]. Designing a sequence with a particular function is hence a matter of "discovering" such an (often rare) sequence within this space [3]. Today we can build predictive models with good interpolation ability due to impressive progress in the synthesis and testing of biological sequences in large numbers, which enables model training and validation. However, it often remains a challenge to find useful sequences with the properties we want using these models. In particular, in this primer we highlight that algorithms for experimental design, what we call "exploration strategies", are a related, yet distinct problem from building good models of sequence-to-function maps. We review advances and insights from the current literature (by no means a complete treatment) while highlighting desirable features of optimal model-guided exploration, and cover potential pitfalls drawn from our own experience. This primer can serve as a starting point for researchers from different domains who are interested in the problem of searching a sequence space with a model, but are perhaps unaware of approaches that originate outside their field.
Correspondence: [email protected], [email protected]
1: Dyno Therapeutics, Cambridge, MA
2: Wyss Institute for Biologically Inspired Engineering at Harvard Medical School, Boston, MA
3: Harvard University, Cambridge, MA
1. Preliminaries
While we attempt to cover the advances that capture methods in approaching sequence design (or adjacent) problems, this is not a review, and the field will always move faster than we can update this document. We often find ourselves explaining different facets of the challenge to researchers from different fields, and hence we think it might be useful as a resource to share and lower the barrier for working on this problem, or perhaps help in making known approaches more accessible. As the purpose of this writing is to enable better algorithms, we focus on methods (rather than results) when discussing approaches that were attempted to date.

2. What is the problem?
A "fitness landscape" is defined as a map Φ : X → Y between biological sequences X = {x_1, · · · , x_n} and their fitnesses Y = {y_1, · · · , y_n} [1]. Sequences in X are each made up of an alphabet (residues) Σ = {σ_1, · · · , σ_k}. Here, the term fitness captures the biological desirability of a sequence. In nature, this can refer to the effect of multiple functionalities y = {y_1, · · · , y_m} together on the organism's ability to survive or reproduce. For engineers, it might refer to other desired properties that don't necessarily align with biological fitness. The challenge for evolution or bioengineers is to find sequences x that exhibit some desired profile of functionalities y* [3]. The purpose of the primer is to introduce the reader to exploration algorithms E that query Φ as an oracle to find sequences with the desired properties y*. Given that we do not know if the sequences we are looking for actually exist, this may be an impossible task. However, assuming such a profile exists, we aim to propose E that can find solutions within some acceptable distance of y* with high probability.

In order to achieve this objective, it is often helpful to build a model of the landscape, Φ′, because accessing Φ directly is rather costly. Even evolutionary processes have an implicit model of fitness landscapes: they assume that the current members of the population are the most promising candidates to be modified towards a better sequence, i.e. that there is correlation between nearby sequences and their function. Engineers can build explicit models of the landscape Φ′, which we will discuss in section 3.

In its simplest form, the challenge would be to find a single sequence that satisfies a single criterion. For example, the green fluorescent protein (GFP) is a 238 amino-acid protein that is a good experimental test-case for machine-guided modeling and optimisation [4, 5, 6, 7, 8, 9], and one may look to improve its fluorescence (y ∈ ℝ), subject to the constraint that the protein still successfully folds.
|Σ| = 4 for DNA and RNA nucleotides, and |Σ| = 20 for standard amino-acids, which are the building blocks of proteins.

2.1 Why are biologists interested in building models of the landscape?

While optimising a protein like GFP may not be very impactful per se, as it is already quite good at what it does, there are other natural proteins that perform the tasks we are interested in, and we would greatly benefit from optimising them. For instance, capsid engineering [10], a domain that we work on, stands to benefit massively from improved capsids that enable delivery of genes into specific tissues in order to treat currently untreatable genetic diseases. While the known natural variants of these capsid proteins show the desired traits, their current capabilities are far below the levels that make them appealing as therapeutics.

Furthermore, as these landscapes are extremely large, we must understand a landscape's structure by sampling a very small portion of it. Evolutionary biologists are particularly interested in understanding landscape structure, because it can help make the process of evolution more predictable [11, 12, 13]. Such understanding has theoretical as well as practical implications. Practically, it can help us better estimate the probability of certain pathogens developing drug resistance, or even estimate the chances of a virus jumping from animal hosts to humans. More generally, the structure of the landscape determines how hard it is for evolution to "discover" novel functions. These can help us progress towards understanding big scientific questions. For instance, we know that life on earth eventually discovered oxidative processes as a means of producing energy. Before that, oxygen was very toxic to most living organisms, and the "Great Oxidation Event" resulted in a mass extinction [14]. But it is by no means obvious how likely it was for evolution to learn how to use oxygen to make ATP (the cell's energy currency).
Is the fitness landscape full of possible adaptations that allow for the use of oxygen? Or is it that perhaps the "fitness optimum" sits atop a wide base, where hill-climbing would get you there from many locations in the sequence space (however slowly)? Affirmation of either of these would suggest that these types of adaptations are probable. On the other hand, we often assume that a random sample of biological sequences is quite unlikely to contain sequences that exhibit a given function. In a famous experiment, Bartel and Szostak tried to provide some estimate of this challenge by screening a large pool of randomly synthesized RNA molecules, which were tested for their ability to ligate certain products [15]. In their experiment, the frequency of detecting a functional sequence was roughly 10⁻¹⁴.

To summarize, there is value in building better estimates of Φ for any landscape: first because it helps us optimize on the landscape, and second, because being able to estimate the landscape overall can help us better understand the outcome of stochastic processes within it. These are distinct problems, as in the second case we would like the error of the estimator to be low on average, whereas in the optimization case we simply care about finding high-performing variants. Depending on which objective we prioritize, or the type of model we desire, we might sample sequences within a landscape differently [16].

We will call a sequence with a particular arrangement of the alphabet a variant. Often, in biology, the process of exploring a landscape involves trying many variants in parallel, which constitutes a population. In natural (evolutionary) settings, it is often the case that multiple copies of the same variant exist in the population. In synthetic assays, it is more desirable to test unique variants (duplicates are included for calibration). We also assume a fixed population size.
In the approaches discussed here, we importantly assume that fitness landscapes are static, that is, they don't change over time or as the frequency of different variants in the population changes (landscapes that do are called "seascapes"). I.e., a population evolving over a landscape does not change the properties of the sequence-to-function map. Generally speaking, this does not hold for natural organismal evolution, because the context changes the map, but it does hold for biological sequence design. The difference between an E (or population) trying to optimize on a static landscape versus one that attempts to optimize on a dynamic one is like the difference between an indoor rock climber and a tennis player: in one case the challenge is somewhat fixed, while in the other it is opponent and tournament dependent.

2.2 Why is this problem hard?

The problem of optimizing sequences on fitness landscapes is hard because the map Φ often has complex, non-convex structure and landscapes are unimaginably large. Biological sequences often exhibit a significant amount of epistasis, which refers to the influence of genetic context on the effects of a change within a sequence. In the sequence design setting, if the effects of swapping one letter for another at a position (also termed locus) in the sequence were independent of what happened elsewhere, we could feasibly measure the effects of changing one letter to another for every position in the sequence (e.g. [17]) and then design the optimal sequence by picking the best letter at each position. Hill-climbing processes can optimize very easily in this setting as well [18].
While these types of landscapes may approximate the vicinity of a particular sequence well [17, 13, 7, 19, 11], single-peaked landscapes are thought to be rare, and there is evidence to suggest that the global structure of landscapes can be quite heterogeneous [13].

Epistatic effects come in multiple forms. Magnitude epistasis refers to cases where one allele's presence changes the magnitude of the effect of another one (e.g. diminishing returns: if you have an apple, a second apple is less appealing to you). Sign epistasis refers to situations where an allele's presence changes the sign of the effect of another allele (e.g. you are very sleepy, you drink coffee, you feel better; you now drink an espresso, you feel worse because you are anxious now, but no worse than you were before you had coffee). In the extreme case, reciprocal sign epistasis can lead to situations where two alleles that are good independently end up being deleterious together (or vice versa; e.g. menthol and coke may taste fine by themselves, but having them together is extremely unpleasant). Due to epistatic effects, optimisation on biological landscapes involves dealing with many local optima.

Secondarily, sequence spaces scale exponentially in sequence length L. The total number of possible sequences of a certain length is |Σ|^L, where L is the length of the sequence, and many sequences of different lengths may perform the same functions, often resulting in a huge space of possible candidates that exhibit similar function, suggesting that the solution space cannot be too difficult (or otherwise we would not see evolution succeed).

(Population geneticists call a particular pattern of letters within particular loci an allele.)
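The distinction between additive and epistatic effects can be made concrete with a toy two-locus example. The numbers below are hypothetical, chosen to produce reciprocal sign epistasis:

```python
# Toy two-locus fitness table (hypothetical numbers): each single mutation is
# beneficial on its own, but the double mutant is deleterious -- reciprocal
# sign epistasis.
fitness = {
    "ab": 1.0,   # wild type
    "Ab": 1.3,   # mutation A alone is beneficial (+0.3)
    "aB": 1.2,   # mutation B alone is beneficial (+0.2)
    "AB": 0.7,   # together they are deleterious
}

def epistasis(f):
    """Deviation of the double mutant from the additive expectation."""
    additive = f["ab"] + (f["Ab"] - f["ab"]) + (f["aB"] - f["ab"])
    return f["AB"] - additive

print(epistasis(fitness))  # negative: the loci interact (here, -0.8)
```

A hill-climber that fixes A first and B second would be misled here, which is exactly why such landscapes have many local optima.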
The most promising biological inventions that seem to remedy search times on such landscapes are recombination and the so-called "regeneration" processes, where walks in the sequence space are reset through gene duplication (ensuring you don't break your current best answer too much, and capping your distance from a potential solution) [21, 24, 25].

An exploration process on a fitness landscape can be described much like a Generative Adversarial Network (GAN) [26]: (a) a generator that makes a set of proposal sequences, and (b) a critic that evaluates the sequences produced by the generator (see Figure 1). Nature is the ultimate critic, but asking it questions is expensive. If we have a model that can propose functional sequences, then we can leave the job of the critic to nature. However, if our generator has a high rate of making poor choices (say it picks at random), then we would want to preface queries to nature by simulating the critic before committing to synthesizing new sequences. This involves building accurate models Φ′ and using them as critics. However, unlike GANs, error-propagation is not always possible throughout the entire chain, depending on our choice of generator and critic (you can't train nature's parameters).

All algorithms (natural or otherwise) that explore fitness landscapes require both a generation process and a selection process (Figure 1).

(For some static fitness landscapes, Valiant [22] showed that standard evolutionary processes cannot find optimal solutions when other learning algorithms do. A recent study by Kaznatcheev proposes that frequency-dependent dynamics, which we do not consider here, can overcome that barrier [23].)

Figure 1: Essential components for sequence design. (A) Algorithms propose a batch, measurements against the ground truth are taken, and the model is updated. (B) Natural evolution: natural mutation and recombination generate redundant variants; natural selection determines what reproduces. (C) Directed evolution: lab-based generation (EP-PCR, DNA shuffling, …) produces redundant variants that are assayed and reproduced or synthesized. (D) Life-cycle of model-guided sequence design: synthesis, measurement, and production. (E) Model-guided exploration: a generative model proposes unique virtual candidates, a critic model filters them over in-silico cycles, and a lab-based assay measures the surviving batch.
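The life-cycle in Figure 1E can be sketched as a simple batched loop: fit a surrogate on everything measured so far, let a generator propose candidates, and spend the expensive ground-truth queries only on the chosen batch. The interface below (the names `propose`, `fit_model`, and the toy landscape) is our own illustrative construction, not an established API:

```python
def design_loop(ground_truth, propose, fit_model, rounds, batch_size, seed_batch):
    """Batched model-guided design: each round updates the surrogate (Phi'),
    proposes a batch in silico, then measures it against the ground truth (Phi)."""
    data = {x: ground_truth(x) for x in seed_batch}
    for _ in range(rounds):
        model = fit_model(data)                     # update the surrogate
        for x in propose(model, data, batch_size):  # generator + in-silico critic
            data[x] = ground_truth(x)               # expensive lab measurement
    return max(data, key=data.get)

# Toy instantiation: maximize the count of 'A' in a length-8 string over {A, C}.
truth = lambda s: s.count("A")
fit_model = lambda data: max(data, key=data.get)    # "model" = incumbent best

def propose(best, data, b):
    """Propose all single-letter flips of the incumbent (assumes b <= len(best))."""
    flips = {best[:i] + ("A" if best[i] == "C" else "C") + best[i + 1:]
             for i in range(len(best))}
    return list(flips)[:b]

best = design_loop(truth, propose, fit_model, rounds=10, batch_size=8,
                   seed_batch=["CCCCCCCC"])
print(best)  # climbs to 'AAAAAAAA'
```

On this trivially additive toy landscape the loop is just hill-climbing; the point of the primer is what `propose` should do when the landscape is epistatic.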
3. How do we currently make models of fitness landscapes?
Fitness landscapes have been of interest to biologists for about a century [1]. There are two qualitatively distinct ways to approximate and study biological fitness landscapes: empirical approaches based on experiments, and statistical approaches based on models, which we will cover in detail. Empirical studies of fitness landscapes generally fall into two categories. First, studies in which samples are taken from an evolving population under natural or directed selection. These studies provide the benefit of producing diverse and large datasets that can be generated at once with little cost [27, 28]. They are also relatively faithful models of the evolutionary process. However, by definition, most of these studies explore the "viable" regions of the landscape, leaving potentially interesting parts of the landscape untouched. It is hence difficult to know if a region of the landscape is non-viable or unexplored. Second, studies that sample variants using random or targeted mutagenesis [29, 2, 11, 13, 4, 30, 31]. In these studies one can synthesize a set of sequences and assay their performance in the laboratory. Advances in sequencing technology have allowed approaches such as error-prone PCR and DNA-reshuffling to generate sequences with random variation, where poor mutants can also be observed. Finally, direct synthesis approaches, the most recent development in the domain, allow a large number of specifically designed edits to be synthesized and tested. The samples tested can be precisely selected, hence they are of high informational value. However, mutational scanning is labor and resource intensive, and many orders of magnitude fewer samples are explored.

It is clear that the two approaches are synergistic and can be combined. The first step has been to use available experimental data to infer consistent parameters for each class of models (described below).
Encouragingly, simple models such as additive models, low-order polynomial models, and global epistasis models have been well-suited to describe the local properties of experimentally probed landscapes [7, 32, 33, 17]. Due to multiple limitations in how well models can capture complex landscapes, for which long-range regularity is not well understood [13, 34, 35], it is hard to predict the fitness of variants that are far from those already measured. A modern synthesis would likely involve mapping the insights gained from the empirical data back to refine the models that best describe these landscapes, hopefully in an interpretable and generalizable manner. In this section, we will cover some classes of models that have provided useful insights over the years.
For most of the time that the concept of fitness landscapes has existed, measuring the effects of many mutations was hard. As a result, biologists have proposed multiple models of fitness landscapes that did not rely on empirical data, but aimed to emulate the properties of real landscapes.

The most fundamental but simple models are those in which the effects of mutations are independent, hence the collective effects of all mutations can be computed through an additive function. These landscapes set the benchmark against which to compare all other landscapes because, at their core, additive models capture the effects of a letter (allele) in a particular position (locus) when considered independently of each other. Assuming we have a sequence x := (σ_1, σ_2, ..., σ_L) of length L with each σ_i ∈ Σ, the function (or fitness) Φ(x) can be defined as:

Φ(x) = ∑_{i=1}^{L} φ(σ_i) ω_i     (1)

where φ(σ_i) is the effect of a specific letter at a specific location and the ω_i are per-position weights. If all weights ω_i = 1, this is the simple additive model. However, as described above, biological landscapes often deviate from additive (or linear) models, and a variety of approaches are designed to capture this deviation based on different perspectives.

(While generally used interchangeably, it's preferable to refer to the model as "linear" when ω_i ≠ 1.)
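Eq. 1 is straightforward to implement, and it makes concrete why additive landscapes are easy to optimize: the global optimum is just the per-position argmax. The effect table below is hypothetical:

```python
import itertools

def additive_fitness(x, phi, w=None):
    """Eq. 1: Phi(x) = sum_i phi_i(sigma_i) * w_i; all w_i = 1 gives the
    simple additive model. phi is a per-position table of letter effects."""
    if w is None:
        w = [1.0] * len(x)
    return sum(phi[i][s] * wi for i, (s, wi) in enumerate(zip(x, w)))

# Hypothetical effect table for a length-2 sequence over the alphabet {A, C}.
phi = [{"A": 0.5, "C": -0.2}, {"A": 0.1, "C": 0.3}]

# With no epistasis, design reduces to picking the best letter at each position:
best = "".join(max(p, key=p.get) for p in phi)

# Sanity check against exhaustive enumeration of all |alphabet|^L sequences.
assert best == max(("".join(t) for t in itertools.product("AC", repeat=2)),
                   key=lambda s: additive_fitness(s, phi))
print(best)  # 'AC'
```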
In the most basic form (sometimes used as the "null" model), epistasis is completely unstructured (random). This is termed House of Cards (HoC) epistasis. Landscape models can be built from a mixture of HoC and additive effects to incorporate measurement noise or otherwise incompressible fitness effects.

A generalization of additive models is known as "global epistasis" models, where the additive baseline is transformed by a monotonic but non-linear function [7, 4, 36]. Despite their simplicity, and since additive landscapes can be easily inferred, these landscapes are somewhat successful in describing empirical mutational scan data sets [34, 2, 11, 35, 32], possibly because of the local regions that these scans tend to cover (there may be one dominant peak in the local neighborhood, as assumed in [19, 13]), which could be separated by large "valleys" [37].

Φ(x) = Γ( ∑_{i=1}^{L} φ(σ_i) ω_i ) + η     (2)

where Γ is a non-linear function and η represents HoC epistasis.

A different group of elaborations on additive models aims to capture higher-order epistatic interactions by assigning (or inferring) weights associated with interactions. In the most common form, they are written as coefficients of a regression:

Φ(x) = ω_0 + ∑_{i=1}^{L} φ(σ_i) ω_i + ∑_{i=1}^{L−1} ∑_{j=i+1}^{L} φ(σ_i, σ_j) ω_{ij} + ...     (3)

where the ω_i are weights on individual loci (positions), the ω_{ij} are weights on pairwise interactions, and so on.
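A minimal sketch of Eq. 2, assuming unit weights and tanh as an (arbitrary) choice of the monotonic non-linearity Γ. Because Γ is monotone, it warps measured fitness values without changing the additive baseline's ranking of sequences:

```python
import math

def global_epistasis(x, phi, gamma=math.tanh, eta=0.0):
    """Eq. 2 with w_i = 1: Phi(x) = Gamma(sum_i phi_i(sigma_i)) + eta.
    eta stands in for House-of-Cards (unstructured) epistasis."""
    return gamma(sum(phi[i][s] for i, s in enumerate(x))) + eta

phi = [{"A": 1.0, "C": -1.0}] * 3

# A monotone Gamma preserves the additive ordering of sequences:
print(global_epistasis("AAA", phi) > global_epistasis("AAC", phi))  # True
```

This is why global epistasis models are attractive in practice: inferring them is nearly as easy as inferring an additive model, yet they absorb a common class of measurement non-linearity.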
Inspired by Ising and Potts models, these can be used as generative descriptions of landscapes with a certain amount of interaction among positions, or, with some tricks, the parameters ω can be approximated efficiently [38, ...].

(In an interesting recent study, Agarwala and Fisher investigate the dynamics of evolutionary walks on such landscapes, which are locally additive (and correlated) but become less correlated as the distance to each peak increases [19]. A neural network based implementation can be found here: https://mavenn.readthedocs.io/en/latest/)

A classic family of tunably rugged landscape models is the NK model, where N is the length of the genome and K is the number of positions in that genome that affect the fitness contribution of a single locus. The K parameter tunes the landscape for "ruggedness": K = 0 denotes the additive model, and K > 0 describes landscapes with an increasing number of local optima. While widely studied in the literature, as they share qualitative features with empirical landscapes (and can be used to study classes of landscapes [20, 42]), they don't make good models of specific landscapes. Hence, they are more suitable for evaluating search processes across many different classes; however, it is unclear which classes would be biologically relevant, and even less clear if they are relevant for sequence design in a particular context.

Biophysical models are simulators of the behavior of a particular set of biological sequences, and they primarily focus on biophysical structure. The simulators are built on thermodynamical principles or other domain knowledge. These landscapes show considerable epistasis and complexity akin to those observed in natural landscapes (at least for specific regions of the landscape) [16, 34]. The drawback of these models is that they work only for a subset of problems where the biophysical first-principles are indeed sufficiently explanatory.
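A throwaway implementation of an NK-style landscape over a binary alphabet makes the K parameter tangible. Here the epistatic neighborhood of each locus is simply its K circular neighbors (one common choice; random neighborhoods are also used):

```python
import itertools
import random

def nk_landscape(N, K, seed=0):
    """Random NK-style landscape on {0,1}^N: locus i's fitness contribution
    depends on itself and its K (circularly adjacent) neighbors, with i.i.d.
    uniform contribution tables. K = 0 is additive; larger K is more rugged."""
    rng = random.Random(seed)
    tables = [{bits: rng.random()
               for bits in itertools.product("01", repeat=K + 1)}
              for _ in range(N)]
    def fitness(x):
        return sum(tables[i][tuple(x[(i + j) % N] for j in range(K + 1))]
                   for i in range(N)) / N
    return fitness

f = nk_landscape(N=8, K=2)
print(f("01100110"))  # a value in [0, 1]
```

Counting the local optima of such landscapes as K grows is a quick way to see the ruggedness described in the text.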
An example of such a problem is the thermostability of proteins, although not all thermostability problems are well-captured by these methods. Two of the most famous examples of these simulators are those that simulate RNA secondary structure (e.g. the Vienna package [43]) and those that simulate protein stability and structure (e.g. Rosetta [44]). While an algorithm like Rosetta can be an excellent tool for the design of de-novo proteins [45, 46], i.e. those that live in deep energetic valleys, it is less suited to predicting the stability of natural sequences subject to perturbations, where the energy landscape is more subtle. As their strength, the type of bias that these models rely on is (by definition) close to well-established natural laws.
A class of fitness landscape models, made available by the recent advances in deep mutational assays, is fully data-dependent. These models, often black-box, are trained on data (both unsupervised and supervised), and are then used as an oracle for estimating fitness. Unsupervised models make use of the vast amount of unlabeled evolutionary sequence data that is available publicly to infer the fitness landscape (the Potts-model inspired work above also falls in this category) [47, 71, 48]. The assumption fundamental to these models is that similar evolutionary sequences roughly capture the same function. Supervised models depend on expensive assays to collect labels [49, 5, 4, 50, 51, 17], and often the assays are not designed for the models to optimally learn. In between these two limits there are semi-supervised approaches that try to take advantage of both sets of data [8, 9]. While it is possible to learn good models of the fitness landscape with these techniques, these black-box models are often not invertible, and hence even with a perfect map of sequence to function, sequence design requires the ability to optimize. Additionally, the performance of these models is not uniform across the sequence space and can vary drastically in different regions of the landscape. Therefore, a desirable property of these models is to have them report their confidence in their estimates. These uncertainty estimates are inherent to models like Gaussian Processes [49], but can also be achieved with neural networks by using techniques such as ensembling or dropout [52].
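As a sketch of the ensembling idea: train several models on resampled (or differently initialized) data, and report the mean prediction along with the spread across members as a confidence proxy. The "members" below are stand-ins with hand-made offsets, not trained networks:

```python
import random

def ensemble_predict(members, x):
    """Mean prediction plus across-member variance as an uncertainty proxy."""
    preds = [m(x) for m in members]
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return mean, var

random.seed(0)
# Fake ensemble: every member agrees on the signal (count of 'A') but carries
# its own small fitted offset, mimicking disagreement between retrained models.
members = [lambda x, b=random.gauss(0, 0.1): x.count("A") + b for _ in range(10)]

mu, var = ensemble_predict(members, "AACA")
print(mu, var)  # mean near 3, small but non-zero variance
```

In a real design loop, regions where `var` is large are exactly where the model should not be trusted as a critic, and where exploration is most informative.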
4. What does a good solution look like?
In this segment, we briefly cover the definitions and useful metrics by which sequence design algorithms may be evaluated. Often, sequence design is online, meaning that decisions have to be made as we collect information, and hence, in retrospect, better decisions might have been possible. In this context, an "online algorithm" is an approach that provides certain guarantees for performance, whereas a "heuristic algorithm" is one that provides no such guarantees. Unless a particular regularity in the structure of the fitness landscape can be exploited, the majority of algorithms that we will describe provide no performance guarantees.

Experimental biology is hard, and often getting any measurement of the "ground-truth" function of a sequence takes significant time and effort. It is possible to do these measurements in batches of a certain size. Hence, if one has a budget to test a total of B sequences and a maximum batch size of b, at each time step the process gets to use the information gathered so far to propose sequences for the next batch. Practically, batch sizes for DNA synthesis can be very large, and each round of experiments can take a few months, hence algorithms that perform better with fewer batches are more practical.

At their core, these algorithms need to balance the exploration-exploitation trade-off, based on the number of experiments that are made available to them (the "horizon"). Exploration can be seen as an information gathering act on the landscape, and exploitation is optimization of the desired outcomes based on that information.

4.1 Outcomes

Sequence design algorithms are considered useful if they can find a large number of distinct high-performing sequences. We will break this heuristic down into two parts: optimality and diversity.
These can be measured in multiple ways, and we discuss some simple metrics here.

For optimality, in most landscapes that these algorithms would actually be used for, critical information such as the best possible fitness y* or the set of all local maxima M is unknown (otherwise, why use an algorithm to find them). Without loss of generality, we will assume that maximization is the objective, but it is noteworthy that while it is common to assume that the best sequence is the one with the highest value max(y), in reality the most desirable value of a trait is not necessarily the highest value. For instance, binding a particular target may be desirable, but binding it too strongly may be less desirable than binding it at a moderate level.

If the target values were known for a landscape, an algorithm could be considered efficient in its own right if it could find sequences x_i ∈ S_ε such that ||Φ(x_i) − y*|| < ε with probability p ≥ 1 − γ, for some acceptable tolerance ε, γ ≥ 0 and some bounded budget B(L, Σ). Ideally, we would like |S_ε| >> 1. However, we often do not have this information about the landscape in hand.

Without such information, one can assume the existence of a desired profile y* and choose ε ad hoc (or perhaps anneal it) accordingly. Most approaches simply choose a single-dimensional y and assume y* = ∞, which translates to optimizing for max(y) for some bounded budget and batch size. An expansion would be to consider the cardinality |S_τ|, where S_τ = {x_i | Φ(x_i) > y_τ} and y_τ ∈ ℝ is some minimum desired value. Notably, different algorithms may be better or worse in different budget and batch size regimes.

While finding a good solution is a primary goal, finding multiple solutions that are distinct is more desirable. This is because there are always reasons beyond the designer's control that can make a particular solution inadmissible. Having higher diversity hedges against this risk.
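The |S_τ| metric is trivial to compute once measurements are in hand; a sketch with a toy fitness function:

```python
def hit_set(sequences, Phi, y_tau):
    """S_tau = { x : Phi(x) > y_tau }: the tested sequences exceeding a minimum
    desired fitness; its cardinality is a simple optimality metric."""
    return {x for x in sequences if Phi(x) > y_tau}

Phi = lambda s: s.count("A") / len(s)       # toy fitness: fraction of A's
batch = ["AAAA", "AACC", "CCCC", "AAAC"]

S = hit_set(batch, Phi, y_tau=0.6)
print(sorted(S))  # ['AAAA', 'AAAC']
```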
Ideally, you don't get diversity by perturbing your best solution, but by finding very different sequences that are good at doing the same thing (finding many maxima in multi-peaked fitness landscapes).

Similar to the optimality case, if we knew M, we could use the number of found maxima |M′_{y_τ}|, where M′_{y_τ} ⊆ M is the set of maxima found above fitness y_τ by the algorithm, as a measure of diversity.

When we do not have access to the entire landscape, we can measure the diversity of sequences in S_τ. Unfortunately, summarizing the diversity of a set of sequences can be challenging when information about peaks is not available. High-dimensional clustering has its own set of heuristic approaches, and we will not discuss them here. Simple metrics would include average pairwise edit distance, metric distance on embedded spaces, clustering, and site-specific entropy. None of these would uniquely describe the diversity of the set, and each is subject to its own drawbacks.

4.2 Desirable algorithmic properties

Apart from the outcomes of interest described above, there are also properties of the algorithm that would make it more practical or desirable. Depending on the type of problem, different properties could be prioritized.
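For example, average pairwise Hamming distance (the special case of edit distance for equal-length sequences) can be computed directly, with the caveats noted above:

```python
from itertools import combinations

def mean_pairwise_hamming(seqs):
    """Average pairwise Hamming distance of equal-length sequences --
    one simple (and imperfect) summary of a hit set's diversity."""
    pairs = list(combinations(seqs, 2))
    return sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs) / len(pairs)

print(mean_pairwise_hamming(["AAAA", "AAAA", "AAAA"]))  # 0.0: no diversity
print(mean_pairwise_hamming(["AAAA", "CCAA", "AACC"]))  # 8/3: distinct solutions
```

Note that this metric cannot distinguish three sequences on one peak from three sequences on three different peaks at the same mutual distance, which is exactly the drawback the text warns about.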
4.2.1 Efficiency

We would like the algorithm to be efficient in two senses: (1) it should make fewer queries q_Φ to Φ (experimental samples) to achieve the same outcome, and (2) we would like q_Φ′ / q_Φ to be small. This second property also relates to scalability, as it determines the computational resources required for the design of each batch.

4.2.2 Scalability

To be able to use our full experimental bandwidth, our approach should be able to generate as many samples as our experimental batch of size B can accept. Furthermore, algorithms should scale as L grows. Note that it is possible to have algorithms that have low q_Φ and q_Φ′ but are not scalable. For instance, they may be unable to produce enough diversity to fill a batch of sequences due to mode collapse.

4.2.3 Consistency

If our models get better, our performance should also improve, i.e. E should make use of the information that is available to it.

4.2.4 Independence

Ideally, E should be independent of the model noise and bias. The algorithms shouldn't assume a particular type of noise or bias, and we should be able to change the model or the underlying landscape without making the algorithms useless [53]. I.e., if we switch the model class, we would still have a usable exploration algorithm.

4.2.5 Robustness

Closely related to independence, if the model is bad (often because of misspecification [34]), the algorithm shouldn't fail completely. I.e., we would like the "worst-case" performance to be reasonably robust [53].
4.2.6 Adaptivity

In the computational sense, adaptivity denotes how many processes you can run in parallel. In experimental language, it captures the penalty for sampling N sequences at once (in one batch) as opposed to sampling them sequentially. The more adaptive a process is, the fewer samples need to be taken serially, and hence fewer batches are needed.

4.2.7 Reproducibility

We would like the algorithm to be reproducible. This means that it would be reasonably robust to hyper-parameters, easy to implement, and doesn't require extraordinary computational resources to perform well.

It is noteworthy that having an approach that excels in all of these criteria is often impossible. For instance, high adaptivity may come at the cost of lower independence overall (high adaptivity can be achieved by having an excellent model of the global landscape).
5. What approaches have been proposed to address this challenge?
Due to the combinatorial explosion of the landscape size as a function of the sequence length, the brute-force approach is feasible only on tiny landscapes or well-defined subsets of the fitness landscape (e.g. all single mutants [17]). The generative process can simply enumerate all possible candidates and record nature's response. This is the only model-free algorithm with no stochastic element incorporated. All other well-known "deterministic" search schemes (e.g. BFS, DFS, beam search, ...) employed in this context make some decisions stochastically due to the high branching factor, or have to limit the search to a very small and biased sliver of the sequence space.
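For a tiny landscape, the enumeration is one line; the objective below is a toy stand-in for nature's response:

```python
import itertools

def brute_force(alphabet, L, Phi):
    """Enumerate all |alphabet|**L sequences and return the best under Phi.
    Only feasible for tiny L: 20**L explodes immediately for proteins."""
    return max(("".join(t) for t in itertools.product(alphabet, repeat=L)),
               key=Phi)

Phi = lambda s: s.count("G") - s.count("T")   # toy objective
print(brute_force("ACGT", 4, Phi))  # 'GGGG', after scoring all 256 candidates
```

At L = 4 this scores 4^4 = 256 sequences; at the 238 residues of GFP it would be 20^238, which is why everything that follows in this section exists.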
On the other end of the spectrum, we can generate samples randomly. This is the equivalent of the Bartel and Szostak experiment [15], where the generator is a random sequence generator and the critique is nature. Given the very low success rate of the generator, this approach is impractical for optimization unless the problem is easy or the throughput is overwhelming.
Nature's way of exploring fitness landscapes is by evolution. Evolution's generator is random perturbation of the population that has already been successful and survived so far, so while locally (almost) random, it contains a lot of information accumulated during the course of evolutionary history (this information is used in many unsupervised algorithms). Once sequences are generated through this process, they are directly passed to nature to decide what survives.

There are two famous models of evolutionary processes, the Wright-Fisher process and the Moran process [54]. In the Wright-Fisher process, at each step N offspring are sampled by selecting a parent with probability proportional to their fitness, and replicating them (with some mutation rate µ). The entire new generation is made up of the offspring produced in the previous step, hence the generations are non-overlapping.

The Moran process [54] differs in that at each step only one member of the current generation is sampled with probability proportional to their fitness and replicated, and one is removed at random. Hence the generations (every N replications) for the Moran process are overlapping. This results in a faster adaptation rate, as offspring in a given generation may descend from other high-fitness offspring within that same generation.

In both the Wright-Fisher and the Moran process, it is also possible to recombine with other offspring (or parents). This is in principle controlled by two parameters: (i) r, which denotes the probability of a cross-over between two sequences already chosen to be recombining, and (ii) ρ, which is the number of recombinations per offspring.
It is common to set ρ = 1, which enforces sex to be between only two parents (this is not the case with some viral reproduction and DNA shuffling approaches). For comparison to batched experiments, the Wright-Fisher process appears the more suitable benchmark; however, we expect natural populations to adapt at a rate closer to that of the Moran process (which is faster).

A more subtle aspect of fitness landscapes is the resolution at which two different traits would be considered different enough from evolution's perspective. Assuming a trait value y, we define the probability of sampling a mutant i for reproduction as p_i = e^{βy_i} / Σ_j e^{βy_j}, where β is the intensity of selection. Hence the relative fitness can be written as e^{β(y_i − y_j)}, and the minimum ∆_{ij}y = y_i − y_j for which selection overtakes genetic drift (effects from random sampling, independent of the phenotype) as the primary force is approximately (1/β) log(1 + 1/N). In other words, having a larger population (or stronger selection intensity) results in a landscape with higher resolution, where a smaller difference in the trait value would be effectively "seen" by evolution.

Hence, evolutionary processes like the Wright-Fisher can be simulated with four parameters: population size N, mutation rate µ (which we define as the probability of an edit to a base or amino acid), recombination rate r (implicitly ρ = 1), and selection strength β. Evolutionary processes are often simulated in the limit where selection is strong and mutations occur infrequently (known as the strong selection, weak mutation limit, SSWM), such that the population is monomorphic and updates one variant at a time (a variant appears and is selected to fixation, rather than having many competing variants in the population at once). This simplifies theoretical analysis significantly.
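The four-parameter Wright-Fisher process described above can be sketched in a few lines. This is a minimal, illustrative simulation: the softmax selection over βy follows the sampling probability defined above, while the alphabet, sequence length, and toy fitness function (counting "A" characters) are our own stand-ins:

```python
import math
import random

ALPHABET = "ACGT"

def wf_generation(pop, fitness, N, mu, beta, r=0.0, rng=random):
    """One non-overlapping Wright-Fisher generation.

    Parents are sampled with probability proportional to e^{beta * y},
    each offspring undergoes at most one crossover (rho = 1) with
    probability r, and every site mutates with probability mu.
    """
    ys = [fitness(s) for s in pop]
    m = max(ys)
    weights = [math.exp(beta * (y - m)) for y in ys]  # p_i ∝ e^{beta*y_i}
    offspring = []
    for _ in range(N):
        child = rng.choices(pop, weights=weights)[0]
        if rng.random() < r:  # one crossover with a second parent
            mate = rng.choices(pop, weights=weights)[0]
            cut = rng.randrange(len(child))
            child = child[:cut] + mate[cut:]
        # per-site mutation with probability mu
        child = "".join(
            rng.choice(ALPHABET) if rng.random() < mu else c for c in child
        )
        offspring.append(child)
    return offspring

rng = random.Random(0)
pop = ["".join(rng.choice(ALPHABET) for _ in range(8)) for _ in range(50)]
for _ in range(30):
    pop = wf_generation(pop, lambda s: s.count("A"), N=50, mu=0.01,
                        beta=2.0, r=0.1, rng=rng)
mean_fit = sum(s.count("A") for s in pop) / len(pop)
```

After a few dozen generations the mean fitness rises well above the random-sequence baseline, illustrating how the generator (mutation and recombination) and critique (fitness-proportional sampling) interact.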
However, this setting is unsuitable for measures like batch efficiency, as many generations are needed to achieve a high fitness, though it is relatively sample efficient if you only care about unique samples tested.

Natural populations face the prospect of extinction. As a result, populations cannot tolerate very high mutation rates. For example, small populations can quickly run out of good variants if the mutation rate is too high, resulting in further shrinkage of the population and exacerbation of the problem [55]. Even for large populations (e.g. viral populations), natural organisms show mutation rates below what is known as the error threshold, defined as µ < 1/L, where L is the size of the genome [56]. But in our fixed population size models with no minimum fitness criteria, something will eventually survive, and hence it is noteworthy that optimal mutation rates in these conditions are possibly above the error threshold. Aside from these limitations, evolutionary processes in nature often benefit from achieving extremely large N, something that is far more limited in all the following strategies. (Note that in nature, the census population size is different from, and larger than, the equivalent effective population size N that we use in the WF process, due to factors like non-random mating.)

Classic directed evolution techniques take the principles of evolutionary search described above and apply them to problems of interest in the lab. However, they also eliminate the risk of extinction, which permits much more aggressive diversification (e.g. ρ ≫ 1 and µ ≫ 1/L), as we get second attempts if a population of mutants is completely dead. On the critique side, they often bias and enhance the selective force β toward a particular phenotype of interest.
This is a double-edged sword: on one hand it allows the power of evolution to be focused on a trait of interest; on the other hand it may result in deterioration of secondary traits that are also important, and multi-objective directed evolution is challenging (see [57] for a modern framework towards this).

While this family of approaches has been extremely successful (and Frances Arnold won a Nobel prize for pioneering them [58]), it has been clear for a while that, at least in the choice of what population to use as parents, augmenting them with a model (critique) can improve the efficacy of the process [49, 59, 51].

While the baselines introduced above have been the state of the art for many decades, with the advent of DNA synthesis technologies and the rapid increase in the accessibility of machine learning, there is a lot of activity in the field of machine learning for protein engineering [71, 51]. The ambition is to surpass evolutionary search in how quickly we can find good solutions [10], and to further decrease the time and effort required to do so (note that all "model-free" approaches can be applied to optimise on Φ′ instead of Φ).

Among the first studies that incorporated the entirety of the generate-model-explore framework was a series of papers by R. J. Fox and colleagues [60, 59, 61], which first developed the concepts on NK-landscapes and then tried them on empirical ones. In these papers, the authors used in vitro and in silico genetic algorithms (random mutation and recombination) as their baseline generator, in combination with Partial Least Squares (PLS) regression. In subsequent rounds they used the weights of their model to inform the synthesis of mutants; hence, the model was employed on both the generator and the critique side. Evolutionary algorithms are often a great choice for optimising on the landscape estimate. They can also be used in tandem with simple models (e.g. informed by structure [62]).
The field of genetic and evolutionary algorithms is massive, and applied in many contexts [63, 64, 65]. We encourage the reader to explore the resources on quality-diversity algorithms linked in section 7. However, most of these efforts have been made outside sequence design and have not been re-applied to biology itself, partially because they historically preceded the technical ability for direct synthesis. We have revisited this gap recently, and shown that simple and scalable evolutionary algorithms are competitive with more modern approaches in terms of performance, robustness, and consistency [66]. These results suggest that evolutionary algorithms are still a benchmark to consider for model-guided sequence design.
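To make the generator-critique pairing concrete, the sketch below shows a minimal model-guided evolutionary loop: mutants of the best measured sequences are ranked by a surrogate model Φ′ before spending the experimental batch on Φ. This is a simplified, hypothetical scheme (not the algorithm of [66]); the toy landscape counts "A" characters and a noisy copy of Φ stands in for a trained model:

```python
import random

def mutate(seq, rng, alphabet="ACGT"):
    """Single random substitution at a random position."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(alphabet) + seq[i + 1:]

def model_guided_evolution(parents, phi, phi_hat, batch, rounds, rng):
    """Each round: propose many mutants, keep the `batch` best under the
    surrogate phi_hat, then 'measure' only those on the ground truth phi."""
    measured = {s: phi(s) for s in parents}
    for _ in range(rounds):
        pool = {mutate(s, rng) for s in measured for _ in range(20)}
        pool -= set(measured)  # never re-measure a known sequence
        proposals = sorted(pool, key=phi_hat, reverse=True)[:batch]
        for s in proposals:
            measured[s] = phi(s)  # one experimental query each
        # next round's parents are the best measured so far
        measured = dict(sorted(measured.items(), key=lambda kv: -kv[1])[:batch])
    return max(measured.values())

rng = random.Random(1)
phi = lambda s: s.count("A")                    # toy ground truth
phi_hat = lambda s: phi(s) + rng.gauss(0, 0.5)  # imperfect surrogate
start = ["".join(rng.choice("ACGT") for _ in range(10)) for _ in range(5)]
best = model_guided_evolution(start, phi, phi_hat, batch=10, rounds=20, rng=rng)
```

The surrogate filters candidates so that experimental queries to Φ are spent only on the most promising mutants, which is exactly the efficiency argument made above for augmenting directed evolution with a model.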
Sequential learning algorithms were developed in contexts where decisions have to be made based on partial observation of the data, and where choices may further inform the algorithm about the problem. In this online setting, the agent (explorer) needs to navigate the trade-off between exploiting the current best known solutions and exploring new solutions with the hope of improving on the best solution.

Among the most applied approaches adapted from this class of algorithms is the framework of Bayesian Optimisation (BO). BO algorithms are designed to optimize (possibly non-differentiable) black-box functions that are expensive to query (which is the setting we are in). Importantly, these algorithms make use of the uncertainty of model estimates to negotiate exploration vs. exploitation. As the algorithm starts out with no knowledge about the space we are trying to optimize on, a prior over the set of possible objective functions is assumed (hence Bayesian). The goal is then to update the prior belief with measurements to obtain a posterior distribution over the set of functions. The function that approximates the posterior distribution of y is known as the surrogate model (a more informative Φ′). Most BO approaches employ Gaussian Processes as their surrogate (see this tutorial: [67]).

An important aspect of optimization in this setting is deciding how to use uncertainty to prioritize measurements. In BO, this is termed the "acquisition function". Some widely used acquisition functions are (i) Expected Improvement (EI), which picks the sample with the highest expected improvement over the current best sample, and (ii) Upper Confidence Bound (UCB), where the expected reward is augmented with an additional "optimism" term, for instance µ_r + kσ_r, where µ_r is the mean posterior, σ_r is the standard deviation of the posterior, and k is some scalar.
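For Gaussian posteriors, both acquisition functions have closed forms; a minimal sketch (the candidate posteriors are made-up numbers for illustration):

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def ucb(mu, sigma, k=2.0):
    """Upper Confidence Bound: mean posterior plus an optimism bonus."""
    return mu + k * sigma

def expected_improvement(mu, sigma, best_y):
    """EI of a Gaussian posterior N(mu, sigma^2) over the incumbent best_y."""
    if sigma == 0:
        return max(0.0, mu - best_y)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm_cdf(z) + sigma * norm_pdf(z)

# Hypothetical candidate posteriors (mu, sigma); incumbent best is 1.0.
candidates = [(0.9, 0.05), (0.8, 0.5), (1.05, 0.01)]
best_by_ucb = max(candidates, key=lambda ms: ucb(*ms))
best_by_ei = max(candidates, key=lambda ms: expected_improvement(*ms, 1.0))
```

Here both criteria prefer the uncertain candidate (µ = 0.8, σ = 0.5) over a slightly-better-than-incumbent but near-certain one, illustrating how uncertainty is used to drive exploration.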
Samples in UCB are then chosen as those with the highest value of µ_r + kσ_r.

In a pioneering study, Romero and colleagues [49] demonstrated the use of BO for protein engineering. Many productive efforts have followed since [68, 69, 70] and are well covered in this review [71]. While BO is a principled approach for the optimization of black-box functions, it scales poorly in high-dimensional and high-throughput domains, such as those frequently encountered in sequence design. Specifically, how to define the domain of optimization, adapt to the batch setting, represent sequence space, and choose good hyper-parameters can be challenging, and these choices differ significantly between experiments. Successful applications of BO to sequence design often require some domain art.

In the recent decade, another domain of study with applications to black-box optimization has flourished: Reinforcement Learning (RL). RL algorithms learn to perform tasks by experience, hence their success often depends on whether interactions with the environment are cheap, or whether there is a good simulator of the environment in which they can practice (e.g. chess). The "agent" that is guided by the algorithm interacts with the environment (or simulator) and observes rewards for taking different actions; over time, the agent learns to take actions with better reward. In our setting, however, good simulators often don't exist, and sampling the environment directly is very expensive.

One exception is that of RNA secondary structure, for which good simulators exist (e.g. [43]). Eastman et al [72] take advantage of this to train reinforcement learning agents that are able to fold RNA sequences into particular secondary structures that are challenging to achieve. Most standard approaches to this problem involve stochastic search (generating random perturbations of current candidates), which results in inefficient sampling and reduced performance.
Eastman et al instead use a graph-convolutional neural network to propose samples based on the network's knowledge of the simulated RNA-folding data, and train it with a reinforcement learning algorithm known as the Asynchronous Advantage Actor-Critic algorithm (A3C).

Another promising approach is to build locally accurate simulators and use them to train an RL agent. Angermueller et al [73] train a policy network (a network that decides what mutations to make) by simulating the fitness landscape through an ensemble of models. The models are trained on the data measured so far, and those that achieve high R² in cross-validation are selected as "simulators". Agents are trained within this simulator up to a certain distance (determined by uncertainty in model estimates) from where data exists. Additionally, to increase the diversity of proposed sequences, they add a penalty for proposing samples that are close to previously proposed samples (the closer they are, the higher the penalty).

The drawback of reinforcement learning approaches is the high computational (and domain expertise) cost of using these algorithms. In particular, PPO and TRPO algorithms are known to be highly implementation-sensitive [74].

Biological sequences show a significant degree of regularity (e.g. motifs appear in sequences, which result in structural or functional properties).
Hence, in principle, the space of sequences that we are interested in can be compressed into a different representation, ideally a continuous one (an "embedding"), where we can use gradient-based optimisation techniques, or at least reduce the dimensionality of the data. Due to intense interest, these embedding methods are rapidly evolving, as they are also suitable for "semi-supervised" settings, where the embedding is learned using unlabeled natural sequences, and subsequently supervised models are trained on the smaller (now embedded) dataset, with better performance.

Schemes to embed biological sequences fall into several general categories, notably VAEs [75, 76, 47, 77, 78] and invertible generative models (e.g. Flows or real-NVPs) [79], as well as a plethora of increasingly promising models that have been adapted from natural language processing [80, 8, 48, 81, 82]. However, the exploration strategies implemented on these learned embeddings follow familiar Monte-Carlo, hill-climbing, or Bayesian Optimisation schemes. That is, these embeddings simplify the exploration problem itself, rather than employing a complex optimisation on the original sequence space.
The use of generative models to propose sequences with better properties is a natural route towards achieving the objective of producing diverse sequences that we did not observe within the training set. We highlight some of these approaches below.

Brookes and Listgarten [83] approach the exploration problem by carefully pairing a generative model with a regressor (oracle) that guides the generator towards its own optima. Their first algorithm, titled Design by Adaptive Sampling (DbAS), assumes access to a static Φ′. The algorithm works by training a generative model G_θ on a set of sequences x, and generating a set of proposal sequences x̂ ∼ G_θ. They then use Φ′ to filter x̂ for the high-performing sequences, retrain G_θ, redraw samples, and iterate until convergence. This scheme is identical to the cross-entropy method, an important optimization scheme [84], with a VAE as the generative model (although another generative model could be used). Notably, the oracle is not updated during the process; out of the box, their process is described with two rounds of experiments in mind (a training set to make an oracle, and a resulting set proposed by the generative model), where they maximize the potential gains from their oracle, given what it already knows. While it is trivial to repeat the process for multiple rounds, the process can be improved by incorporating information about how many rounds it will be used for.

In follow-up work [6], Brookes et al aim to improve the robustness of DbAS by introducing CbAS. This is meant to address the pitfall in which Φ′ is biased and gives poor estimates outside its training domain. The authors enforce a soft pessimism penalty for samples that are very distinct from those that the oracle could have possibly learned from.
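The core DbAS loop is the cross-entropy method. The sketch below is a minimal illustration with a per-position frequency model (a PWM) standing in for the VAE, and a toy oracle that counts "A" characters; only the sample-filter-refit structure is taken from DbAS, the rest is our simplification:

```python
import random

ALPHABET = "ACGT"

def fit_pwm(seqs, pseudo=0.5):
    """'Generative model': independent per-position letter frequencies."""
    L = len(seqs[0])
    pwm = []
    for i in range(L):
        counts = {a: pseudo for a in ALPHABET}
        for s in seqs:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({a: c / total for a, c in counts.items()})
    return pwm

def sample_pwm(pwm, rng):
    return "".join(
        rng.choices(ALPHABET, weights=[col[a] for a in ALPHABET])[0]
        for col in pwm
    )

def dbas(train, oracle, rounds=20, n=200, quantile=0.2, rng=random):
    """Sample, keep the oracle's top quantile, refit, repeat.

    The oracle is static throughout, matching the two-round setting
    described above in which it is never retrained."""
    elite = list(train)
    for _ in range(rounds):
        pwm = fit_pwm(elite)
        samples = [sample_pwm(pwm, rng) for _ in range(n)]
        samples.sort(key=oracle, reverse=True)
        elite = samples[: max(1, int(n * quantile))]
    return elite

rng = random.Random(0)
oracle = lambda s: s.count("A")  # stand-in for a trained regressor
train = ["".join(rng.choice(ALPHABET) for _ in range(12)) for _ in range(50)]
designs = dbas(train, oracle, rng=rng)
mean_y = sum(oracle(s) for s in designs) / len(designs)
```

Because the static oracle is trusted everywhere, the loop happily drifts far from the training data, which is precisely the failure mode CbAS's reweighting (discussed next) is designed to temper.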
Specifically, they modify the DbAS paradigm such that as the generator updates its parameters θ → θ_t while training on samples in the tail of the distribution, it weights each sample x_i by P(x_i | G; θ) / P(x_i | G; θ_t). In other words, if the generative model that was trained on the original data was more enthusiastic about a sample than the one that has been updated according to the oracle's recommendations, that data point is up-weighted in the next training round (and vice versa).

Feedback-GANs (FBGAN) [85] approach sequence design using a GAN, which consists of a generator G and a discriminator D. In a standard GAN, the generator's objective is to use a latent code z (often white noise) to produce samples x that are close to the training data. The discriminator's objective is to discern whether x was generated by G or came from the ground truth distribution R (i.e. D(x_i) = P(x_i ∈ R)). By jointly training G and D, one can generate samples similar to those in the training set. Note, however, that the discriminator does not necessarily provide an estimate of y. To allow for optimizing y, the authors pair the GAN with an oracle Φ′. At each epoch, multiple samples generated by G are passed through the oracle to acquire their label. Sequences with y > T (e.g. the top quantile) are then presented to the discriminator for training. This ensures that over time, both D and G are biased towards proposing high-performing samples.

Killoran et al [86] pursue the same objective by regularizing an "inverted model", Φ′⁻¹(y) = x, through a generative adversarial network G. Activity maximization (AM) is a process in which the input x is perturbed in small steps such that it improves the objective y, i.e. x^(t+1) = x^(t) + α∇_x y (α is the step size).
For sequences, due to their non-continuous nature, this involves continuous relaxation schemes for the input, casting sequences as probabilities: going from one-hot representations to position weight matrices (PWMs). Activity maximization comes with potential drawbacks, however: (i) it can produce non-realistic sequences that do not look like those in the training data, and (ii) it can be computationally expensive. The authors address the first problem by pairing the model with a generator (in their case a Wasserstein GAN) that is trained to produce samples similar to those that have the desired properties in the training set. The WGAN accepts a latent code z_i to generate a sequence x_i ∼ G(z_i), which is then passed to Φ′(x_i) = y. The optimization recipe then becomes z^(t+1) = z^(t) + α∇_z y, where

∇_z y = Σ_d (∂y/∂x_d)(∂x_d/∂z),

with d indexing the dimensions of x.

Deep Exploration Networks (DENs) [87] take a similar architecture to Killoran et al but focus on optimizing the generator G only. They assume access to a pre-trained oracle Φ′ that they do not optimize during training (i.e. the oracle training time is an offline overhead, which saves a lot of time compared to computing AM online). The main innovation of this work is to force the generator to compete with itself to maintain diversity: a known pathology of GANs is to ignore the latent code and undergo "mode collapse" (where the generator simply produces one type of sequence). To achieve this, the authors provide distinct latent codes z_1, z_2 to G, and penalize outputs x_1 ∼ G(z_1), x_2 ∼ G(z_2) that are similar. The cost function for this approach is defined as:

C = C_objective(Φ′(G(z_1))) + C_diversity(G(z_1), G(z_2)),

which they seek to minimize by training G. This approach circumvents some of the drawbacks of AM, notably the computational cost of producing sequences.
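The latent-space ascent z^(t+1) = z^(t) + α∇_z y can be illustrated with a toy differentiable generator and oracle in place of the WGAN and trained regressor; here both are simple analytic functions (x_d = tanh(W_d·z), y = v·x) so the chain-rule gradient above can be written out by hand:

```python
import math
import random

def generator(z, W):
    """Toy differentiable 'generator': x_d = tanh(sum_j W[d][j] * z_j)."""
    return [math.tanh(sum(w * zj for w, zj in zip(row, z))) for row in W]

def oracle(x, v):
    """Toy differentiable 'oracle': y = sum_d v_d * x_d."""
    return sum(vd * xd for vd, xd in zip(v, x))

def latent_ascent(z, W, v, alpha=0.05, steps=300):
    """z <- z + alpha * grad_z y, with the chain rule
    grad_z y = sum_d (dy/dx_d)(dx_d/dz) computed analytically."""
    for _ in range(steps):
        x = generator(z, W)
        # dy/dx_d = v_d ;  dx_d/dz_j = (1 - x_d^2) * W[d][j]
        grad = [
            sum(v[d] * (1 - x[d] ** 2) * W[d][j] for d in range(len(W)))
            for j in range(len(z))
        ]
        z = [zj + alpha * g for zj, g in zip(z, grad)]
    return z

rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
v = [rng.uniform(-1, 1) for _ in range(5)]
z0 = [0.0, 0.0, 0.0]
y0 = oracle(generator(z0, W), v)
z1 = latent_ascent(z0, W, v)
y1 = oracle(generator(z1, W), v)
```

Because the search runs in the low-dimensional latent space and the generator constrains its outputs, the optimized sample stays within the generator's range, which is the mechanism the WGAN pairing exploits to avoid unrealistic sequences.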
A simpler architecture with fast convergence has been proposed in follow-up work [88].

Of course, one is not restricted to using only one of the methods above to propose sequences. Angermueller et al [89] propose a population-based method (P3BO) that ensembles exploration strategies together. The idea is very simple: at first, many exploration strategies are given equal shares of the budget for each batch. Subsequently, given the performance of the sequences each has proposed, the budget allocated to each method is updated, rewarding algorithms with better proposals. They show that this ensembling generally outperforms baselines that only include single methods, both in terms of diversity and optimality.

Another innovation that can be thought of as a meta-algorithm, applicable to many generative exploration strategies, is termed auto-focusing [90]. A model is trained on some collected data, and new samples are desired for the next batch that optimize the properties of the input. Often, as discussed in this primer, the trained model is misspecified and becomes inaccurate as the inputs deviate from those present in the training data. Search algorithms discussed above that query such models define a "trust region" (which can be soft or hard) and only accept designed inputs if they fall within that region. Auto-focusing is a recipe to instead retrain the predictive model using importance weighting, coupled with updates to the parameters of the search model (i.e. the generative explorer) in lock-step, to "focus" the model on the region of the search space that the explorer is visiting, thereby improving the model's accuracy (reducing its gap from the ground truth) in the region that the explorer is sampling.
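The budget-reallocation idea behind ensembling exploration strategies can be sketched in a few lines. P3BO's actual credit-assignment rule differs in detail; the softmax update and the score values below are illustrative stand-ins:

```python
import math

def reallocate(shares, scores, temp=1.0):
    """Shift batch shares toward strategies whose recent proposals scored
    better (an illustrative softmax credit assignment, not P3BO's exact
    rule)."""
    m = max(scores)
    ws = [s * math.exp((sc - m) / temp) for s, sc in zip(shares, scores)]
    total = sum(ws)
    return [w / total for w in ws]

# Three hypothetical strategies with equal initial shares of the batch,
# and the mean fitness of each strategy's proposals in the last round:
shares = [1 / 3, 1 / 3, 1 / 3]
scores = [0.2, 0.9, 0.5]
shares = reallocate(shares, scores)
```

After one update, the best-performing strategy receives the largest share of the next batch while weaker strategies retain a nonzero share, preserving some diversity of proposal mechanisms.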
In this section, we list some considerations, occasionally ignored in studies, that have significance for evaluating and designing algorithms that can be used in practice.
Ground-truth oracles that are only proxies of the ground truth (e.g. a machine learning model of the landscape) can behave pathologically outside their training domain [91, 6, 73]. Algorithms that explore spaces built on these models can overfit to such landscapes: the biases introduced by these oracles can make an algorithm appear good in evaluation when it is in fact merely a good optimizer of the model. It is hence advisable to use ground-truth oracles with consistent behavior across the entire exploration domain [66]. When this is not possible, a compromise is to ensure that the algorithm's internal model is not from the same class as that of the surrogate oracle [67].
When synthesis is the way in which sequences are generated, populations of sequences shouldn't contain duplicates, either within or between generations (except for experimental controls): experimentalists don't need to re-measure sequences once they have a good measurement. Allowing duplicates can artificially inflate diversity and mean-fitness measures. Natural and model-free directed evolution benchmarks are an exception.
A challenge that sequence design algorithms face in practical settings is that generating samples with aggressive changes may result in a batch of completely non-functional sequences. This is costly and adds little information to the models for improving the guesses in the next round. However, when test landscapes contain gradients everywhere, mistakes are not as costly, and algorithms can always re-initiate at a random place and optimize successfully. Testing only on landscapes of this second kind can give results that do not generalize to more realistic landscapes where vast areas of the sequence space contain no gradients (we term these "swamplands", but they have also been described as "holey" [49]).
This is related to the previous pitfall. Many realistic landscapes are swamplands, where random samples of sequences will not exhibit any function at all [15]. In such landscapes it is obvious to the designer that they need to start at sequences that already show some desired activity. However, this means that those sequences may already be at or near a peak, and algorithms that are very good hill climbers may not be great at escaping local peaks. Hence, even when landscapes have gradients everywhere, it is advisable to test the algorithm on both high-performing starting samples and random (often poor) samples.
As discussed in section 5.1.3, the selective pressure and population size are highly important for the efficiency of evolutionary exploration. In particular, for any given population size, the fitness landscape is coarse-grained with respect to the minimum difference in fitness ∆y under which selection overtakes drift. In other words, evolution cannot "see" the difference between mutants with ∆y ≪ (1/β) log(1 + 1/N). For this reason, one measure of efficiency or optimality can be to state how fast a population reaches the same fitness that an evolving population with size N and selection intensity β would.
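Under the softmax fitness model introduced earlier, selection overtakes drift roughly when the relative fitness e^{β∆y} exceeds the 1 + 1/N drift scale, giving a resolution threshold of approximately log(1 + 1/N)/β. A small numeric sketch:

```python
import math

def drift_threshold(N, beta):
    """Minimum trait difference that selection can 'see' over drift:
    Delta_y ≈ log(1 + 1/N) / beta, i.e. the point where relative
    fitness e^{beta * Delta_y} matches the 1 + 1/N drift scale."""
    return math.log(1 + 1 / N) / beta

# Larger populations (or stronger selection) resolve smaller differences:
small = drift_threshold(N=100, beta=1.0)
large = drift_threshold(N=10_000, beta=1.0)
strong = drift_threshold(N=100, beta=10.0)
```

Increasing N from 100 to 10,000 shrinks the threshold by roughly two orders of magnitude, and increasing β has the same directional effect, matching the coarse-graining argument above.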
6. Conclusions
Sequence design with the aid of machine learning is a booming field that spans multiple disciplines. The topics covered in this primer are meant to educate and help initiate research from experts and beginners across these fields, and to ease their introduction to this topic. This primer does not paint a full picture; in particular, it is light in its treatment of computational protein design. Additionally, given the rapid expansion of the field, we did not attempt to cover all recent progress in this domain. However, we hope this resource removes some of the barriers for those entering the field. Fortunately, there is a lot still to be done, and we look forward to the upcoming progress.

7. Further resources

• A list of quality-diversity algorithms: https://quality-diversity.github.io/papers

• A regularly updated repository of papers related to model-guided protein design: https://github.com/yangkky/Machine-learning-for-proteins

• A sandbox for evaluating sequence design algorithms: https://github.com/samsinai/FLEXS
8. Acknowledgements
We thank Jeff Gerold, Lauren Wheelock, Carl Veller, Gleb Kuznetsov, Surge Biswas, Martin Nowak, George Church, and members of Dyno Therapeutics for helpful discussions.

References

[1] Sewall Wright. The roles of mutation, inbreeding, crossbreeding, and selection in evolution, volume 1. na, 1932.

[2] J Arjan GM De Visser and Joachim Krug. Empirical fitness landscapes and the predictability of evolution. Nature Reviews Genetics, 15(7):480, 2014.

[3] Inna S Povolotskaya and Fyodor A Kondrashov. Sequence space and the ongoing expansion of the protein universe. Nature, 465(7300):922, 2010.

[4] Karen S Sarkisyan, Dmitry A Bolotin, Margarita V Meer, Dinara R Usmanova, Alexander S Mishin, George V Sharonov, Dmitry N Ivankov, Nina G Bozhanova, Mikhail S Baranov, Onuralp Soylemez, et al. Local fitness landscape of the green fluorescent protein. Nature, 533(7603):397, 2016.

[5] Surojit Biswas, Gleb Kuznetsov, Pierce J Ogden, Nicholas J Conway, Ryan P Adams, and George M Church. Toward machine-guided design of proteins. bioRxiv, page 337154, 2018.

[6] David H Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. arXiv preprint arXiv:1901.10060, 2019.

[7] Jakub Otwinowski, David Martin McCandlish, and Joshua Plotkin. Inferring the shape of global epistasis. bioRxiv, page 278630, 2018.

[8] Ethan C Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M Church. Unified rational protein engineering with sequence-only deep representation learning. bioRxiv, page 589333, 2019.

[9] Surojit Biswas, Grigory Khimulya, Ethan C Alley, Kevin M Esvelt, and George M Church. Low-N protein engineering with data-efficient deep learning. bioRxiv, 2020.

[10] Eric D Kelsic and George M Church. Challenges and opportunities of machine-guided capsid engineering for gene therapy. Cell and Gene Therapy Insights, 2019.

[11] J Arjan GM de Visser, Santiago F Elena, Inês Fragata, and Sebastian Matuszewski. The utility of fitness landscapes and big data for predicting evolution, 2018.

[12] Sergey Kryazhimskiy, Daniel P Rice, Elizabeth R Jerison, and Michael M Desai. Global epistasis makes adaptation predictable despite sequence-level stochasticity. Science, 344(6191):1519–1522, 2014.

[13] Claudia Bank, Sebastian Matuszewski, Ryan T Hietpas, and Jeffrey D Jensen. On the (un)predictability of a large intragenic fitness landscape. Proceedings of the National Academy of Sciences, 113(49):14085–14090, 2016.

[14] Lynn Margulis and Dorion Sagan. Microcosmos: Four Billion Years of Microbial Evolution. Univ of California Press, 1997.

[15] David P Bartel and Jack W Szostak. Isolation of new ribozymes from a large pool of random sequences. Science, 261:1411–1418, 1993.

[16] Louis du Plessis, Gabriel E Leventhal, and Sebastian Bonhoeffer. How good are statistical models at approximating complex fitness landscapes? Molecular Biology and Evolution, 33(9):2454–2468, 2016.

[17] Pierce J Ogden, Eric D Kelsic, Sam Sinai, and George M Church. Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design. Science, 366(6469):1139–1143, 2019.

[18] Herbert S Wilf and Warren J Ewens. There's plenty of time for evolution. Proceedings of the National Academy of Sciences, 107(52):22454–22456, 2010.

[19] Atish Agarwala and Daniel S Fisher. Adaptive walks on high-dimensional fitness landscapes and seascapes with distance-dependent statistics. Theoretical Population Biology, 130:13–49, 2019.

[20] Artem Kaznatcheev. Computational complexity as an ultimate constraint on evolution. Genetics, 212(1):245–265, 2019.

[21] Krishnendu Chatterjee, Andreas Pavlogiannis, Ben Adlam, and Martin A Nowak. The time scale of evolutionary innovation. PLoS Computational Biology, 10(9), 2014.

[22] Leslie G Valiant. Evolvability. Journal of the ACM (JACM), 56(1):1–21, 2009.

[23] Artem Kaznatcheev. Evolution is exponentially more powerful with frequency-dependent selection. bioRxiv, 2020.

[24] Michael J McDonald, Daniel P Rice, and Michael M Desai. Sex speeds adaptation by altering the dynamics of molecular evolution. Nature, 531(7593):233–236, 2016.

[25] Sam Sinai, Jason Olejarz, Iulia A Neagu, and Martin A Nowak. Primordial sex facilitates the emergence of evolution. Journal of The Royal Society Interface, 15(139):20180003, 2018.

[26] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[27] Benjamin H Good, Michael J McDonald, Jeffrey E Barrick, Richard E Lenski, and Michael M Desai. The dynamics of molecular evolution over 60,000 generations. Nature, 551(7678):45, 2017.

[28] Sandeep Venkataram, Barbara Dunn, Yuping Li, Atish Agarwala, Jessica Chang, Emily R Ebel, Kerry Geiler-Samerotte, Lucas Hérissant, Jamie R Blundell, Sasha F Levy, et al. Development of a comprehensive genotype-to-fitness map of adaptation-driving mutations in yeast. Cell, 166(6):1585–1596, 2016.

[29] Luis A Barrera, Anastasia Vedenko, Jesse V Kurland, Julia M Rogers, Stephen S Gisselbrecht, Elizabeth J Rossin, Jaie Woodard, Luca Mariani, Kian Hong Kock, Sachi Inukai, et al. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science, 351(6280):1450–1454, 2016.

[30] Frank J Poelwijk, Michael Socolich, and Rama Ranganathan. Learning the pattern of epistasis linking genotype and phenotype in a protein. Nature Communications, 10(1):1–11, 2019.

[31] José Aguilar-Rodríguez, Joshua L Payne, and Andreas Wagner. A thousand empirical adaptive landscapes and their navigability. Nature Ecology & Evolution, 1(2):1–9, 2017.

[32] Zachary R Sailer and Michael J Harms. Detecting high-order epistasis in nonlinear genotype-phenotype maps. Genetics, 2017.

[33] Thomas A Hopf, John B Ingraham, Frank J Poelwijk, Charlotta PI Schärfe, Michael Springer, Chris Sander, and Debora S Marks. Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35(2):128, 2017.

[34] Jakub Otwinowski and Joshua B Plotkin. Inferring fitness landscapes by regression produces biased estimates of epistasis. Proceedings of the National Academy of Sciences, page 201400849, 2014.

[35] Zachary R Sailer and Michael J Harms. Uninterpretable interactions: epistasis as uncertainty. bioRxiv, page 378489, 2018.

[36] Ammar Tareen, William Thornton Ireland, Anna Posfai, David Martin McCandlish, and Justin Block Kinney. MAVE-NN: quantitative modeling of genotype-phenotype maps as information bottlenecks. bioRxiv, 2020.

[37] Abe D Pressman, Ziwei Liu, Evan Janzen, Celia Blanco, Ulrich F Mueller, Gerald F Joyce, Robert Pascal, and Irene A Chen. Mapping a systematic ribozyme fitness landscape reveals a frustrated evolutionary network for self-aminoacylating RNA. Journal of the American Chemical Society, 141(15):6213–6223, 2019.

[38] Debora S Marks, Lucy J Colwell, Robert Sheridan, Thomas A Hopf, Andrea Pagnani, Riccardo Zecchina, and Chris Sander. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE, 6(12), 2011.

[39] Magnus Ekeberg, Cecilia Lövkvist, Yueheng Lan, Martin Weigt, and Erik Aurell. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Physical Review E, 87(1):012707, 2013.

[40] Jaclyn K Mann, John P Barton, Andrew L Ferguson, Saleha Omarjee, Bruce D Walker, Arup Chakraborty, and Thumbi Ndung'u. The fitness landscape of HIV-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS Computational Biology, 10(8), 2014.

[41] Edward D Weinberger. Fourier and Taylor series on fitness landscapes. Biological Cybernetics, 65(5):321–330, 1991.

[42] Stuart A Kauffman and Edward D Weinberger. The NK model of rugged fitness landscapes and its application to maturation of the immune response.
Journalof theoretical biology , 141(2):211–245, 1989.[43] Ronny Lorenz, Stephan H Bernhart, Christian H¨oner Zu Siederdissen, HakimTafer, Christoph Flamm, Peter F Stadler, and Ivo L Hofacker. Viennarna pack-age 2.0.
Algorithms for molecular biology , 6(1):26, 2011.[44] Carol A Rohl, Charlie EM Strauss, Kira MS Misura, and David Baker. Proteinstructure prediction using rosetta. In
Methods in enzymology , volume 383,pages 66–93. Elsevier, 2004.[45] Andrew H. Ng, Taylor H. Nguyen, Mariana G´omez-Schiavon, Galen Dods,Robert A. Langan, Scott E. Boyken, Jennifer A. Samson, Lucas M. Waldburger,John E. Dueber, David Baker, and Hana El-Samad. Modular and tunable bio-logical feedback control using a de novo protein switch.
Nature , 2019.[46] Hao Shen, Jorge A. Fallas, Eric Lynch, William Sheffler, Bradley Parry, NicholasJannetty, Justin Decarreau, Michael Wagenbach, Juan Jesus Vicente, JiajunChen, Lei Wang, Quinton Dowling, Gustav Oberdorfer, Lance Stewart, LindaWordeman, James De Yoreo, Christine Jacobs-Wagner, Justin Kollman, andDavid Baker. De novo design of self-assembling helical protein filaments.
Sci-ence , 362(6415):705–709, 2018. 3647] Adam J Riesselman, John B Ingraham, and Debora S Marks. Deep gener-ative models of genetic variation capture mutation effects. arXiv preprintarXiv:1712.06527 , 2017.[48] Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott,C Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and func-tion emerge from scaling unsupervised learning to 250 million protein se-quences. bioRxiv , page 622803, 2019.[49] Philip A Romero, Andreas Krause, and Frances H Arnold. Navigating the pro-tein fitness landscape with gaussian processes.
Proceedings of the NationalAcademy of Sciences , 110(3):E193–E201, 2013.[50] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey.Predicting the sequence specificities of dna-and rna-binding proteins by deeplearning.
Nature biotechnology , 33(8):831, 2015.[51] Zachary Wu, SB Jennifer Kan, Russell D Lewis, Bruce J Wittmann, andFrances H Arnold. Machine learning-assisted directed protein evolution withcombinatorial libraries.
Proceedings of the National Academy of Sciences ,116(18):8852–8858, 2019.[52] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple andscalable predictive uncertainty estimation using deep ensembles. In
Advancesin neural information processing systems , pages 6402–6413, 2017.[53] Manish Purohit, Zoya Svitkina, and Ravi Kumar. Improving online algorithmsvia ml predictions. In
Advances in Neural Information Processing Systems ,pages 9661–9670, 2018.[54] Martin A Nowak.
Evolutionary dynamics . Harvard University Press, 2006.[55] Wilfried Gabriel, Michael Lynch, and Reinhard B ¨urger. Muller’s ratchet andmutational meltdowns.
Evolution , 47(6):1744–1757, 1993.3756] Manfred Eigen. Selforganization of matter and the evolution of biologicalmacromolecules.
Naturwissenschaften , 58(10):465–523, 1971.[57] Armita Nourmohammad and Ceyhun Eksin. Optimal evolutionary control forartificial selection on molecular phenotypes. arXiv preprint arXiv:1912.13433 ,2019.[58] Frances H Arnold. Design by directed evolution.
Accounts of chemical research ,31(3):125–131, 1998.[59] Richard Fox. Directed molecular evolution by machine learning and the influ-ence of nonlinear interactions.
Journal of theoretical biology , 234(2):187–199,2005.[60] Richard Fox, Ajoy Roy, Sridhar Govindarajan, Jeremy Minshull, Claes Gustafs-son, Jennifer T Jones, and Robin Emig. Optimizing the search algorithm forprotein engineering by directed evolution.
Protein engineering , 16(8):589–597,2003.[61] Richard J Fox, S Christopher Davis, Emily C Mundorff, Lisa M Newman, VesnaGavrilovic, Steven K Ma, Loleta M Chung, Charlene Ching, Sarena Tam, SheelaMuley, et al. Improving catalytic function by prosar-driven enzyme evolution.
Nature biotechnology , 25(3):338, 2007.[62] Claire N Bedbrook, Austin J Rice, Kevin K Yang, Xiaozhe Ding, Siyuan Chen,Emily M LeProust, Viviana Gradinaru, and Frances H Arnold. Structure-guidedschema recombination generates diverse chimeric channelrhodopsins.
Pro-ceedings of the National Academy of Sciences , 114(13):E2624–E2633, 2017.[63] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evo-lution strategies as a scalable alternative to reinforcement learning. arXivpreprint arXiv:1703.03864 , 2017.[64] Kalyanmoy Deb.
Multi-objective optimization using evolutionary algorithms ,volume 16. John Wiley & Sons, 2001.3865] Thomas Back.
Evolutionary algorithms in theory and practice: evolution strate-gies, evolutionary programming, genetic algorithms . Oxford university press,1996.[66] Sam Sinai, Richard Wang, Alexander Whatley, Stewart Slocum, Elina Locane,and Eric Kelsic. Adalead: A simple and robust adaptive greedy search algorithmfor sequence design. arXiv preprint arXiv:2010.02141 , 2020.[67] Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on bayesian opti-mization of expensive cost functions, with application to active user modelingand hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 , 2010.[68] Claire N Bedbrook, Kevin K Yang, Austin J Rice, Viviana Gradinaru, andFrances H Arnold. Machine learning to design integral membrane channel-rhodopsins for efficient eukaryotic expression and plasma membrane local-ization.
PLoS computational biology , 13(10):e1005786, 2017.[69] Javier Gonzalez, Joseph Longworth, David C James, and Neil D Lawrence.Bayesian optimization for synthetic gene design. arXiv preprintarXiv:1505.01627 , 2015.[70] Kevin K Yang, Yuxin Chen, Alycia Lee, and Yisong Yue. Batched stochasticbayesian optimization via combinatorial constraints design. arXiv preprintarXiv:1904.08102 , 2019.[71] Kevin K Yang, Zachary Wu, and Frances H Arnold. Machine-learning-guideddirected evolution for protein engineering.
Nature methods , 16(8):687–694,2019.[72] Peter Eastman, Jade Shi, Bharath Ramsundar, and Vijay S Pande. Solving therna design problem with reinforcement learning.
PLoS computational biology ,14(6):e1006176, 2018.[73] Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande,Kevin Murphy, and Lucy Colwell. Model-based reinforcement learning for bi-39logical sequence design. In
International Conference on Learning Representa-tions , 2020.[74] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, FirdausJanoos, Larry Rudolph, and Aleksander Madry. Implementation matters indeep RL: A case study on PPO and TRPO. In
International Conference on Learn-ing Representations , 2019.[75] Rafael G´omez-Bombarelli, Jennifer N Wei, David Duvenaud, Jos´e MiguelHern´andez-Lobato, Benjam´ın S´anchez-Lengeling, Dennis Sheberla, JorgeAguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Al´an Aspuru-Guzik. Automatic chemical design using a data-driven continuous representa-tion of molecules.
ACS central science , 4(2):268–276, 2018.[76] Sam Sinai, Eric Kelsic, George M Church, and Martin A Nowak. Variationalauto-encoding of protein sequences. arXiv preprint arXiv:1712.03346 , 2017.[77] Xinqiang Ding, Zhengting Zou, and Charles L Brooks III. Deciphering proteinevolution and fitness landscapes with latent space models.
Nature Communi-cations , 10(1):1–13, 2019.[78] Joe G Greener, Lewis Moffat, and David T Jones. Design of metalloproteins andnovel protein folds using variational autoencoders.
Scientific reports , 8(1):1–12,2018.[79] Frank No´e, Simon Olsson, Jonas K¨ohler, and Hao Wu. Boltzmann generators:Sampling equilibrium states of many-body systems with deep learning.
Sci-ence , 365(6457):eaaw1147, 2019.[80] Kevin K Yang, Zachary Wu, Claire N Bedbrook, and Frances H Arnold. Learnedprotein embeddings for machine learning.
Bioinformatics , 34(15):2642–2648,2018.[81] Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, JohnCanny, Pieter Abbeel, and Yun Song. Evaluating protein transfer learning with40ape. In
Advances in Neural Information Processing Systems , pages 9686–9698,2019.[82] Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand,Raphael R Eguchi, Po-Ssu Huang, and Richard Socher. Progen: Language mod-eling for protein generation. arXiv preprint arXiv:2004.03497 , 2020.[83] David H Brookes and Jennifer Listgarten. Design by adaptive sampling. arXivpreprint arXiv:1810.03714 , 2018.[84] Reuven Rubinstein. The cross-entropy method for combinatorial and con-tinuous optimization.
Methodology and computing in applied probability ,1(2):127–190, 1999.[85] Anvita Gupta and James Zou. Feedback GAN (FBGAN) for DNA: a novelfeedback-loop architecture for optimizing protein functions. arXiv preprintarXiv:1804.01694 , 2018.[86] Nathan Killoran, Leo J Lee, Andrew Delong, David Duvenaud, and Brendan JFrey. Generating and designing DNA with deep generative models. arXivpreprint arXiv:1712.06148 , 2017.[87] Johannes Linder, Nicholas Bogard, Alexander B Rosenberg, and Georg Seelig.A generative neural network for maximizing fitness and diversity of syntheticdna and protein sequences.
Cell Systems , 2020.[88] Johannes Linder and Georg Seelig. Fast differentiable dna and protein se-quence optimization for molecular design. arXiv preprint arXiv:2005.11275 ,2020.[89] Christof Angermueller, David Belanger, Andreea Gane, Zelda Mariet, DavidDohan, Kevin Murphy, Lucy Colwell, and D Sculley. Population-basedblack-box optimization for biological sequence design. arXiv preprintarXiv:2006.03227 , 2020. 4190] Clara Fannjiang and Jennifer Listgarten. Autofocused oracles for model-baseddesign. arXiv preprint arXiv:2006.08052 , 2020.[91] Aviral Kumar and Sergey Levine. Model inversion networks for model-basedoptimization. arXiv preprint arXiv:1912.13464arXiv preprint arXiv:1912.13464