Bayesian Optimization for Synthetic Gene Design
Javier González, Joseph Longworth, David C. James, Neil D. Lawrence
Department of Computer Science, Department of Chemical and Biological Engineering, and Sheffield Institute for Translational Neuroscience, University of Sheffield
{j.h.gonzalez, j.longworth, d.c.james, n.lawrence}@sheffield.ac.uk

Abstract
We address the problem of synthetic gene design using Bayesian optimization. The main issue when designing a gene is that the design space is defined in terms of long strings of characters of different lengths, which renders the optimization intractable. We propose a three-step approach to deal with this issue. First, we use a Gaussian process model to emulate the behavior of the cell. As inputs of the model, we use a set of biologically meaningful gene features, which allows us to define optimal gene design rules. Based on the model outputs we define a multi-task acquisition function to optimize simultaneously several aspects of interest. Finally, we define an evaluation function, which allows us to rank sets of candidate gene sequences that are coherent with the optimal design strategy. We illustrate the performance of this approach in a real gene design experiment with mammalian cells.
Synthetic biology is concerned with the design and construction of new biological elements of living systems and the re-design of existing ones for useful purposes [4]. In this context, there is current interest in the development of new methods to engineer living cells in order to produce compounds of interest [11]. A modern approach to this problem is the use of synthetic genes, which, once 'inserted' into the cells, can modify their natural behavior by activating the production of proteins useful for further pharmaceutical purposes.

We present the first approach to gene design based on Bayesian optimization (BO) principles. The BO framework [14, 3, 10, 6] allows us to explore the gene design space in order to provide rules to build genes with interesting properties, such as genes that are able to produce proteins of interest, or genes able to act on the cell lifespan. We use a Gaussian process [12] to emulate the complex behavior of the cell across the different gene designs. An acquisition function is optimized to deal with the exploration-exploitation trade-off. To provide not only rules for gene design but also concrete candidate gene sequences, we introduce the concept of an evaluation function. The goal of this function is to avoid the bottleneck of optimizing over the sequences by providing rules to rank biologically feasible genes coherent with the obtained gene design rules.

Although in this work we focus on the optimization of the translational efficiency of the cells, this framework can be generalized to multiple synthetic biology design problems, such as the optimization of media design, or the optimization of multiple gene knock-out strategies.
Figure 1: Top: Central dogma of molecular biology. Each gene contains the information to encode an mRNA molecule, which is used by the cell to produce proteins. Both mRNA molecules and proteins are produced at a certain rate. The goal is to design gene sequences able to increase these rates while encoding the same protein. Bottom left: Graph of codon redundancy. Letters inside the circle represent DNA bases and letters outside represent amino acids. The arcs outside the circle cover the paths of redundant codons. † and $\star$ represent special codons. Bottom right: Graphical model of the multi-output Gaussian process used to emulate the cell behavior in this work.

Broadly speaking, in molecular biology it is assumed that a gene contains the information to encode an mRNA molecule. The production of such molecules is called transcription and takes place in the cell nucleus at a certain rate $y_\alpha$. Later on, the mRNA molecules are used to produce proteins at a different rate $y_\beta$, in what is called the translation phase. A one-to-one correspondence between genes and proteins is assumed. Each gene, itself, is made up of a sequence $s$ of several hundreds of bases (A, T, G, C), triplets of which form the so-called codons. We can interpret the codons as the 'words' in which the genetic code is written. The 64 possible codons encode 20 amino acids, which are the fundamental elements that the cell uses to produce proteins. This means that the genetic code is redundant: the same amino acid can be encoded by different codons, and therefore there exist multiple ways of encoding the same protein. See Figure 1 (top and bottom left) for an illustration of this process. A fundamental principle of gene design is that redundant codon choices do not affect the type of protein that is being encoded, but they may affect the rates $y_\alpha$, $y_\beta$, and therefore the efficiency at which it is produced.

Consider a $p$-dimensional representation $\mathbf{x} \in \mathbb{R}^p$ of a gene sequence $s$. Such a representation will typically be the frequencies of the different codons, but it may also include other variables such as the length of the sequence, or the number of times a certain pattern is repeated across the gene. Denote by $f_\alpha, f_\beta : \mathcal{X} \to \mathbb{R}$ the functions representing the expected transcription and translation rates given a sequence with features $\mathbf{x} \in \mathcal{X}$. We want to solve the global optimization problem of finding the sequence that maximizes both rates. However, this requires optimizing across all possible sequences, which is infeasible due to the high dimensionality. Instead, we aim to solve the surrogate multi-objective problem of finding $\mathbf{x}^\star = \arg\max_{\mathbf{x} \in \mathcal{X}} \left( f_\alpha(\mathbf{x}), f_\beta(\mathbf{x}) \right)$, to later connect $\mathbf{x}^\star$ with a particular gene design.

Let $\{s_1, \ldots, s_N\}$ be a set of gene sequences. Consider a $p$-dimensional feature representation of the sequences given by $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, where $\mathbf{x}_i \in \mathbb{R}^p$. Let $\mathbf{y}_\alpha, \mathbf{y}_\beta \in \mathbb{R}^N$ be the observed transcription and translation rates. Our first goal is to learn, from a model of the combined data $\mathcal{D} = \{\mathcal{D}_\alpha, \mathcal{D}_\beta\}$, where $\mathcal{D}_\alpha = \{\mathbf{x}_i, y_{\alpha,i}\}_{i=1}^N$ and $\mathcal{D}_\beta = \{\mathbf{x}_i, y_{\beta,i}\}_{i=1}^N$, how to predict the value of the output functions $f_\alpha(\mathbf{x})$ and $f_\beta(\mathbf{x})$ at any $\mathbf{x} \in \mathbb{R}^p$. For simplicity we assume here that both rates are available for all the sequences, but this assumption can be easily relaxed.
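To make the feature map concrete, the short sketch below computes one such representation: the relative frequencies of the 64 codons plus the length of the sequence in codons. This is a minimal illustration under our own assumptions; the function name `codon_features` and this particular choice of features are ours, and the feature set used in the experiments may differ.

```python
from itertools import product

# Fixed ordering of the 64 codons so every sequence maps to the same feature layout.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_features(seq):
    """Map a DNA sequence to a feature vector x in R^p.

    Illustrative features: the 64 relative codon frequencies followed by
    the sequence length measured in codons.
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = {c: 0 for c in CODONS}
    for c in codons:
        if c in counts:          # skip triplets containing ambiguous bases
            counts[c] += 1
    n = max(len(codons), 1)
    return [counts[c] / n for c in CODONS] + [float(len(codons))]

# Example: features of a short toy sequence.
x = codon_features("ATGGCTGCAGCCTAA")
```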
Algorithm 1: Bayesian optimization for gene design

Extract features $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ from the available gene sequences $\{s_1, \ldots, s_N\}$.
Take $\mathcal{D} = \{\mathcal{D}_\alpha, \mathcal{D}_\beta\}$, where $\mathcal{D}_\alpha = \{\mathbf{x}_i, y_{\alpha,i}\}_{i=1}^N$ and $\mathcal{D}_\beta = \{\mathbf{x}_i, y_{\beta,i}\}_{i=1}^N$.
for $t = 1, 2, \ldots$ do
    Fit a multi-output GP model using $\mathcal{D}_t$.
    Obtain design rules by taking $\mathbf{x}_{t+1} = \arg\max_{\mathbf{x} \in \mathcal{X}} \mathrm{acq}(\mathbf{x} \mid \mathcal{D}_t)$.
    Generate a set of candidate gene sequences $\mathcal{S}$.
    Rank the sequences in $\mathcal{S}$ and select $s_{t+1} = \arg\min_{s \in \mathcal{S}} \mathrm{eval}(s \mid \mathbf{x}_{t+1})$.
    Run the experiment using $s_{t+1}$ and extract features $\mathbf{x}_{t+1}$ from the sequence $s_{t+1}$.
    Augment the data $\mathcal{D}_{t+1} = \{\mathcal{D}_t, (\mathbf{x}_{t+1}, (y_{\alpha,t+1}, y_{\beta,t+1}))\}$.
end for
Return: optimal gene design $s^\star$.

A Gaussian process (GP) is a stochastic process with the property that any finite-dimensional restriction is multivariate Gaussian [12]. GPs are typically used as prior distributions over functions. In the single-output case, the random variables are associated with a single process $f$ evaluated at different $\mathbf{x}$, but this can be easily generalized to multiple outputs, where the random variables are associated with different processes $\{f_l\}_{l=1}^d$. In our case we have $d = 2$, corresponding to the two rates of interest. We therefore work with the vector-valued function $\mathbf{f} := (f_\alpha, f_\beta)$, which is assumed to follow a GP, $\mathbf{f} \sim \mathcal{GP}(\mathbf{m}, \mathbf{K})$, where $\mathbf{m}$ is a 2-dimensional vector whose components are the mean functions $m_\alpha, m_\beta$ of each output and $\mathbf{K}$ is a positive semi-definite matrix-valued function that acts directly on input examples and task indices. The entry $(\mathbf{K}(\mathbf{x}, \mathbf{x}'))_{l,l'}$ of $\mathbf{K}(\mathbf{x}, \mathbf{x}')$ represents the covariance between $f_l(\mathbf{x})$ and $f_{l'}(\mathbf{x}')$. Under a Gaussian likelihood assumption, the predictive distribution for a new vector $\mathbf{x}_*$ is Gaussian, $p(\mathbf{f}(\mathbf{x}_*) \mid \mathcal{D}, \mathbf{f}, \mathbf{x}_*, \phi) = \mathcal{N}(\mathbf{f}^*(\mathbf{x}_*), \mathbf{K}^*(\mathbf{x}_*, \mathbf{x}_*))$, where $\mathbf{f}^*(\mathbf{x}_*)$ and $\mathbf{K}^*(\mathbf{x}_*, \mathbf{x}_*)$ are closed-form expressions that depend on the set of inputs $\mathbf{X}$ and the kernel $\mathbf{K}$; see [12, 1] for details. $\phi$ represents all the parameters of the kernel, which can be built following various strategies. In this work we use a combination of the linear and the intrinsic coregionalization models [16, 1]. We take $\mathbf{K}(\mathbf{X}, \mathbf{X}) = \mathbf{B}_{\mathrm{lin}} \otimes \mathbf{K}_{\mathrm{lin}}(\mathbf{X}, \mathbf{X}) + \mathbf{B}_{\mathrm{se}} \otimes \mathbf{K}_{\mathrm{se}}(\mathbf{X}, \mathbf{X})$, where $\mathbf{K}_{\mathrm{lin}}$ is a linear kernel used to account for the different levels of the rates and $\mathbf{K}_{\mathrm{se}}$ is a squared exponential kernel with a different lengthscale per dimension. $\mathbf{B}_{\mathrm{lin}}$ and $\mathbf{B}_{\mathrm{se}}$ are the coregionalization matrices, parametrized as $\mathbf{B}_{\mathrm{lin}} = \mathbf{w}_{\mathrm{lin}} \mathbf{w}_{\mathrm{lin}}^T + \kappa_{\mathrm{lin}} \mathbf{I}$ and $\mathbf{B}_{\mathrm{se}} = \mathbf{w}_{\mathrm{se}} \mathbf{w}_{\mathrm{se}}^T + \kappa_{\mathrm{se}} \mathbf{I}$, for $\mathbf{w}_{\mathrm{lin}}, \mathbf{w}_{\mathrm{se}} \in \mathbb{R}^2$, $\kappa_{\mathrm{lin}}, \kappa_{\mathrm{se}} \in \mathbb{R}$, and $\mathbf{I}$ the identity matrix of dimension 2. $\otimes$ represents the Hadamard product. See Figure 1 (bottom right) for a graphical description of the model.

In multi-task optimization problems a typical issue is dealing with potentially conflicting objectives, i.e., tasks that cannot be optimized simultaneously; in our case this means that both rates may not be maximizable with the same sequence. Following previous work on multi-task Bayesian optimization [15], here we focus on an acquisition function that maximizes the average of the tasks. The predictive mean and variance of the average objective are $\bar{m}(\mathbf{x}) = \frac{1}{2} \sum_{l = \alpha, \beta} f^*_l(\mathbf{x})$ and $\bar{\sigma}^2(\mathbf{x}) = \frac{1}{4} \sum_{l = \alpha, \beta} \sum_{l' = \alpha, \beta} (\mathbf{K}^*(\mathbf{x}, \mathbf{x}))_{l,l'}$. Both $\bar{m}(\mathbf{x})$ and $\bar{\sigma}^2(\mathbf{x})$ can be used in the standard way with any acquisition function $\mathrm{acq}(\mathbf{x})$, such as the expected improvement (EI) [14].
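To illustrate the model and the acquisition, the following NumPy/SciPy sketch builds the combined covariance $\mathbf{B}_{\mathrm{lin}} \otimes \mathbf{K}_{\mathrm{lin}} + \mathbf{B}_{\mathrm{se}} \otimes \mathbf{K}_{\mathrm{se}}$ for inputs carrying explicit task indices, and evaluates EI on the averaged objective. This is a minimal hand-rolled sketch under our own naming (`multi_output_K`, `averaged_ei`) and placeholder hyperparameters, not the code used in the experiments; in practice a multi-output GP library such as GPy would supply the coregionalized model and its predictive mean and covariance.

```python
import numpy as np
from scipy.stats import norm

def k_lin(X1, X2, variance=1.0):
    """Linear kernel: k(x, x') = variance * <x, x'>."""
    return variance * X1 @ X2.T

def k_se_ard(X1, X2, lengthscales, variance=1.0):
    """Squared exponential kernel with one lengthscale per input dimension (ARD)."""
    Z1, Z2 = X1 / lengthscales, X2 / lengthscales
    sqdist = (Z1**2).sum(1)[:, None] + (Z2**2).sum(1)[None, :] - 2 * Z1 @ Z2.T
    return variance * np.exp(-0.5 * sqdist)

def coregionalization(w, kappa):
    """B = w w^T + kappa * I, a 2x2 coregionalization matrix."""
    w = np.asarray(w, dtype=float).reshape(-1, 1)
    return w @ w.T + kappa * np.eye(len(w))

def multi_output_K(X, tasks, B_lin, B_se, lengthscales):
    """K[i, j] = B_lin[t_i, t_j] * k_lin(x_i, x_j) + B_se[t_i, t_j] * k_se(x_i, x_j).

    tasks[i] is the output index of row i of X (0: transcription, 1: translation),
    so the coregionalization matrices act on task indices while the base kernels
    act on the gene features, mirroring the Hadamard construction in the text.
    """
    Bl, Bs = B_lin[np.ix_(tasks, tasks)], B_se[np.ix_(tasks, tasks)]
    return Bl * k_lin(X, X) + Bs * k_se_ard(X, X, lengthscales)

def averaged_ei(mean_vec, cov_mat, y_best):
    """Expected improvement on the average of the two tasks at a candidate x.

    mean_vec: predictive means (f*_alpha(x), f*_beta(x)).
    cov_mat:  2x2 predictive covariance K*(x, x) across the two tasks.
    y_best:   best averaged objective observed so far (the incumbent).
    """
    m_bar = 0.5 * np.sum(mean_vec)           # mean of the averaged objective
    s_bar = np.sqrt(0.25 * np.sum(cov_mat))  # its standard deviation
    if s_bar == 0.0:
        return 0.0
    z = (m_bar - y_best) / s_bar
    return (m_bar - y_best) * norm.cdf(z) + s_bar * norm.pdf(z)

# Toy usage: 4 inputs in R^3, the first two observed for task 0, the rest for task 1.
X = np.random.rand(4, 3)
tasks = np.array([0, 0, 1, 1])
B_lin = coregionalization([0.5, 1.0], kappa=0.1)   # placeholder hyperparameters
B_se = coregionalization([1.0, 0.8], kappa=0.2)
K = multi_output_K(X, tasks, B_lin, B_se, lengthscales=np.ones(3))
```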
Consider the optimal gene design given by $\mathbf{x}^\star = \arg\max_{\mathbf{x} \in \mathcal{X}} \mathrm{acq}(\mathbf{x})$. Assume that we are interested in the production of a certain protein whose sequence is $s_k$. To improve the sequence $s_k$ according to the optimal design rules $\mathbf{x}^\star$ without changing the nature of the protein, we can interchange redundant codons, that is, codons encoding the same amino acid; see Figure 1 (bottom left). Given a set of sequences satisfying this criterion, we introduce an evaluation function to rank them in terms of their coherence with the optimal design. In particular we choose $\mathrm{eval}(s \mid \mathbf{x}^\star) = \sum_{j=1}^{p} w_j \, |x_j - x^\star_j|$, where $s$ is a 'coherent' sequence, the $x_j$ are the features of $s$, the $x^\star_j$ are the features of the optimal design, and the $w_j$ are weights that we choose to be the inverse lengthscales of $\mathbf{K}_{\mathrm{se}}$. See Algorithm 1.
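A minimal sketch of this ranking step is shown below. The function names `eval_sequence` and `rank_candidates` are ours, purely for illustration, and the inverse lengthscales are assumed to be read off the fitted squared exponential kernel.

```python
import numpy as np

def eval_sequence(x_candidate, x_star, inv_lengthscales):
    """eval(s | x*) = sum_j w_j |x_j - x*_j|, with w_j the inverse SE lengthscales."""
    return float(np.sum(inv_lengthscales * np.abs(x_candidate - x_star)))

def rank_candidates(candidate_features, x_star, inv_lengthscales):
    """Rank candidate sequences (given by their feature vectors) by coherence with x*.

    Returns candidate indices sorted from most to least coherent, so the first
    index corresponds to the sequence s_{t+1} selected in Algorithm 1.
    """
    scores = np.array([eval_sequence(x, x_star, inv_lengthscales)
                       for x in candidate_features])
    return np.argsort(scores)
```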
In this experiment we use Bayesian optimization to design genes able to optimize the transcription and translation rates of mammalian cells. We use the dataset in [13], which contains the rates of 3810 genes in mammalian cells.

Figure 2: Top: Inverse lengthscales of the ARD component of the model. Bottom left: Optimal design rules selected using EI, shown in the context of the true performance of 3810 genes in mammalian cells. Bottom right.
We have shown that Bayesian optimization principles can be successfully used in the design of synthetic genes. One of the most important aspects of this process is to have a good surrogate model for the cell behavior, able to lead to appropriate acquisition functions. For future models, the fact that the cell is an extremely complex system will be a key aspect to take into account. To optimize certain features of the cell, massive amounts of data will be needed, which will require the use of sparse Gaussian processes. Regarding the optimization aspects of the problem, in this work we have worked with a set of features extracted from the gene sequences, which we have used to obtain gene design rules rather than optimal sequences. The use of more features will potentially lead to better and more specific gene designs. This will require, however, the development of scalable Bayesian optimization methods able to work well in high dimensions, along the lines of some recent work [17, 5, 2, 7]. An alternative approach is to optimize directly over the sequences rather than over extracted features, omitting any prior biological knowledge. This seems feasible from the modeling point of view by means of string or related kernels [8], but the optimization of the acquisition functions derived from these models remains challenging.

Acknowledgments: The authors would like to thank BRIC-BBSRC for the support of this project (No BB/K011197/1).
References

[1] Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: A review. Found. Trends Mach. Learn., 4(3):195-266, March 2012.

[2] James Bergstra, Rémy Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In NIPS'2011, 2011.

[3] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.

[4] Paul S. Freemont, Richard I. Kitney, Geoff Baldwin, Travis Bayer, Robert Dickinson, Tom Ellis, Karen Polizzi, and Guy-Bart Stan. Synthetic Biology - A Primer. World Scientific Publishing, 1st edition, July.

[5] Roman Garnett, Michael A. Osborne, and Philipp Hennig. Active learning of linear embeddings for Gaussian processes. In Conference on Uncertainty in Artificial Intelligence (UAI 2014), 2014.

[6] Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13, 2012.

[7] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the 5th International Conference on Learning and Intelligent Optimization, LION'05, pages 507-523, Berlin, Heidelberg, 2011. Springer-Verlag.

[8] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. J. Mach. Learn. Res., 2:419-444, March 2002.

[9] Jorge Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773-782, 1980.

[10] M. Osborne. Bayesian Gaussian Processes for Sequential Prediction, Optimisation and Quadrature. PhD thesis, University of Oxford, 2010.

[11] Michael Pedersen and Andrew Phillips. Towards programming languages for genetic engineering of living cells. Journal of the Royal Society Interface, April 2009.

[12] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

[13] Björn Schwanhäusser, Dorothea Busse, Na Li, Gunnar Dittmar, Johannes Schuchhardt, Jana Wolf, Wei Chen, and Matthias Selbach. Global quantification of mammalian gene expression control. Nature, 473(7347):337-342, May 2011.

[14] Jasper Snoek, Hugo Larochelle, and Ryan Prescott Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, 2012.

[15] Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2004-2012. Curran Associates, Inc., 2013.

[16] Yee Whye Teh and Matthias Seeger. Semiparametric latent factor models. In Workshop on Artificial Intelligence and Statistics 10, 2005.

[17] Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, and Nando de Freitas. Bayesian optimization in high dimensions via random embeddings. In International Joint Conference on Artificial Intelligence (IJCAI), Distinguished Paper Award, 2013.