F1000Research | 2021

Learning the regulatory grammar of DNA for gene expression engineering


Abstract


The DNA regulatory code of gene expression is encoded in the gene regulatory structure, which spans the coding and adjacent non-coding regulatory DNA regions. Deciphering this regulatory code, and how the whole gene structure interacts to produce mRNA transcripts and regulate mRNA abundance, can greatly improve our ability to control gene expression. Here, we consider that natural systems offer the most accurate information on gene expression regulation and apply deep learning to over 20,000 mRNA datasets to learn the DNA-encoded regulatory code across a variety of model organisms, from bacteria to human [1]. We find that up to 82% of the variation in gene expression is encoded in the gene regulatory structure across all model organisms. Coding and regulatory regions carry both overlapping and new, orthogonal information, and contribute additively to gene expression prediction. By mining the gene expression models for the relevant DNA regulatory motifs, we uncover that motif interactions across the whole gene regulatory structure define over 3 orders of magnitude of gene expression levels. Finally, we experimentally verify the usefulness of our AI-guided approach for protein expression engineering. Our results suggest that single motifs or regulatory regions might not be solely responsible for regulating gene expression levels. Instead, the whole gene regulatory structure, which contains the DNA regulatory grammar of interacting DNA motifs across the protein-coding and non-coding regulatory regions, forms a coevolved transcriptional regulatory unit. This provides a route by which whole gene systems with pre-specified expression patterns can be designed.

Volume 10
DOI 10.7490/F1000RESEARCH.1118484.1
Language English
Journal F1000Research
