Deciphering the regulatory genome of Escherichia coli, one hundred promoters at a time

Abstract

Advances in DNA sequencing have revolutionized our ability to read genomes. However, even in the most well-studied of organisms, the bacterium Escherichia coli, for ≈ 65% of the promoters we remain completely ignorant of their regulation. Until we have cracked this regulatory Rosetta Stone, efforts to read and write genomes will remain haphazard. We introduce a new method (Reg-Seq) linking a massively-parallel reporter assay and mass spectrometry to produce a base pair resolution dissection of more than 100 promoters in E. coli in 12 different growth conditions. First, we show that our method recapitulates regulatory information from known sequences. Then, we examine the regulatory architectures for more than 80 promoters in the E. coli genome which previously had no known regulation. In many cases, we also identify which transcription factors mediate their regulation. The method introduced here clears a path for fully characterizing the regulatory genome of model organisms, with the potential of moving on to an array of other microbes of ecological and medical relevance.


William T. Ireland, Suzannah M. Beeler, Emanuel Flores-Bautista, Nathan M. Belliveau†, Michael J. Sweredoski, Annie Moradian, Justin B. Kinney, Rob Phillips

Department of Physics, California Institute of Technology, Pasadena, CA 91125
Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125
Proteome Exploration Laboratory, Beckman Institute, California Institute of Technology, Pasadena, CA 91125
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724
Department of Applied Physics, California Institute of Technology, Pasadena, CA 91125
† Present address: Howard Hughes Medical Institute and Department of Biology, University of Washington, Seattle, WA 98195
* Corresponding author: [email protected]


DNA sequencing is as important to biology as the telescope is to astronomy. We are now living in the age of genomics, where DNA sequencing has become cheap and routine. However, despite these incredible advances, how all of this genomic information is regulated and deployed remains largely enigmatic. Organisms must respond to their environments through regulation of genes. Genomic methods often provide a “parts” list but leave us uncertain about how those parts are used creatively and constructively in space and time. Yet, we know that promoters apply all-important dynamic logical operations that control when and where genetic information is accessed. In this paper, we demonstrate how we can infer the logical and regulatory interactions that control bacterial decision making by tapping into the power of DNA sequencing as a biophysical tool. The method introduced here provides a framework for solving the problem of deciphering the regulatory genome by connecting perturbation and response, mapping information flow from individual nucleotides in a promoter sequence to downstream gene expression, and determining how much information each promoter base pair carries about the level of gene expression.

The advent of RNA-Seq [1] launched a new era in which sequencing could be used as an experimental read-out of the biophysically interesting counts of mRNA, rather than simply as a tool for collecting ever more complete organismal genomes. The slew of ‘X’-Seq technologies that are available continues to expand at a dizzying pace, each serving its own creative and insightful role: RNA-Seq, ChIP-Seq, Tn-Seq, SELEX, 5C, etc. [2].
In contrast to whole-genome screening approaches based on sequencing, such as Tn-Seq [3] and ChIP-Seq [4], which give a coarse-grained view of gene essentiality and regulation respectively, another class of experiments known as massively-parallel reporter assays (MPRA) has been used to study gene expression in a variety of contexts [5, 63, 7, 8, 9, 10, 11, 12]. One elegant study relevant to the bacterial case of interest here by [13] screened more than 10 combinations of promoter and ribosome binding sites (RBS). Even more recently, they have utilized MPRAs in sophisticated ways to search for regulated genes across the genome [14, 15], in a way we see as being complementary to our own. While their approach yields a coarse-grained view of where regulation may be occurring, our approach yields a base-pair-by-base-pair view of how exactly that regulation is being enacted.

One of the most exciting X-Seq tools based on MPRAs with broad biophysical reach is the Sort-Seq approach developed by [63]. Sort-Seq uses fluorescence activated cell sorting (FACS) based on changes in the fluorescence due to mutated promoters to identify the specific locations of transcription factor binding in the genome. Importantly, it also provides a readout of how promoter sequences control the level of gene expression with single base-pair resolution. The results of such a massively-parallel reporter assay make it possible to build a biophysical model of gene regulation to uncover how previously uncharacterized promoters are regulated. In particular, high-resolution studies like those described here yield quantitative predictions about promoter organization and protein-DNA interactions as described by energy matrices [63]. This allows us to employ the tools of statistical physics to describe the input-output properties of each of these promoters, which can be explored much further with in-depth experimental dissection like that done by [16] and [17] and summarized in [18].
In this sense, the Sort-Seq approach provides a quantitative framework not only to discover and quantitatively dissect regulatory interactions at the promoter level, but also to design genetic circuits with a desired expression output in an interpretable way [69].

Earlier work from [64] illustrated how Sort-Seq, used in conjunction with mass spectrometry, can identify which transcription factors bind to a given binding site, thus enabling the mechanistic dissection of promoters which previously had no regulatory annotation. However, a crucial drawback of the approach of [64] is that while it is high-throughput at the level of a single gene and the number of promoter variants it accesses, it was unable to readily tackle multiple genes at once, still leaving much of the unannotated genome untouched. Even in one of biology’s best understood organisms, the bacterium Escherichia coli, we remain completely ignorant of how more than 65% of its genes are regulated [65, 64]. If we hope to some day have a complete base pair resolution mapping of how genetic sequences relate to biological function, we must first be able to do so for the promoters of this “simple” organism.

What has been missing in uncovering the regulatory genome in organisms of all kinds is a large-scale method for inferring genomic logic and regulation. Here we replace the low-throughput fluorescence-based Sort-Seq approach with a scalable RNA-Seq based approach that makes it possible to attack multiple promoters at once, setting the stage for the possibility of, to first approximation, uncovering the entirety of the regulatory genome. Accordingly, we refer to the entirety of our approach (MPRA, information footprints and energy matrices, mass spectrometry for transcription factor identification) as Reg-Seq, which we employ here on over one hundred promoters.
The concept of MPRA methods is to perturb promoter regions by mutating them and then to use sequencing to read out both the perturbation and the resulting gene expression [5, 63, 7, 8, 9, 10, 11, 12]. We generate a broad diversity of promoter sequences for each promoter of interest and use mutual information as a metric to measure information flow from that distribution of sequences to gene expression. Thus, Reg-Seq is able to collect causal information about candidate regulatory sequences that is then complemented by mass spectrometry, which allows us to find which transcription factors mediate the action of those newly discovered candidate regulatory sequences. Hence, Reg-Seq solves the causal problem of linking DNA sequence to regulatory logic and information flow.

To demonstrate our ability to scale up Sort-Seq with the sequencing based Reg-Seq protocol, we report here our results for 113 E. coli genes, whose regulatory architectures (i.e. gene-by-gene distributions of transcription factor (TF) binding sites and identities of TFs that bind those sites) were determined in parallel. By taking the Sort-Seq approach from a gene-by-gene method to a more whole-genome approach, we can begin to piece together not just how individual promoters are regulated, but also the nature of gene-gene interactions by revealing how certain transcription factors serve to regulate multiple genes at once. This approach has the benefits of a high-throughput assay while sacrificing little of the resolution afforded by the previous gene-by-gene approach, allowing us to uncover a large swath of the E. coli regulome, with base-pair resolution, in one set of experiments.

The organization of the remainder of the paper is as follows. In the Results section, we provide a global view of the discoveries we made in our exploration of more than 100 promoters in E. coli

E. coli genome and opening up the quantitative dissection of other non-model organisms. Lastly, in the Methods section and fleshed out further in the Supplementary Information, we describe our methodology and benchmark it against our own earlier Sort-Seq experiments to show that using RNA-Seq as a readout of the expression of mutated promoters is equally reliable as the fluorescence-based approach.

Results

As shown in Figure 1, we have considered more than 100 genes from across the E. coli genome. Our choices were based on a number of factors (see Sections 1.1 and 1.2 of the SI for more details); namely, we wanted a subset of genes that served as a “gold standard” for which the hard work of generations of molecular biologists has yielded deep insights into their regulation. The set includes lacZYA, znuCB, znuA, ompR, araC, marR, relBE, dgoR, dicC, ftsK, xylA, xylF, dpiBA, rspA, dicA, and araAB. By using Reg-Seq on these genes we were able to demonstrate that this method recovers what was already known about binding sites and transcription factors for well-characterized promoters, and to determine whether there are any important differences between the results of the methods presented here and the previous generation of experiments based on fluorescence and cell-sorting as a readout of gene expression. These promoters of known regulatory architecture are complemented by an array of previously uncharacterized genes that we selected in part using data from a recent proteomic study, in which mass spectrometry was used to measure the copy number of different proteins in 22 distinct growth conditions [60]. We selected genes that exhibited a wide variation in their copy number over the different growth conditions considered, reasoning that differential expression across growth conditions implies that those genes are under regulatory control.

As noted in the introduction, the original formulation of Reg-Seq, termed Sort-Seq, was based on the use of fluorescence activated cell sorting one gene at a time as a way to uncover putative binding sites for previously uncharacterized promoters [64]. As a result, as shown in Figure 2, we have formulated a second generation version that permits a high-throughput interrogation of the genome. A comparison between the Sort-Seq and Reg-Seq approaches for the same genes is shown in Supplemental Figure S1.
In the Reg-Seq approach, for each promoter interrogated, we generate a library of mutated variants and design each variant to express an mRNA with a unique sequence barcode. By counting the frequency of each expressed barcode using RNA-Seq, we can assess the differential expression from our promoter of interest based on the base-pair-by-base-pair sequence of its promoter. Using the mutual information between mRNA counts and sequences, we develop an information footprint that reveals the importance of different bases in the promoter region to the overall level of expression. We locate potential transcription factor binding regions by looking for clusters of base pairs that have a significant effect on gene expression. Further details on how potential binding sites are identified are found in the Methods section. Blue regions of the histogram shown in the information footprints of Figure 2 correspond to hypothesized activating sequences and red regions of the histogram correspond to hypothesized repressing sequences. With the information footprint in hand, we can then determine energy matrices and sequence logos (described in the next section). Given putative binding sites, we construct oligonucleotides that serve as fishing hooks to fish out the transcription factors that bind to those putative binding sites using DNA-affinity chromatography and mass spectrometry [24]. Given all of this information, we can then formulate a schematized view of the newly discovered regulatory architecture of the previously uncharacterized promoter. For the case schematized in Figure 2, the experimental pipeline yields a complete picture of a simple repression architecture (i.e. a gene regulated by a single binding site for a repressor).
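The search for clusters of base pairs with a consistent effect on expression can be illustrated with a minimal sketch. This is not the procedure from the Methods section: the function name `candidate_sites`, the fixed `threshold`, the `min_len` cutoff, and the sign convention (positive values mean a mutation increases expression) are all assumptions made here for illustration.

```python
def candidate_sites(signed_info, threshold, min_len=10):
    """Return (start, end, kind) for runs of at least min_len contiguous
    positions whose |value| >= threshold and which share a sign.

    signed_info: per-position footprint values, signed so that a positive
    value means mutations there increase expression (hypothetical convention).
    end is exclusive.
    """
    sites = []
    start = None
    sign = 0
    # A trailing 0.0 acts as a sentinel that closes any run still open.
    for i, value in enumerate(list(signed_info) + [0.0]):
        s = (value > 0) - (value < 0) if abs(value) >= threshold else 0
        if s != sign:
            if sign != 0 and i - start >= min_len:
                kind = "repressing" if sign > 0 else "activating"
                sites.append((start, i, kind))
            start, sign = i, s
    return sites


# Made-up footprint: a 12 bp run where mutations raise expression,
# followed by a 4 bp run (too short to report) where they lower it.
footprint = [0.0] * 5 + [0.01] * 12 + [0.0] * 3 + [-0.02] * 4 + [0.0] * 2
sites = candidate_sites(footprint, threshold=0.005, min_len=10)
```

With these toy values only the 12 bp run is reported, as a candidate repressor site; the short run falls below the length cutoff.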
[Figure 1 graphic: circular map of the 4.6 Mbp E. coli genome with oriC marked, showing regulated operons (34%), operons with no known regulation (66%), and the promoters dissected here.]

Figure 1: The E. coli regulatory genome. Illustration of the current ignorance with respect to how genes are regulated in E. coli, with genes with previously annotated regulation (as reported on RegulonDB [23]) denoted with blue ticks and genes with no previously annotated regulation denoted with red ticks. The 113 genes explored in this study are labeled in gray.

[Figure 2 graphic: schematic of the Reg-Seq workflow. Steps shown: mutate regulatory regions, construct libraries, grow library in chosen growth condition, sequence mRNAs and count, create an information footprint (information in bits versus position), create a regulatory hypothesis, create energy matrix (energy in k_B T versus position), and identify the transcription factor with mass spectrometry (enrichment of wild-type versus scrambled sequence).]

Figure 2: The Reg-Seq procedure used to determine how a given promoter is regulated. This process is as follows: After constructing a promoter library driving expression of a randomized barcode (an average of 5 for each promoter), RNA-Seq is conducted to determine the frequency of these mRNA barcodes across different growth conditions. By computing the mutual information between DNA sequence and mRNA barcode counts for each base pair in the promoter region, an “information footprint” is constructed, yielding a regulatory hypothesis for the putative binding sites. Energy matrices, which describe the effect any given mutation has on DNA binding energy, and sequence logos are inferred for the putative transcription factor binding sites. Next, we identify which transcription factor preferentially binds to the putative binding site via DNA affinity chromatography followed by mass spectrometry. Finally, this procedure culminates in a coarse-grained cartoon-level view of our regulatory hypothesis for how this given promoter is regulated.

Visual tools for data presentation

Throughout our investigation of the more than 100 genes explored in this study, we repeatedly relied on several key approaches to help make sense of the immense amount of data generated in these experiments. As these different approaches to viewing the results will appear repeatedly throughout the paper, here we familiarize the reader with five graphical representations referred to respectively as information footprints, energy matrices, sequence logos, mass spectrometry enrichment plots and regulatory cartoons, which taken all together provide a quantitative description of previously uncharacterized promoters.

Information footprints:

From our mutagenized libraries of promoter regions, we can build up a base-pair-by-base-pair graphical understanding of how the promoter sequence relates to the level of gene expression in the form of the information footprint shown in the middle of Figure 2. In this plot, the bar above each base pair position represents how large an effect mutations at this location have on the level of gene expression. Specifically, the quantity plotted is the mutual information $I_b$ at base pair $b$ between mutation of a base pair at that position and the level of expression. In mathematical terms, the mutual information measures how much the joint probability $p(m, \mu)$ differs from the product of the probabilities $p_{\mathrm{mut}}(m)\, p_{\mathrm{expr}}(\mu)$ which would be produced if mutation and gene expression level were independent. Formally, the mutual information between having a mutation at position $b$ and the level of expression is given by

$$I_b = \sum_{m=0}^{1} \sum_{\mu=0}^{1} p(m, \mu) \log_2 \left( \frac{p(m, \mu)}{p_{\mathrm{mut}}(m)\, p_{\mathrm{expr}}(\mu)} \right). \tag{1}$$

Note that both $m$ and $\mu$ are binary variables that characterize the mutational state of the base of interest and the level of expression, respectively. Specifically, $m$ can take the values

$$m = \begin{cases} 0, & \text{if } b \text{ is a mutated base} \\ 1, & \text{if } b \text{ is a wild-type base,} \end{cases} \tag{2}$$

and $\mu$ can take on values

$$\mu = \begin{cases} 0, & \text{for sequencing reads from the DNA library} \\ 1, & \text{for sequencing reads originating from mRNA,} \end{cases} \tag{3}$$

where $m$ and $\mu$ are index variables: $m$ tells us whether the base has been mutated, and $\mu$ tells us whether the read at that position corresponds to an mRNA, reflecting gene expression, or to a promoter, reflecting a member of the library. The higher the ratio of mRNA to DNA reads at a given base position, the higher the expression. $p_{\mathrm{mut}}(m)$ in Equation 1 refers to the probability that a given sequencing read will be from a mutated ($m = 0$) or wild-type ($m = 1$) base. $p_{\mathrm{expr}}(\mu)$ is a normalizing factor that gives the ratio of the number of DNA or mRNA sequencing counts to the total number of counts.

Furthermore, we color the bars based on whether mutations at this location lowered gene expression on average (in blue, indicating an activating role) or increased gene expression (in red, indicating a repressing role). Within these footprints, we look for regions of approximately 10 to 20 contiguous base pairs which impact gene expression similarly (either increasing or decreasing), as these regions implicate the influence of a transcription factor binding site. In this experiment, we targeted the regulatory regions based on a guess of where a transcription start site (TSS) will be, based on experimentally confirmed sites contained in RegulonDB [65], a 5’ RACE experiment [61], or by targeting small intergenic regions. After completing the Reg-Seq experiment, we note that many of the presumed TSS sites are not in the locations assumed, that some promoters have multiple active RNA polymerase (RNAP) sites and TSS, and that the primary TSS can shift with growth condition. To simplify the data presentation, the ‘0’ base pair in all information footprints is set to the originally assumed base pair for the primary TSS, rather than one of the TSS that was found in the experiment.
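As a minimal sketch of how the mutual information of Equation 1 could be evaluated at a single base pair, the snippet below starts from a 2 × 2 table of read counts. The function name `mutual_information_bits`, the `counts[m][mu]` layout, and the example counts are illustrative assumptions rather than part of the published analysis pipeline; the logarithm is taken base 2 so that the result is in bits.

```python
import math


def mutual_information_bits(counts):
    """Mutual information (in bits) between mutation state m and read type mu.

    counts[m][mu] is the number of reads with m = 0 (mutated) or 1 (wild type)
    at the base of interest and mu = 0 (DNA library) or 1 (mRNA).
    """
    total = sum(sum(row) for row in counts)
    p = [[c / total for c in row] for row in counts]         # joint p(m, mu)
    p_mut = [sum(row) for row in p]                          # marginal over mu
    p_expr = [p[0][mu] + p[1][mu] for mu in range(2)]        # marginal over m
    info = 0.0
    for m in range(2):
        for mu in range(2):
            if p[m][mu] > 0:  # terms with zero probability contribute nothing
                info += p[m][mu] * math.log2(p[m][mu] / (p_mut[m] * p_expr[mu]))
    return info


# Made-up counts: mutation state carries no information about read type...
independent = [[100, 100], [400, 400]]
# ...versus mutated sequences being depleted among the mRNA reads.
dependent = [[150, 50], [350, 450]]

info_independent = mutual_information_bits(independent)
info_dependent = mutual_information_bits(dependent)
```

In the first table the joint distribution factorizes, so the mutual information vanishes; in the second, mutated bases are under-represented among mRNA reads and the value is positive.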
As can be seen throughout the paper (see Figure 4 for several examples of each of the main types of regulatory architectures) and the online resource, we present such information footprints for every promoter we have considered, with one such information footprint for every growth condition.

Energy matrices:

Focusing on an individual putative transcription factor binding site as revealed in the information footprint, we are interested in a more fine-grained, quantitative understanding of how the underlying protein-DNA interaction is determined. An energy matrix displays this information using a heat map format, where each column is a position in the putative binding site and each row displays the effect on binding that results from mutating to that given nucleotide (given as a change in the DNA-TF interaction energy upon mutation) [63, 26, 27]. These energy matrices are scaled such that the wild type sequence is colored in white, mutations that improve binding are shown in blue, and mutations that weaken binding are shown in red. These energy matrices encode a full quantitative picture of how we expect sequence to relate to binding for a given transcription factor, such that we can provide a prediction for the binding energy of every possible binding site sequence as

$$\text{binding energy} = \sum_{i=1}^{N} \varepsilon_i, \tag{4}$$

where the energy matrix is predicated on an assumption of a linear binding model in which each base within the binding site region contributes a specific value ($\varepsilon_i$ for the $i$th base in the sequence) to the total binding energy. Energy matrices are either given in A.U. (arbitrary units) or, if the gene has a simple repression or activation architecture with a single RNA polymerase (RNAP) site, are assigned $k_B T$ energy units following the procedure in [63] and validated on the lac operon in [69].
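The linear binding model of Equation 4 amounts to a table lookup followed by a sum. The sketch below assumes a hypothetical three-position matrix stored as one row per position with entries ordered A, C, G, T and wild-type entries set to zero; the numbers are invented and the name `binding_energy` is chosen here only for illustration.

```python
BASES = "ACGT"


def binding_energy(energy_matrix, site):
    """Sum the per-position contributions epsilon_i for a given site sequence.

    energy_matrix[i][j] is the energy contribution of base BASES[j] at
    position i, relative to the wild-type base (whose entry is 0).
    """
    if len(site) != len(energy_matrix):
        raise ValueError("site length must match the number of matrix positions")
    return sum(energy_matrix[i][BASES.index(base)] for i, base in enumerate(site))


# Hypothetical energy matrix (arbitrary units) for a wild-type site "ACT".
toy_matrix = [
    [0.0, 1.2, 0.8, 1.5],    # position 1, wild type A
    [0.9, 0.0, -0.3, 1.1],   # position 2, wild type C
    [1.0, 0.7, 1.4, 0.0],    # position 3, wild type T
]

wild_type_energy = binding_energy(toy_matrix, "ACT")   # all wild-type entries
improved_energy = binding_energy(toy_matrix, "AGT")    # C -> G strengthens binding
weakened_energy = binding_energy(toy_matrix, "CGA")    # mixed effects
```

Because each position contributes independently, the energy of any of the 4^N possible sites follows from the same N × 4 table, which is what makes the matrix a prediction for every possible binding site sequence.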

Sequence logos:

From an energy matrix, we can also represent a preferred transcription factor binding site with the use of the letters corresponding to the four possible nucleotides, as is often done with position weight matrices [28]. In these sequence logos, the size of the letters corresponds to how strong the preference is for that given nucleotide at that given position, which can be directly computed from the energy matrix. This method of visualizing the information contained within the energy matrix is more easily digested and allows for quick comparison among various binding sites.
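One common way to turn an energy matrix into letter heights is sketched below. It assumes, purely for illustration, that energies are in units of $k_B T$, that the preference for each base follows a Boltzmann weight exp(−ε), and that the background base composition is uniform; the text does not state which conversion was used, so this should not be read as the authors' procedure. The name `logo_heights` is hypothetical.

```python
import math


def logo_heights(energy_matrix):
    """Convert per-position energies (rows ordered A, C, G, T) to logo heights.

    Each row is mapped to Boltzmann probabilities p ~ exp(-energy); the
    information content of the position (in bits, relative to a uniform
    background) is 2 + sum(p * log2(p)), and each letter's height is its
    probability times that information content.
    """
    heights = []
    for row in energy_matrix:
        weights = [math.exp(-energy) for energy in row]
        z = sum(weights)
        probs = [w / z for w in weights]
        info = 2.0 + sum(p * math.log2(p) for p in probs if p > 0)
        heights.append([p * info for p in probs])
    return heights


# Hypothetical two-position matrix: no preference, then a strong preference for A.
example = logo_heights([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 10.0, 10.0, 10.0],
])
```

A position with no sequence preference yields letters of zero height, while a position that strongly favors one base approaches the maximum of 2 bits for that letter.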

Mass spectrometry enrichment plots:

As the final piece of our experimental pipeline, we wish to determine the identity of the transcription factor we suspect is binding to our putative binding site that is represented in the energy matrix and sequence logo. While the details of the DNA affinity chromatography and mass spectrometry can be found in the Methods, the results of these experiments are displayed in enrichment plots such as the one shown in the bottom panel of Figure 2. In these plots, the relative abundance of each protein bound to our site of interest is quantified relative to a scrambled control sequence. The putative transcription factor is the one we find to be highly enriched compared to all other DNA binding proteins.
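The logic of an enrichment plot can be sketched as a ratio of abundances followed by a ranking. The protein names and abundance values below are invented placeholders, and `enrichment_ratios` and `top_candidate` are hypothetical helper names; the actual quantification used in the mass spectrometry experiments is described in the Methods and is not reproduced here.

```python
def enrichment_ratios(target, control):
    """Ratio of each protein's abundance on the target site to the scrambled control.

    Proteins missing from the control, or with zero control abundance, are skipped.
    """
    return {
        protein: target[protein] / control[protein]
        for protein in target
        if control.get(protein, 0) > 0
    }


def top_candidate(target, control):
    """Return the protein with the largest enrichment ratio."""
    ratios = enrichment_ratios(target, control)
    return max(ratios, key=ratios.get)


# Made-up abundances (arbitrary units) for three DNA-binding proteins.
target_abundance = {"protein_A": 120.0, "protein_B": 10.0, "protein_C": 8.0}
control_abundance = {"protein_A": 10.0, "protein_B": 9.0, "protein_C": 8.0}

ratios = enrichment_ratios(target_abundance, control_abundance)
candidate = top_candidate(target_abundance, control_abundance)
```

Here `protein_A` stands out with a twelve-fold enrichment while the others stay near one, mirroring the situation in which a single transcription factor is highly enriched relative to all other DNA binding proteins.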

Regulatory cartoons:

The ultimate result of all these detailed base-pair-by-base-pair resolution experiments is a cartoon model of how we think the given promoter is being regulated. A complete set of cartoons for all the architectures considered in our study is presented in Supplemental Figure S6. While the cartoon serves as a convenient visual way to summarize our results, it is important to remember that these cartoons are a shorthand representation of all the data in the four quantitative measures described above and are in fact backed by quantitative predictions of how we expect the system to behave, which can be tested experimentally. Throughout this paper we use consistent iconography to illustrate the regulatory architecture of promoters, with activators and their binding sites in green, repressors in red, and RNAP in blue.

E. coli regulatory architectures

Figure 3 (and Tables 1 and 2) provides a summary of the discoveries made in the work done here using our next generation Reg-Seq approach. Figure 3(A) provides a shorthand notation that conveniently characterizes the different kinds of regulatory architectures found in bacteria. In previous work [29], we have explored the entirety of what is known about the regulatory genome of E. coli, revealing that the most common motif is the (0, 0) constitutive architecture, though we hypothesized that this is not a statement about the facts of the E. coli genome, but rather a reflection of our collective regulatory ignorance, in the sense that we suspect that with further investigation, many of these apparent constitutive architectures will be found to be regulated under the right environmental conditions. The two most common regulatory architectures that emerged from our previous database survey are the (0, 1) and (1, 0) architectures, the simple repression motif and the simple activation motif, respectively. It is interesting to consider that the (0, 1) architecture is in fact the repressor-operator model originally introduced in the early 1960s by Jacob and Monod as the concept of gene regulation emerged [30]. Now we see retrospectively the far-reaching importance of that architecture across the E. coli genome.

For the 113 genes we considered, Figure 3(B) summarizes the number of simple repression (0, 1) architectures discovered, the number of simple activation (1, 0) architectures discovered, and so on. A comparison of the frequency of the different architectures found in our study to the frequencies of all the known architectures in the RegulonDB database is provided in Supplemental Figure S7. Tables 1 and 2 provide a more detailed view of our results. As seen in Table 1, of the 113 genes we considered, 32 revealed no signature of any transcription factor binding sites and are labeled as (0, 0). The simple repression architecture (0, 1) was found 26 times, the simple activation architecture (1, 0) was found 13 times, and more complex architectures featuring multiple binding sites (e.g. (1, 1), (0, 2), (2, 0), etc.) were revealed as well. Further, for 18 of the genes, which we label “inactive”, Reg-Seq didn’t even reveal an RNAP binding site. The lack of an observable RNAP site could be because the proper growth condition to get high levels of expression was not used, or because the mutation window chosen for the gene does not capture a highly transcribing TSS. The tables also include our set of 16 “gold standard” genes for which previous work has resulted in a knowledge (sometimes only partial) of their regulatory architectures. We find that our method recovers the regulatory elements of these gold standard cases fully in 12 out of 16 cases, and the majority of regulatory elements in 2 of the remainder. Overall, the performance of Reg-Seq in these gold-standard cases (for more details see Supplemental Figure S2) builds confidence in the approach. Further, the failure modes inform us of the blind spots of Reg-Seq. For example, we find it challenging to observe weaker binding sites when multiple strong binding sites are also present, such as in the marRAB operon. Additionally, the method will fail when there is no active TSS in the mutation window, as occurred in the case of dicA. Further details on the comparison to gold standard genes can be found in SI Section 2.2.

[Figure 3 graphic: (A) cartoons of promoter architectures with the shorthand (number of activators, number of repressors); (B) bar chart of the number of promoters for each promoter architecture; (C) histogram of the number of TF binding sites versus location of TF binding site, for activators and repressors.]

Figure 3: A summary of regulatory architectures discovered in this study. (A) The cartoons display a representative example of each type of architecture, along with the corresponding shorthand notation. (B) Counts of the different regulatory architectures discovered in this study. Only those promoters where at least one new binding site was discovered are included in this figure. If one repressor was newly discovered and two activators were previously known, then the architecture is still counted as a (2, 1) architecture. (C) Distribution of positions of binding sites discovered in this study for activators and repressors. Only newly discovered binding sites are included in this figure. The positions of the TF binding sites are calculated relative to the estimated TSS location, which is based on the location of the associated RNAP site.

We observe that the most common motif to emerge from our work is the simple repression motif. Another relevant regulatory statistic is shown in Figure 3(C), where we see the distribution of binding site positions.
Our own experience in the use of different quantitative modeling approaches to consider transcriptional regulation reveals that, for now, we remain largely ignorant of how to account for transcription factor binding site position, and datasets like that presented here will begin to provide data that can help us uncover how this parameter dictates gene expression. Indeed, with binding site positions and energy matrices in hand, we can systematically move these binding sites and explore the implications for the level of gene expression, providing a systematic tool to understand the role of binding-site position.

Architecture        Total number    Number of promoters with at least
                    of promoters    one newly discovered binding site
All architectures   113             52
(0,0)               32              0
(0,1)               26              23
(1,0)               13              10
(1,1)               6               6
(0,2)               4               3
(2,0)               6               5
(2,1)               2               2
(1,2)               1               1
(2,2)               1               1
(3,0)               3               1
(0,4)               1               0
inactive            18              0

Table 1: All promoters examined in this study, categorized according to type of regulatory architecture. Those promoters which have no recognizable RNAP site are labeled as inactive rather than constitutively expressed (0, 0).

Figure 4 delves more deeply into the various regulatory architectures described in Figure 3(B) by showing several example promoters for each of the different architecture types. In each of the cases shown in the figure, prior to the work presented here, these promoters had no regulatory information in relevant databases such as EcoCyc [31] and RegulonDB [65]. Now, using the sequencing methods explained above, we were able to identify candidate binding sites. For a number of cases, these putative binding sites were then used to synthesize oligonucleotide probes to enrich and identify their corresponding putative transcription factor using mass spectrometry. While Figure 4 gives a sense of the kinds of regulatory architectures we discovered in this study, our entire collection of regulatory cartoons can be found in Supplementary Figure S6.

A recent paper christened that part of the

E. coli genome for which the function of the genes isunknown the y-ome [32]. Their surprising ﬁnding is that roughly 35% of the genes in the

E. coli genome are functionally unannotated. The situation is likely worse for other organisms. For manyof the genes in the y-ome, we remain similarly ignorant of how those genes are regulated. Figures4 and 5 provide several examples from the y-ome, of genes and transcription factors for whichlittle to nothing was previously known. As shown in Figure 5, our study has found the ﬁrst exam-ples that we are aware of in the entire

E. coli

[Figure 4 graphic. Panels (A)-(E) each pair a promoter architecture cartoon, labeled (number of activators, number of repressors), with an information footprint (information in bits versus position, colored by whether mutation increases or decreases expression) and enrichment ratios from mass spectrometry. Promoters labeled in the panels include rspA, aphA, idnK, leuABCD, maoP, yjjJ, and ybjX.]

Figure 4: Newly discovered or updated regulatory architectures. Examples of information footprints, gene knockouts, and mass spectrometry data used to identify transcription factors for five genes. (A) Examples of simple repression, i.e. (0, 1) architectures where the locations of the putative binding sites are highlighted in red and the identities of the bound transcription factors are revealed in the mass spectrometry data. (B) An example of a (2, 0) architecture. During aerobic growth FNR is inactive, but the DeoR site now has a significant effect on expression. (C) An example of a (0, 2) architecture. yjjJ is regulated by MarA, which is only active in growth with sodium salicylate, and an unknown repressor. (D) An example of a (2, 1) architecture. (E) An example of a (2, 2) architecture.

The ability to find binding sites for both widely acting regulators and transcription factors which may have only a few sites in the whole genome allows us to get an in-depth and quantitative view of any given promoter. As indicated in Figures 5(A) and (B), we were able to perform the relevant search and capture for the transcription factors that bind our putative binding sites. In both of these cases, we now hypothesize that these newly discovered binding site-transcription factor pairs exert their control through repression. The ability to extract the quantitative features of regulatory control through energy matrices means that we can take a nearly unstudied gene such as ykgE, which is regulated by an understudied transcription factor YieP, and quickly get to the point at which we can do quantitative modeling in the style that we and many others have performed on the lac operon [35, 36, 37, 63, 70, 39, 69, 18].

One of the revealing case studies that demonstrates the broad reach of our approach for discovering regulatory architectures is offered by the insights we have gained into two widely acting regulators, GlpR [40] and FNR [41, 42].
In both cases, we have expanded the array of promoters that they are now known to regulate. Further, these two case studies illustrate that even for widely acting transcription factors, there is a large gap in regulatory knowledge and the approach advanced here has the power to discover new regulatory motifs. The newly discovered binding sites in Figure 6(A) more than double the number of operons known to be regulated by GlpR as reported in RegulonDB [65]. We found 5 newly regulated operons in our data set, even though we were not specifically targeting GlpR regulation. Although the number of example promoters across the genome that we considered is too small to make good estimates, finding 5 regulated operons out of approximately 100 examined operons supports the claim that GlpR regulates widely and that many more of its sites would be found in a full search of the genome. The regulatory roles revealed in Figure 6(A) also reinforce the evidence that GlpR is a repressor.

For the GlpR-regulated operons newly discovered here, we found that this repressor binds strongly in the presence of glucose while all other growth conditions result in greatly diminished, but not entirely abolished, binding (Figure 6(A)). As there is no previously known direct molecular interaction between GlpR and glucose and the repression is reduced but not eliminated, the derepression in the absence of glucose is likely an indirect effect. As a potential mechanism of the indirect effect, gpsA is known to be activated by CRP [43], and GpsA is involved in the synthesis of glycerol-3-phosphate (G3P), a known binding partner of GlpR which disables its repressive activity [44].
Thus in the presence of glucose GpsA and consequently G3P will be found in low concentration, ultimately allowing GlpR to fulfill its role as a repressor.

Prior to this study, there were 4 operons known to be regulated by GlpR, each with between 4 and 8 GlpR binding sites [23], where the absence of glucose and the partial induction of GlpR was not enough to prompt a notable change in gene expression [45]. These previously explored operons seemingly are regulated as part of an AND gate, where high G3P concentration and an absence of glucose are both required for high gene expression. By way of contrast, we have discovered operons whose regulation appears to be mediated by a single GlpR site per operon. With only a single site, GlpR functions as an indirect glucose sensor, as only the absence of glucose is needed to relieve repression by GlpR.
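The text above notes that 5 newly GlpR-regulated operons among approximately 100 examined is too small a sample for good estimates. One way to make that caveat concrete is a binomial confidence interval. The sketch below is only illustrative: it assumes, which the study does not claim, that the roughly 100 operons behave like a random sample of the genome, and it uses the Wilson score interval as one reasonable choice; the function name is hypothetical.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials
                         + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# 5 GlpR-regulated operons out of approximately 100 examined (from the text).
low, high = wilson_interval(5, 100)
print(f"fraction of operons regulated by GlpR: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 2% to 11% of operons, which is consistent with the statement that GlpR acts widely while also showing how loosely the fraction is constrained.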

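[Figure 5 graphic. Panels (A) ykgE and (B) phnA show promoter cartoons, information footprints (information in bits versus position, colored by whether mutation increases or decreases expression), mass spectrometry enrichment, and energy matrices. Labeled factors include YieP, FNR, and YciT; the ykgE panel also includes a ∆YieP knockout and an anaerobic growth footprint.]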

Figure 5: Examples of the insight gained by Reg-Seq in the context of promoters with no previously known regulatory information. (A) From the information footprint of the ykgE promoter under different growth conditions, we can identify a repressor binding site downstream of the RNAP binding site. From the enrichment of proteins bound to the DNA sequence of the putative repressor as compared to a control sequence, we can identify YieP as the transcription factor bound to this site as it has a much higher enrichment ratio than any other protein. Lastly, the binding energy matrix for the repressor site along with corresponding sequence logo shows that the wild type sequence is the strongest possible binder and it displays an imperfect inverted repeat symmetry. (B) Illustration of a comparable dissection for the phnA promoter.
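The caption's observation that the wild type sequence is the strongest possible binder can be checked mechanically once an energy matrix is available. The following is a minimal sketch, assuming an additive matrix with one row per position and columns ordered A, C, G, T, and assuming that lower energy means stronger binding. The matrix values and the four-base site are invented toy numbers, not data from this study, and the function names are hypothetical.

```python
BASES = "ACGT"

def binding_energy(seq, matrix):
    """Sum the per-position contributions of seq under an additive energy matrix."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must equal the number of matrix rows")
    return sum(row[BASES.index(base)] for base, row in zip(seq, matrix))

def strongest_binder(matrix):
    """Return the sequence that takes the lowest-energy base at every position."""
    return "".join(BASES[row.index(min(row))] for row in matrix)

# Toy matrix (arbitrary units); rows are positions, columns are A, C, G, T.
toy_matrix = [
    [-1.0, 0.5, 0.3, 0.2],
    [0.4, 0.1, -0.8, 0.3],
    [0.2, 0.6, 0.5, -1.3],
    [0.3, -0.9, 0.2, 0.4],
]
wild_type = "AGTC"  # hypothetical wild type site

best = strongest_binder(toy_matrix)
print("strongest binder:", best)
print("wild type is strongest binder:", best == wild_type)
print("wild type energy:", binding_energy(wild_type, toy_matrix))
```

The same `binding_energy` function could be used to score a site after it has been moved or mutated, which is the kind of exploration the text proposes for studying binding-site position.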

[Figure 6 graphic. (A) GlpR: promoter cartoons and information footprints (information in bits versus position) with and without glucose for the tff, tig, maoP, rhlE, and rapA promoters, sequence logos for the GlpR binding sites, and mass spectrometry enrichment for rhlE. (B) FNR: promoter cartoons and information footprints under aerobic and anaerobic growth for the arcA, aphA, fdhE, yehU, ykgE, and yeiQ promoters, with sequence logos for the FNR binding sites.]

Figure 6: Reg-Seq analysis of broadly-acting transcription factors. (A) GlpR as a widely-acting regulator. Here we show the many promoters which we found to be regulated by GlpR, all of which were previously unknown. GlpR was demonstrated to bind to rhlE by mass spectrometry enrichment experiments as shown in the top right. Binding sites in the tff, tig, maoP, rhlE, and rapA promoters have similar DNA binding preferences as seen in the sequence logos and each TF binding site binds strongly only in the presence of glucose. These similarities suggest that the same TF binds to each site. To test this hypothesis we knocked out GlpR and ran the Reg-Seq experiments for tff, tig, and maoP. We see that knocking out GlpR removes the binding signature of the TF. (B) FNR as a global regulator. FNR is known to be upregulated in anaerobic growth, and here we found it to regulate a suite of six genes. In growth conditions with prevalent oxygen the putative FNR sites are weakened, and the DNA binding preferences of the six sites are shown to be similar from the sequence logos displayed on the right.

The second widely acting regulator our study revealed, FNR, has 151 binding sites already reported in RegulonDB and is well studied compared to most transcription factors [23]. However, the newly discovered FNR sites displayed in Figure 6(B) demonstrate that even for well-understood transcription factors there is much still to be uncovered. Our information footprints are in agreement with previous studies suggesting that FNR acts as an activator. In the presence of O2, dimeric FNR is converted to a monomeric form and its ability to bind DNA is greatly reduced [46]. Only in low oxygen conditions did we observe a binding signature from FNR, and we show a representative example of the information footprint from one of 11 growth conditions with plentiful oxygen in Figure 6(B).
[Figure 7 graphic. Panels (A) and (B) show promoter cartoons for arcA and for the fdoGHI-fdhE operon, information footprints (information in bits versus position) including a ∆ArcA knockout, energy matrices (energy in A.U.), and sequence logos comparing RegulonDB ArcA and FNR motifs with the ArcA site regulating fdhE and the FNR sites regulating arcA and fdhE.]

Figure 7: Inspection of an anaerobic respiration genetic circuit. (A) Here we see not only how the arcA promoter is regulated, but also the role this transcription factor plays in the regulation of another promoter. (B) Intra-operon regulation of fdhE by both FNR and ArcA. A TOMTOM [73] search of the binding motif found that ArcA was the most likely candidate for the transcription factor. A knockout of ArcA demonstrates that the binding signature of the site, and its associated RNAP site, are no longer significant determinants of gene expression.

We observe quantitatively how FNR affects the expression of fdhE both directly through transcription factor binding (Figure 7(A)) and indirectly through increased expression of ArcA (Figure 7(B)). Also, fully understanding even a single operon often requires investigating several regulatory regions as we have in the case of fdoGHI-fdhE by investigating the main promoter for the operon as well as the promoter upstream of fdhE. 36% of all multi-gene operons have at least one TSS which transcribes only a subset of the genes in the operon [48]. Regulation within an operon is even more poorly studied than regulation in general. The main promoter for fdoGHI-fdhE has a repressor binding site, which demonstrates that there is regulatory control of the entire operon. However, we also see in Figure 7(B) that there is control at the promoter level, as fdhE is regulated by both ArcA and FNR and will therefore be upregulated in anaerobic conditions [66]. The main TSS transcribes all four genes in the operon, while the secondary site shown in Figure 7(B) only transcribes fdhE, and therefore anaerobic conditions will change the stoichiometry of the proteins produced by the operon. At the higher throughput that we use in this experiment it becomes feasible to target multiple promoters within an operon as we have done with fdoGHI-fdhE. We can then determine under what conditions an operon is internally regulated.
Figure 7 also makes it clear that for cases such as fdoGHI-fdhE, there are many subtleties both in the interpretation of the information footprints and in the construction of regulatory cartoons that are simultaneously accurate and transparent. A crucial next step in the development of these analyses is to move from manual curation of the data to automated statistical analyses that can help make sense of these complicated datasets.

[Figure graphic: information footprint (information in bits versus position, colored by whether mutation increases or decreases expression) and energy matrix (energy in A.U.) for the bdcR promoter with an NsrR site, for growth in arabinose.]

The study of gene regulation is one of the centerpieces of modern biology. As a result, it is surprising that in the genome era, our ignorance of the regulatory landscape in even the best-understood model organisms remains so vast. Despite understanding the regulation of transcription initiation in bacterial promoters [50], and how to tune their expression, we lack an experimental framework to unravel understudied promoter architectures at scale. As such, in our view one of the grand challenges of the genome era is the need to uncover the regulatory landscape for each and every organism with a known genome sequence. Given the ability to read and write DNA sequence at will, we are convinced that to make that reading of DNA sequence truly informative about biological function and to give that writing the full power and poetry of what Crick christened “the two great polymer languages”, we need a full accounting of how the genes of a given organism are regulated and how environmental signals communicate with the transcription factors that mediate that regulation, the so-called “allosterome” problem [51]. The work presented here provides a general methodology for making progress on the former problem and also demonstrates that, by performing Reg-Seq in different growth conditions, we can make headway on the latter problem as well.

The advent of cheap DNA sequencing offers the promise of beginning to achieve that grand challenge goal in the form of MPRAs, reviewed in [12]. A particular implementation of such methods was christened Sort-Seq [63] and was demonstrated in the context of well understood regulatory architectures. A second generation of the Sort-Seq method [64] established experiments through the use of DNA-affinity chromatography and mass spectrometry which made it possible to identify the transcription factors that bind the putative binding sites discovered by Sort-Seq.
But there were critical shortcomings in the method, not least of which was that it lacked the scalability to uncover the regulatory genome on a genome-wide basis.

The work presented here builds on the foundations laid in the previous studies by invoking RNA-Seq as a readout of the level of expression of the promoter mutant libraries needed to infer information footprints and their corresponding energy matrices and sequence logos, followed by a combination of mass spectrometry and gene knockouts to identify the transcription factors that bind those sites. The case studies described in the main text showcase the ability of the method to deliver on the promise of beginning to uncover the regulatory genome systematically. The extensive online resources hint at a way of systematically reporting those insights in a way that can be used by the community at large to develop regulatory intuition for biological function and to design novel regulatory architectures using energy matrices.

However, several shortcomings remain in the approach introduced here. First, the current implementation of Reg-Seq still largely relies on manual curation as the basis of using information footprints to generate testable regulatory hypotheses. As described in the methods section, we have also used statistical testing as a way to convert information footprints into regulatory hypotheses, but there clearly remains much work to be done on the data analysis pipeline to improve both the power and the accuracy of this approach. In addition, these regulatory hypotheses can also be converted into gene regulatory models using statistical physics [52, 37].
However, here too, as the complexity of the regulatory architectures increases, it will be of great interest to use automated model generation as suggested in a recent biophysically-based neural network approach [53].

A second key challenge faced by the methods described here is that the mass spectrometry and the gene knockout confirmation aspects of the experimental pipeline remain low-throughput. To overcome this, we have begun to explore a new generation of experiments such as in vitro binding assays that will make it possible to accomplish transcription factor identification at higher throughput. Specifically, we are exploring multiplexed mass spectrometry measurements and multiplexed Reg-Seq on libraries of gene knockouts as ways to break the identification bottleneck.

Another shortcoming of the current implementation of the method is that it would miss regulatory action at a distance. Indeed, our laboratory has invested a significant effort in exploring such long-distance regulatory action in the form of DNA looping in bacteria and VDJ recombination in jawed vertebrates. It is well known that transcriptional control through enhancers in eukaryotic regulation is central in contexts ranging from embryonic development to hematopoiesis [9]. The current incarnation of the methods described here has focused on contiguous regions in the vicinity of the transcription start site. Clearly, to go further in dissecting the entire regulatory genome, these methods will have to be extended to non-contiguous regions of the genome.

The findings from this study provide a foundation for systematically performing genome-wide regulatory dissections. We have developed a method to pass from complete regulatory ignorance to designable regulatory architectures and we are hopeful that others will adopt these methods with the ambition of uncovering the regulatory architectures that preside over their organisms of interest.

Promoter variants were synthesized on a microarray (TWIST Bioscience, San Francisco, CA). The sequences were designed computationally such that each base in the 160 bp promoter region has a 10% probability of being mutated. For each given promoter’s library, we ensured that the mutation rate as averaged across all sequences was kept between 9.5% and 10.5%, otherwise the library was regenerated. There are an average of 2200 unique promoter sequences per gene (for an analysis of how our results depend upon number of unique promoter sequences see Supplementary Figure S3). An average of 5 unique 20 base pair barcodes per variant promoter was used for the purpose of counting transcripts. The barcode was inserted 110 base pairs from the 5’ end of the mRNA, containing 45 base pairs from the targeted regulatory region, 64 base pairs containing primer sites used in the construction of the plasmid, and 11 base pairs containing a three frame stop codon. All the sequences are listed in Supplementary Table 1. Following the barcode there is an RBS and a GFP coding region. Mutated promoters were PCR amplified and inserted by Gibson assembly into the plasmid backbone of pJK14 (SC101 origin) [63]. Constructs were electroporated into E. coli K-12 MG1655 [54].
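The library design rule described above (10% mutation probability per base, with regeneration if the library-wide average falls outside 9.5% to 10.5%) can be sketched in a few lines. This is an illustration rather than the authors' design code: it assumes a mutated base is replaced uniformly by one of the other three bases, and the function names, the seed, and the placeholder wild type sequence are all hypothetical.

```python
import random

BASES = "ACGT"

def mutate(wild_type, rate, rng):
    """Mutate each base independently with probability `rate`."""
    out = []
    for base in wild_type:
        if rng.random() < rate:
            # Assumption: the replacement is uniform over the other three bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def mutation_rate(wild_type, variants):
    """Fraction of bases differing from wild type, averaged over the library."""
    mismatches = sum(a != b for v in variants for a, b in zip(wild_type, v))
    return mismatches / (len(wild_type) * len(variants))

def make_library(wild_type, n_variants=2200, rate=0.10, tol=0.005,
                 seed=0, max_tries=100):
    """Draw unique variants; regenerate if the average rate is out of bounds."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        variants = set()
        while len(variants) < n_variants:
            variants.add(mutate(wild_type, rate, rng))
        variants = sorted(variants)
        if abs(mutation_rate(wild_type, variants) - rate) <= tol:
            return variants
    raise RuntimeError("no library met the mutation-rate tolerance")

wild_type = "ACGT" * 40  # placeholder 160 bp sequence, not a real promoter
library = make_library(wild_type)
print(len(library), round(mutation_rate(wild_type, library), 4))
```

With about 2200 sequences of 160 bp, the library-wide average fluctuates far less than the 0.5 percentage point tolerance, so in this sketch the regeneration branch is rarely taken; it is kept because the text states the check explicitly.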

Cells were grown to an optical density of 0.3 and RNA was then stabilized using Qiagen RNA Protect (Qiagen, Hilden, Germany). Lysis was performed using lysozyme (Sigma Aldrich, Saint Louis, MO) and RNA was isolated using the Qiagen RNA Mini Kit. Reverse transcription was performed using Superscript IV (Invitrogen, Carlsbad, CA) and a specific primer for the labeled mRNA. qPCR was performed to check the level of DNA contamination and the mRNA tags were PCR amplified and Illumina sequenced. Within a single growth condition, all promoter variants for all regulatory regions were tested in a single multiplexed RNA-Seq experiment. All sequencing was carried out by either the Millard and Muriel Jacobs Genetics and Genomics Laboratory at Caltech (HiSeq 2500) on a 100 bp single read flow cell or using the sequencing services from NGX Bio on a 250 bp or 150 bp paired end flow cell.
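Once the mRNA tags are sequenced, each read has to be assigned to a promoter variant through its 20 base pair barcode before transcripts can be counted. The sketch below shows that bookkeeping step under simplifying assumptions that the text does not specify: the barcode is taken to be the first 20 bases of each read, only exact barcode matches are counted, and the barcode-to-variant map is already known. All names and the toy reads are hypothetical.

```python
from collections import Counter

def count_by_variant(reads, barcode_to_variant, barcode_len=20):
    """Tally reads per promoter variant using exact barcode matches."""
    counts = Counter()
    for read in reads:
        barcode = read[:barcode_len]  # assumption: barcode starts the read
        variant = barcode_to_variant.get(barcode)
        if variant is not None:       # reads with unknown barcodes are skipped
            counts[variant] += 1
    return counts

# Toy data: two barcodes map to variant_1, one to variant_2.
barcode_to_variant = {
    "A" * 20: "variant_1",
    "C" * 20: "variant_1",
    "G" * 20: "variant_2",
}
reads = ["A" * 20 + "TTT", "C" * 20 + "GGA", "G" * 20 + "TTT", "T" * 20 + "AAA"]

counts = count_by_variant(reads, barcode_to_variant)
print(dict(counts))
```

Because several barcodes map to the same variant, summing over barcodes as done here pools their counts; keeping them separate instead would allow barcode-to-barcode variability to be inspected.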

To determine putative transcription factor binding sites, we first compute the effect of mutations on gene expression at a base pair-by-base pair level using information footprints. The information footprints are a hypothesis generating tool and we choose which regions to further investigate using techniques such as mass spectrometry by visually inspecting the data for regions of 10 to 20 base pairs that have high information content compared to background. Our technique currently relies on using human intuition to determine binding sites, but to validate these choices and to capture all regions important for gene expression we computationally identify regions where gene expression is changed significantly up or down by mutation (p < E. coli sigma factor binding sites (for example, do the preferred bases in the energy matrix have few mismatches to the TGNTATAAT extended minus 10 for σ70 sites), and the TOMTOM tool [73] to computationally compare the potential site to examples of σ , σ , and σ sites that we determined in this experiment. For further details see Supplementary Figure S4. We discard any sites that have a p-value of similarity with an RNAP site of less than 5 x − in the TOMTOM analysis or are deemed to be too visually similar to RNAP sites. If a single site contains an RNAP site along with a transcription factor site we remove only those bases containing the probable RNAP site. This results in 95 identified transcription factor binding regions.

For primary RNAP sites, we include a list of probable sigma factor identities as Supplementary Table 2. Sites are judged by visual similarity to consensus binding sites. Those sites where the true sigma factor is unclear due to overlapping binding sites are omitted. Overlapping binding sites (from multiple TFs or RNAP sites) in general can pose issues for this method.
In many cases, looking at growth conditions where only one of the relevant transcription factors is present or active is an effective way to establish site boundaries and infer correct energy matrices. For sites where no adequate growth condition can be found, or when a TF overlaps with an RNAP site, the energy matrix will not be reflective of the true DNA-protein interaction energies. If the TFs in overlapping sites are composed of one activator and one repressor, then we use the point at which the effect of mutation shifts from activator-like to repressor-like as a demarcation point between binding sites. We see a case of a potentially overlooked repressor due to overlapping sites in Figure 4(B), as there are several repressor-like bases overlapping the RNAP -10 site and the effect weakens in low oxygen growth. However, due to the effect of the RNAP site, when averaged over a potential 15 base pair region, the repressor-like bases do not have a significant effect on gene expression.

Upon identifying a putative transcription factor binding site, we used DNA affinity chromatography, as done in [64], to isolate and enrich for the transcription factor of interest. In brief, we order biotinylated oligos of our binding site of interest (Integrated DNA Technologies, Coralville, IA) along with a control, "scrambled" sequence, that we expect to have no specificity for the given transcription factor. We tether these oligos to magnetic streptavidin beads (Dynabeads MyOne T1; ThermoFisher, Waltham, MA), and incubate them overnight with whole cell lysate grown in the presence of either heavy (with 15N) or light (with 14N) lysine for the experimental and control sequences, respectively.
The next day, proteins are recovered by digesting the DNA with the PstI restriction enzyme (New England Biolabs, Ipswich, MA), whose cut site was incorporated into all designed oligos.

Protein samples were then prepared for mass spectrometry by either in-gel or in-solution digestion using the Lys-C protease (Wako Chemicals, Osaka, Japan). Liquid chromatography coupled mass spectrometry (LC-MS) was performed as previously described by [64], and is further discussed in the SI. SILAC labeling was performed by growing cells (∆LysA) in either the heavy isotope form of lysine or its natural form.

It is also important to note that while we relied on the SILAC method to identify the TF identity for each promoter, our approach does not require this specific technique. Specifically, our method only requires a way to contrast between the copy number of proteins bound to a target promoter in relation to a scrambled version of the promoter. In principle, one could use multiplexed proteomics based on isobaric mass tags [55] to characterize up to 10 promoters in parallel. Isobaric tags are reagents used to covalently modify peptides by using the heavy-isotope distribution in the tag to encode different conditions. The most widely adopted methods for isobaric tagging are the isobaric tag for relative and absolute quantitation (iTRAQ) and the tandem mass tag (TMT). This multiplexed approach involves the fragmentation of peptide ions by colliding with an inert gas. The resulting ions are resolved in a second MS-MS scan (MS2).

Only a subset (13) of all transcription factor targets were identified by mass spectrometry due to limitations in scaling the technique to large numbers of targets.
The transcription factors identified by this method are enriched more than any other DNA binding protein, with p <

Conducting DNA affinity chromatography followed by mass spectrometry on putative binding sites resulted in potential candidates for the transcription factors that are responsible for the information contained at a given promoter region. For some cases, to verify that a given transcription factor is, in fact, regulating a given promoter, we repeated the RNA sequencing experiments on strains with the transcription factor of interest knocked out.

To construct the knockout strains, we ordered strains from the Keio collection [58] from the Coli Genetic Stock Center. These knockouts were put in an MG1655 background via phage P1 transduction and verified with Sanger sequencing. To remove the kanamycin resistance that comes with the strains from the Keio collection, we transformed in the pCP20 plasmid, induced FLP recombinase, and then selected for colonies that no longer grew on either kanamycin or ampicillin. Finally, we transformed our desired promoter libraries into the constructed knockout strains, allowing us to perform the RNA sequencing in the same context as the original experiments.
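The information footprints used above to generate binding-site hypotheses can be illustrated with a small calculation. The methods text in this section does not restate the estimator, so the sketch below makes an explicit assumption: at each position, the footprint value is the mutual information (in bits) between whether the base is mutated and whether a sequencing read comes from the DNA library or from the mRNA pool. The input format, the function name, and the toy counts are hypothetical, and no smoothing or significance testing is included.

```python
import math

def information_footprint(wild_type, variants):
    """Mutual information (bits) per position between mutation status and read type.

    variants: list of (sequence, dna_count, rna_count) tuples.
    """
    footprint = []
    for pos, wt_base in enumerate(wild_type):
        # joint[m][mu]: m = 0 wild type / 1 mutated, mu = 0 DNA / 1 mRNA reads
        joint = [[0.0, 0.0], [0.0, 0.0]]
        for seq, dna, rna in variants:
            m = int(seq[pos] != wt_base)
            joint[m][0] += dna
            joint[m][1] += rna
        total = sum(map(sum, joint))
        info = 0.0
        for m in range(2):
            for mu in range(2):
                p = joint[m][mu] / total
                if p > 0:
                    p_m = sum(joint[m]) / total
                    p_mu = (joint[0][mu] + joint[1][mu]) / total
                    info += p * math.log2(p / (p_m * p_mu))
        footprint.append(info)
    return footprint

# Toy library: a mutation at position 0 lowers expression tenfold,
# a mutation at position 3 leaves it unchanged.
wild_type = "ACGT"
variants = [
    ("ACGT", 100, 100),
    ("TCGT", 100, 10),
    ("ACGA", 100, 100),
]
footprint = information_footprint(wild_type, variants)
print([round(x, 4) for x in footprint])
```

In this toy example position 0 carries the most information, positions 1 and 2 carry none because they are never mutated, and position 3 picks up a small nonzero value only because the library is tiny and unbalanced, which is one reason real footprints are compared against background before a region is called a binding site.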

All code used for processing data and plotting as well as the final processed data, plasmid sequences, and primer sequences can be found on the GitHub repository (https://github.com/RPGroup-PBoC/RNAseq_SortSeq), doi:10.5281/zenodo.3611914. Energy matrices were generated using the MPAthic software [59]. All raw sequencing data is available at the Sequence Read Archive (accession no. PRJNA599253). All inferred information footprints and energy matrices can be found on the CalTech data repository, doi:10.22002/D1.1331. All mass spectrometry raw data is available on the CalTech data repository, doi:10.22002/d1.1336.

We are grateful to Rachel Banks, Stephanie Barnes, Curt Callan, Griffin Chure, Ana Duarte, Vahe Galstyan, Hernan Garcia, Soichi Hirokawa, Thomas Lecuit, Heun Jin Lee, Madhav Mani, Nicholas McCarty, Muir Morrison, Steve Quake, Tom Röschinger, Manuel Razo-Mejia, Gabe Salmon, and Guillaume Urtecho for useful discussion and feedback on the manuscript. Guillaume Urtecho and Sri Kosuri have been instrumental in providing key advice and protocols at various stages in the development of this work. We would like to thank Jost Vielmetter and Nina Budaeva for providing access to their Cell Disruptor. Brett Lomenick provided crucial help and advice with protein preparation. We also thank Igor Antoshechkin for his help with sequencing at the Caltech Genomics Facility.

Funding: We are deeply grateful for support from NIH Grants DP1 OD000217 (Director’s Pioneer Award) and 1R35 GM118043-01 (Maximizing Investigators Research Award) which made it possible to undertake this multi-year project. N.M.B. was supported by an HHMI International Student Research Fellowship. S.M.B. was supported by the NIH Institutional National Research Service Award (5T32GM007616-38) provided through Caltech. The Proteome Exploration Laboratory is supported by the Beckman Institute and NIH 1S10OD02001301.

References

[1] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7):621–628, 2008.
[2] Tim Stuart and Rahul Satija. Integrative single-cell analysis. Nature Reviews Genetics, 20:257–272, 2019.
[3] Emily C. A. Goodall, Ashley Robinson, Iain G. Johnston, Sara Jabbari, Keith A. Turner, Adam F. Cunningham, Peter A. Lund, Jeffrey A. Cole, and Ian R. Henderson. The essential genome of Escherichia coli K-12. mBio, 9(1), 2018.
[4] Ye Gao, James T Yurkovich, Sang Woo Seo, Ilyas Kabimoldayev, Ke Chen, Anand V Sastry, Xin Fang, Nathan Mih, Laurence Yang, Johannes Eichner, Byung-kwan Cho, Donghyuk Kim, and Bernhard O Palsson. Systematic discovery of uncharacterized transcription factors in Escherichia coli K-12 MG1655. Nucleic Acids Research, 46(20):10682–10696, 2018.
[5] R. P. Patwardhan, C. Lee, O. Litvin, D. L. Young, D. Pe'er, and J. Shendure. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nature Biotechnology, 27(12):1173–1175, 2009.
[6] Justin B Kinney, Anand Murugan, Curtis G Callan, and Edward C Cox. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proceedings of the National Academy of Sciences of the United States of America, 107(20):9158–9163, 2010.
[7] E. Sharon, Y. Kalma, A. Sharp, T. Raveh-Sadka, M. Levo, D. Zeevi, L. Keren, Z. Yakhini, A. Weinberger, and E. Segal. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nature Biotechnology, 30(6):521–30, 2012.
[8] R. P. Patwardhan, J. B. Hiatt, D. M. Witten, M. J. Kim, R. P. Smith, D. May, C. Lee, J. M. Andrie, S. I. Lee, G. M. Cooper, N. Ahituv, L. A. Pennacchio, and J. Shendure. Massively parallel functional dissection of mammalian enhancers in vivo. Nature Biotechnology, 30(3):265–70, 2012.
[9] Alexandre Melnikov, Anand Murugan, Xiaolan Zhang, Tiberiu Tesileanu, Li Wang, Peter Rogov, Soheil Feizi, Andreas Gnirke, Curtis G Callan Jr, Justin B Kinney, Manolis Kellis, Eric S Lander, and Tarjei S Mikkelsen. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay. Nature Biotechnology, 30(3):271–277, 2012.
[10] J. C. Kwasnieski, I. Mogno, C. A. Myers, J. C. Corbo, and B. A. Cohen. Complex effects of nucleotide variants in a mammalian cis-regulatory element. Proc Natl Acad Sci U S A, 109(47):19498–503, 2012.
[11] C. P. Fulco, J. Nasser, T. R. Jones, G. Munson, D. T. Bergman, V. Subramanian, S. R. Grossman, R. Anyoha, B. R. Doughty, T. A. Patwardhan, T. H. Nguyen, M. Kane, E. M. Perez, N. C. Durand, C. A. Lareau, E. K. Stamenova, E. L. Aiden, E. S. Lander, and J. M. Engreitz. Activity-by-contact model of enhancer-promoter regulation from thousands of CRISPR perturbations. Nature Genetics, 51(12):1664–1669, 2019.
[12] Justin B. Kinney and David M. McCandlish. Massively parallel assays and quantitative sequence-function relationships. Annual Review of Genomics and Human Genetics, 20(1):99–127, 2019.
[13] Sriram Kosuri, Daniel B Goodman, Guillaume Cambray, Vivek K Mutalik, and Yuan Gao. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proceedings of the National Academy of Sciences of the United States of America, 110(34), 2013.
[14] G. Urtecho, A. D. Tripp, K. D. Insigne, H. Kim, and S. Kosuri. Systematic Dissection of Sequence Elements Controlling sigma 70 Promoters Using a Genomically Encoded Multiplexed Reporter Assay in Escherichia coli. Biochemistry, 58(11):1539–1551, 2019.
[15] Guillaume Urtecho, Kimberly Insigne, Arielle D Tripp, Marcia Brinck, Nathan B Lubock, Hwangbeom Kim, Tracey Chan, and Sriram Kosuri. Genome-wide functional characterization of Escherichia coli promoters and regulatory elements responsible for their function. bioRxiv, 2020.
[16] Manuel Razo-Mejia, Stephanie L. Barnes, Nathan M. Belliveau, Griffin Chure, Tal Einav, Mitchell Lewis, and Rob Phillips. Tuning Transcriptional Regulation through Signaling: A Predictive Theory of Allosteric Induction. Cell Systems, 6(4):456–469.e10, 2018.
[17] Griffin Chure, Manuel Razo-Mejia, Nathan M. Belliveau, Tal Einav, Zofii A. Kaczmarek, Stephanie L. Barnes, Mitchell Lewis, and Rob Phillips. Predictive shifts in free energy couple mutations to their phenotypic consequences. Proceedings of the National Academy of Sciences of the United States of America, 116(37):18275–18284, 2019.
[18] R. Phillips, N. M. Belliveau, G. Chure, H. G. Garcia, M. Razo-Mejia, and C. Scholes. Figure 1 Theory Meets Figure 2 Experiments in the Study of Gene Expression. Annual Review of Biophysics, 48:121–163, 2019.
[19] Stephanie L. Barnes, Nathan M. Belliveau, William T. Ireland, Justin B. Kinney, and Rob Phillips. Mapping DNA sequence to transcription factor binding energy in vivo. PLoS Computational Biology, 15(2):1–29, 2019.
[20] Nathan M. Belliveau, Stephanie L. Barnes, William T. Ireland, Daniel L. Jones, Michael J. Sweredoski, Annie Moradian, Sonja Hess, Justin B. Kinney, and Rob Phillips. Systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria. Proceedings of the National Academy of Sciences of the United States of America, 115(21):E4796–E4805, 2018.
[21] Alberto Santos-Zavaleta, Heladia Salgado, Socorro Gama-castro, G Laura, Daniela Ledezma-tejeida, S Mishael, Santiago Garc, Kevin Alquicira-hern, Luis Jos, Pablo Pe, Cecilia Ishida-guti, David A Vel, Del Moral-ch, James Galagan, and Julio Collado-vides. RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12. Nucleic Acids Research, 47:212–220, 2019.
[22] Alexander Schmidt, Karl Kochanowski, Silke Vedelaar, Erik Ahrne, Benjamin Volkmer, Luciano Callipo, Kevin Knoops, Manuel Bauer, Ruedi Aebersold, and Matthias Heinemann. The quantitative and condition-dependent Escherichia coli proteome. Nature Biotechnology, 34(1):104–110, 2015.
[23] Socorro Gama-Castro, Heladia Salgado, Alberto Santos-Zavaleta, Daniela Ledezma-Tejeida, Luis Muñiz-Rascado, Jair Santiago García-Sotelo, Kevin Alquicira-Hernández, Irma Martínez-Flores, Lucia Pannier, Jaime Abraham Castro-Mondragón, Alejandra Medina-Rivera, Hilda Solano-Lira, César Bonavides-Martínez, Ernesto Pérez-Rueda, Shirley Alquicira-Hernández, Liliana Porrón-Sotelo, Alejandra López-Fuentes, Anastasia Hernández-Koutoucheva, Víctor Del Moral-Chavez, Fabio Rinaldi, and Julio Collado-Vides. RegulonDB version 9.0: High-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Research, 44(D1):D133–D143, 2016.
[24] G. Mittler, F. Butter, and M. Mann. A SILAC-based DNA protein interaction screen that identifies candidate binding proteins to functional DNA elements. Genome Res, 19(2):284–93, 2009.
[25] Alfredo Mendoza-Vargas, Leticia Olvera, Maricela Olvera, Ricardo Grande, Leticia Vega-Alvarado, Blanca Taboada, Verónica Jimenez-Jacinto, Heladia Salgado, Katy Juárez, Bruno Contreras-Moreira, Araceli M. Huerta, Julio Collado-Vides, and Enrique Morett. Genome-Wide Identification of Transcription Start Sites, Promoters and Transcription Factor Binding Sites in E. coli. PLoS ONE, 4(10):e7526, October 2009.
[26] O. G. Berg and P. H. von Hippel. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J Mol Biol, 193(4):723–50, 1987.
[27] G. D. Stormo and D. S. Fields. Specificity, free energy and information content in protein-DNA interactions. Trends Biochem Sci, 23(3):109–13, 1998.
[28] Thomas D. Schneider and R. Michael Stephens. Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, 18(20):6097–6100, 1990.
[29] M. Rydenfelt, H. G. Garcia, R. S. Cox III, and R. Phillips. The influence of promoter architectures and regulatory motifs on gene expression in Escherichia coli. PLoS One, 9(12):e114347, 2014.
[30] François Jacob and Jacques Monod. On the regulation of gene activity. Cold Spring Harbor Symposia on Quantitative Biology, 26:19, 1961.
[31] Ingrid M. Keseler, Amanda Mackie, Alberto Santos-Zavaleta, Richard Billington, César Bonavides-Martínez, Ron Caspi, Carol Fulcher, Socorro Gama-Castro, Anamika Kothari, Markus Krummenacker, Mario Latendresse, Luis Muñiz-Rascado, Quang Ong, Suzanne Paley, Martin Peralta-Gil, Pallavi Subhraveti, David A. Velázquez-Ramírez, Daniel Weaver, Julio Collado-Vides, Ian Paulsen, and Peter D. Karp. The EcoCyc database: reflecting new knowledge about Escherichia coli K-12. Nucleic Acids Research, 45(D1):D543–D550, 2016.
[32] Sankha Ghatak, Zachary A. King, Anand Sastry, and Bernhard O. Palsson. The y-ome defines the 35% of Escherichia coli genes that lack experimental evidence of function. Nucleic Acids Research, 47(5):2446–2454, 2019.
[33] Jonathan D. Partridge, Diane M. Bodenmiller, Michael S. Humphrys, and Stephen Spiro. NsrR targets in the Escherichia coli genome: new insights into DNA sequence requirements for binding and a role for NsrR in the regulation of motility. Molecular Microbiology, 73(4):680–694, 2009.
[34] Kyu Y. Rhee, Donald F. Senear, and G. Wesley Hatfield. Activation of Gene Expression by a Ligand-induced Conformational Change of a Protein-DNA Complex. Journal of Biological Chemistry, 273(18):11257–11266, May 1998.
[35] J. M. Vilar and S. Leibler. DNA looping and physical constraints on transcription regulation. J Mol Biol, 331(5):981–9, 2003.
[36] J. M. Vilar, C. C. Guet, and S. Leibler. Modeling network dynamics: the lac operon, a case study. J Cell Biol, 161(3):471–6, 2003.
[37] Lacramioara Bintu, Nicolas E Buchler, Hernan G Garcia, Ulrich Gerland, Terence Hwa, Jan Kondev, and Rob Phillips. Transcriptional regulation by the numbers: models. Current Opinion in Genetics & Development, 15(2):116–124, April 2005.
[38] Hernan G. Garcia and Rob Phillips. Quantitative dissection of the simple repression input-output function. Proceedings of the National Academy of Sciences, 108(29):12173–12178, July 2011.
[39] J. M. Vilar and L. Saiz. Reliable prediction of complex phenotypes from a modular design in free energy space: an extensive exploration of the lac operon. ACS Synth Biol, 2(10):576–86, 2013.
[40] H Schweizer, W Boos, and T J Larson. Repressor for the sn-glycerol-3-phosphate regulon of Escherichia coli K-12: cloning of the glpR gene and identification of its product. Journal of Bacteriology, 161(2):563–566, 1985.
[41] Heinz Körner, Heidi J. Sofia, and Walter G. Zumft. Phylogeny of the bacterial superfamily of Crp-Fnr transcription regulators: exploiting the metabolic spectrum by controlling alternative gene programs. FEMS Microbiology Reviews, 27(5):559–592, December 2003.
[42] Manika Kargeti and K. V. Venkatesh. The effect of global transcriptional regulators on the anaerobic fermentative metabolism of Escherichia coli. Molecular BioSystems, 13(7):1388–1398, 2017.
[43] H. K. Seoh and P. C. Tai. Catabolic repression of secB expression is positively controlled by cyclic AMP (cAMP) receptor protein-cAMP complexes at the transcriptional level. Journal of Bacteriology, 181(6):1892–1899, March 1999.
[44] Timothy J Larson, Shanzhang Ye, Deborah L Weissenborn, and Heidi J Hoffmann. Purification and Characterization of the Repressor for the sn-Glycerol 3-Phosphate Regulon of Escherichia coli K12. Journal of Biological Chemistry, 262(33):15869–15874, 1987.
[45] E. C. C. Lin. Glycerol dissimilation and its regulation in bacteria. Annual Review of Microbiology, 30(1):535–578, 1976.
[46] Kevin S. Myers, Huihuang Yan, Irene M. Ong, Dongjun Chung, Kun Liang, Frances Tran, Sündüz Keleş, Robert Landick, and Patricia J. Kiley. Genome-scale analysis of Escherichia coli FNR reveals complex features of transcription factor binding. PLOS Genetics, 9(6):1–24, June 2013.
[47] Shobhit Gupta, John A. Stamatoyannopoulos, Timothy L. Bailey, and William Stafford Noble. Quantifying similarity between motifs. Genome Biology, 8(2), 2007.
[48] Tyrrell Conway, James P. Creecy, Scott M. Maddox, Joe E. Grissom, Trevor L. Conkle, Tyler M. Shadid, Jun Teramoto, Phillip San Miguel, Tomohiro Shimada, Akira Ishihama, Hirotada Mori, and Barry L. Wanner. Unprecedented High-Resolution View of Bacterial Operon Architecture Revealed by RNA Sequencing. mBio, 5(4):e01442–14, July 2014.
[49] Inès Compan and Danièle Touati. Anaerobic activation of arcA transcription in Escherichia coli: roles of Fnr and ArcA. Molecular Microbiology, 11(5):955–964, 1994.
[50] Douglas F Browning and Stephen J W Busby. Local and global regulation of transcription initiation in bacteria. Nature Reviews Microbiology, pages 638–650, 2016.
[51] Janet E. Lindsley and Jared Rutter. Whence cometh the allosterome? Proceedings of the National Academy of Sciences of the United States of America, 103(28):10533–10535, 2006.
[52] Nicolas E Buchler, Ulrich Gerland, and Terence Hwa. On schemes of combinatorial transcription logic. Proceedings of the National Academy of Sciences, 100(9):5136–5141, April 2003.
[53] A. Tareen and J. B. Kinney. Biophysical models of cis-regulation as interpretable neural networks. bioRxiv, 2019.
[54] F. R. Blattner. The Complete Genome Sequence of Escherichia coli K-12. Science, 277(5331):1453–1462, September 1997.
[55] Nishant Pappireddi, Lance Martin, and Martin Wühr. A Review on Quantitative Multiplexed Proteomics. ChemBioChem, 20(10):1210–1224, 2019.
[56] Jürgen Cox and Matthias Mann. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology, 26(12):1367–1372, December 2008.
[57] Yoav Benjamini and Yosef Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, January 1995.
[58] Natsuko Yamamoto, Kenji Nakahigashi, Tomoko Nakamichi, Mihoko Yoshino, Yuki Takai, Yae Touda, Akemi Furubayashi, Satoko Kinjyo, Hitomi Dose, Miki Hasegawa, Kirill A Datsenko, Toru Nakayashiki, Masaru Tomita, Barry L Wanner, and Hirotada Mori. Update on the Keio collection of Escherichia coli single-gene deletion mutants. Molecular Systems Biology, 5:335–335, 2009.
[59] William T. Ireland and Justin B. Kinney. MPAthic: Quantitative Modeling of Sequence-Function Relationships for massively parallel assays. preprint, Bioinformatics, May 2016.
[60] Alexander Schmidt, Karl Kochanowski, Silke Vedelaar, Erik Ahrne, Benjamin Volkmer, Luciano Callipo, Kevin Knoops, Manuel Bauer, Ruedi Aebersold, and Matthias Heinemann. The quantitative and condition-dependent Escherichia coli proteome. Nature Biotechnology, 34(1):104–110, 2015.
[61] Alfredo Mendoza-Vargas, Leticia Olvera, Maricela Olvera, Ricardo Grande, Leticia Vega-Alvarado, Blanca Taboada, Verónica Jimenez-Jacinto, Heladia Salgado, Katy Juárez, Bruno Contreras-Moreira, Araceli M. Huerta, Julio Collado-Vides, and Enrique Morett. Genome-Wide Identification of Transcription Start Sites, Promoters and Transcription Factor Binding Sites in E. coli. PLoS ONE, 4(10):e7526, October 2009.
[62] Tanja Magoč and Steven L. Salzberg. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics, 27(21):2957–2963, November 2011.
[63] Justin B Kinney, Anand Murugan, Curtis G Callan, and Edward C Cox. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proceedings of the National Academy of Sciences of the United States of America, 107(20):9158–9163, 2010.
[64] Nathan M. Belliveau, Stephanie L. Barnes, William T. Ireland, Daniel L. Jones, Michael J. Sweredoski, Annie Moradian, Sonja Hess, Justin B. Kinney, and Rob Phillips. Systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria. Proceedings of the National Academy of Sciences of the United States of America, 115(21):E4796–E4805, 2018.
[65] Alberto Santos-Zavaleta, Heladia Salgado, Socorro Gama-castro, G Laura, Daniela Ledezma-tejeida, S Mishael, Santiago Garc, Kevin Alquicira-hern, Luis Jos, Pablo Pe, Cecilia Ishida-guti, David A Vel, Del Moral-ch, James Galagan, and Julio Collado-vides. RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12. Nucleic Acids Research, 47:212–220, 2019.
[66] Inès Compan and Danièle Touati. Anaerobic activation of arcA transcription in Escherichia coli: roles of Fnr and ArcA. Molecular Microbiology, 11(5):955–964, 1994.
[67] Rahul Kumar and Kazuyuki Shimizu. Transcriptional regulation of main metabolic pathways of cyoA, cydB, fnr, and fur gene knockout Escherichia coli in C-limited and N-limited aerobic continuous cultures. Microbial Cell Factories, 10(1):3, 2011.
[68] A M Easton and S R Kushner. Transcription of the uvrD gene of Escherichia coli is controlled by the lexA repressor and by attenuation. Nucleic Acids Research, 11(24):8625–8640, December 1983.
[69] Stephanie L. Barnes, Nathan M. Belliveau, William T. Ireland, Justin B. Kinney, and Rob Phillips. Mapping DNA sequence to transcription factor binding energy in vivo. PLoS Computational Biology, 15(2):1–29, 2019.
[70] Hernan G. Garcia and Rob Phillips. Quantitative dissection of the simple repression input-output function. Proceedings of the National Academy of Sciences, 108(29):12173–12178, July 2011.
[71] Jürgen Cox and Matthias Mann. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology, 26(12):1367–1372, December 2008.
[72] Jürgen Cox, Ivan Matic, Maximiliane Hilger, Nagarjuna Nagaraj, Matthias Selbach, Jesper V Olsen, and Matthias Mann. A practical guide to the MaxQuant computational platform for SILAC-based quantitative proteomics. Nature Protocols, 4(5):698–705, May 2009.
[73] Shobhit Gupta, John A. Stamatoyannopoulos, Timothy L. Bailey, and William Stafford Noble. Quantifying similarity between motifs. Genome Biology, 8(2), 2007.
[74] Jonathan D. Partridge, Diane M. Bodenmiller, Michael S. Humphrys, and Stephen Spiro. NsrR targets in the Escherichia coli genome: new insights into DNA sequence requirements for binding and a role for NsrR in the regulation of motility. Molecular Microbiology, 73(4):680–694, 2009.
[75] Kyu Y. Rhee, Donald F. Senear, and G. Wesley Hatfield. Activation of Gene Expression by a Ligand-induced Conformational Change of a Protein-DNA Complex. Journal of Biological Chemistry, 273(18):11257–11266, May 1998.

Supplementary Information for “Deciphering the regulatory genome of Escherichia coli, one hundred promoters at a time”

Extended details of experimental design

Genes in this study were chosen to cover several different categories. 29 genes had some information on their regulation already known, which allowed us to validate our method under a number of conditions. 37 were chosen because the work of [60] demonstrated that their expression changed significantly under different growth conditions. A handful of genes, such as minC, maoP, or fdhE, were chosen either because we found their physiological significance interesting, as in the case of the cell division gene minC, or because we found the gene regulatory question interesting, as for the intra-operon regulation demonstrated by fdhE. The remainder of the genes were chosen because they had no regulatory information, often had minimal information about the function of the gene, and had an annotated transcription start site (TSS) in RegulonDB.

A known limitation of the experiment is that the mutational window is limited to 160 bp. As such, it is important to correctly target the mutation window to the location around the most active TSS. To do this, we first prioritized those TSS which have been extensively experimentally validated and catalogued in RegulonDB. Second, we selected those sites which had evidence of active transcription from RACE experiments [61] and were listed in RegulonDB. If the intergenic region was small enough, we covered the entire region with our mutation window. If none of these options were available, we used computationally predicted start sites.
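The priority order used to place the mutation window can be summarized as a short decision procedure. The Python sketch below is only an illustration of that ordering, not code from this work: the `TssCandidate` class, its evidence labels, and the `choose_target` function are hypothetical, and reading "small enough" as "no longer than the 160 bp window" is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

WINDOW_BP = 160  # length of the mutational window stated in the text


@dataclass
class TssCandidate:
    position: int  # genomic coordinate of the candidate TSS (hypothetical field)
    evidence: str  # "validated", "race", or "predicted" (hypothetical labels)


def choose_target(
    candidates: List[TssCandidate],
    intergenic_length: Optional[int] = None,
) -> Tuple[str, Optional[TssCandidate]]:
    """Decide what the mutation window should cover, in the order given in the text."""
    # Steps 1 and 2: experimentally supported start sites, strongest evidence first.
    for evidence in ("validated", "race"):
        for candidate in candidates:
            if candidate.evidence == evidence:
                return ("tss", candidate)
    # Step 3: cover the whole intergenic region when it fits in the window
    # (interpreting "small enough" as <= WINDOW_BP is an assumption).
    if intergenic_length is not None and intergenic_length <= WINDOW_BP:
        return ("intergenic", None)
    # Step 4: fall back to a computationally predicted start site.
    for candidate in candidates:
        if candidate.evidence == "predicted":
            return ("tss", candidate)
    return ("none", None)
```

When several candidates share the same evidence level, this sketch simply takes the first one supplied; ranking them by activity would require expression data that the sketch does not model.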

All sequencing was carried out either by the Millard and Muriel Jacobs Genetics and Genomics Laboratory at Caltech (HiSeq 2500) on a 100 bp single-read flow cell or using the sequencing services of NGX Bio on a 250 bp or 150 bp paired-end flow cell. The total library was first sequenced by PCR amplifying the region containing the variant promoters as well as the corresponding barcodes. This allowed us to uniquely associate each random 20 bp barcode with a promoter variant. Any barcode associated with a promoter variant containing insertions or deletions was removed from further analysis. Similarly, any barcode associated with multiple promoter variants was removed from the analysis. The paired-end reads from this sequencing step were then assembled using the FLASH tool [62]. Any sequence with a PHRED score less than 20 was removed using the FastX toolkit. Additionally, when sequencing the initial library, sequences which appeared in the dataset only once were excluded from further analysis in order to remove possible sequencing errors.

For all the MPRA experiments, only the region containing the random 20 bp barcode was sequenced, since the barcode can be matched to a specific promoter variant using the initial library sequencing run described above. For a given growth condition, each promoter yielded 50,000 to 500,000 usable sequencing reads. Under some growth conditions, genes were not analyzed further if they did not have at least 50,000 reads.

To determine which base pair regions were statistically significant, a 99% confidence interval was constructed using the MCMC inference to determine the uncertainty.

Growth conditions

The growth conditions in this study were inspired by [60] and include differing carbon sources, such as growth in M9 with glucose (0.5%), M9 with acetate (0.5%), M9 with arabinose (0.5%), M9 with xylose (0.5%) and arabinose (0.5%), M9 with succinate (0.5%), M9 with fumarate (0.5%), M9 with trehalose (0.5%), and LB. In each case, cells were harvested at an OD of 0.3. These growth conditions were chosen so as to span a wide range of growth rates, as well as to illuminate any carbon-source-specific regulators.

We also used several stress conditions. For heat shock, cells were grown in M9 and subjected to a heat shock of 42 degrees for 5 minutes before harvesting RNA. For low oxygen, cells were grown in LB in a container with minimal oxygen, although some oxygen will have been present as no anaerobic chamber was used. This level of oxygen stress was still sufficient to activate FNR binding, and so activated anaerobic metabolism. We also grew cells in M9 with glucose and 5 mM sodium salicylate.

Growth with zinc was performed at a concentration of 5 mM ZnCl2, and growth with iron was performed by first growing cells to an OD of 0.3, then adding FeCl to a concentration of 5 mM and harvesting RNA after 10 minutes. Growth without cAMP was accomplished by the use of the JK10 strain, which does not maintain its cAMP levels.

All knockout experiments were performed in M9 with glucose, except for the knockouts of arcA, hdfR, and phoP, which were grown in LB.

The work presented here is effectively a third generation of the use of Sort-Seq methods for the discovery of regulatory architecture. The primary difference between the present work and previous generations [63, 64] is the use of RNA-Seq rather than fluorescence and cell sorting as a readout of the level of expression of our promoter libraries. As such, there are many important questions to be asked about the comparison between the earlier methods and this work. We attack these questions in several ways. First, as shown in Figure 1, we have performed a head-to-head comparison of the two approaches, described further in this section. Second, as shown in the next section, our list of candidate promoters included roughly 20% for which the community has some knowledge of their regulatory architecture. In these cases, we examined the extent to which our methods recover the known features of regulatory control of those promoters.

As the basis for comparing the results of the fluorescence-based Sort-Seq approach with our RNA-Seq-based approach, we use information footprints, expression shifts, and sequence logos as our metrics. Figure 1 shows examples of this comparison for four distinct genes of interest. Figure 1(A) shows the results of the two methods for the lacZYA promoter with special reference to the CRP binding site. Both the information footprint and the sequence logo identify the same binding site. Figure 1(B) provides a similar analysis for the dgoRKADT promoter, where once again the information footprints and the sequence logos from the two methods are in reasonable accord. Figure 1(C) provides a quantitative dissection of the relBE promoter, which is repressed by RelBE. Here we use both information footprints and expression shifts as a way to quantify the significance of mutations to different binding sites across the promoter. Finally, Figure 1(D) shows a comparison of the two methods for the marRAB promoter. The two approaches both identify a MarR binding site.

In total, we have tested over 20 genes for which there is already some substantial regulatory knowledge reported in the literature. The successes and failures of this test are detailed in Figure 2. For those promoters which have strong evidence of a binding site, as determined by RegulonDB [65], we recover all relevant transcription factor binding sites for 12 out of 16 cases, the majority of relevant binding sites for 2 out of 16 cases, and miss all or most of the regulation for just 2 promoters. We identify a total of 22 previously known high evidence binding sites.

These results show that our method largely agrees with the established literature, but they also highlight several areas in which our method is prone to missing regulatory elements. One failure mode is caused by the presence of strong secondary binding sites. For example, in the araC promoter, as shown in Figure 2(C), the only binding signatures that appear in the information footprint are from a secondary RNAP site. The secondary site seems to be expressed constitutively, and in the cases where the primary start site is even partially repressed, the secondary start site will dominate transcription and obscure the many binding sites that are in this promoter.

If there are large numbers of regulatory elements, the data will often only show the few most important elements. If we look at the marR promoter in Figure 2(C), we can only see the signature of the two MarR sites even though CpxR, Fis, and CRP are all known to bind to the promoter. MarR is a strong enough repressor that mutating any of the other transcription factor sites is unlikely to meaningfully change gene expression unless the MarR site is also mutated. This illustrates that the regulatory architectures discovered in this study represent a lower bound on what exists in each promoter.

Finally, for some genes such as dicA, there was no known TSS prior to the experiment.
Although there is a small regulatory region between dicA and its neighboring gene, this does not ensure that we will include the strongest RNAP sites. Better mapping of transcription start sites could improve our method.

We next consider low evidence binding sites. Other research determined the locations of the low evidence sites through gene expression analysis and sequence comparison to consensus sequences [66, 67, 68]. For 5 promoters in our list, the location of the binding site itself is not known, only that the TF in question regulates the gene. For these promoters we recover the known regulation in only 2 out of 15 cases. Comparison to consensus sequences can be unreliable and generate false positives when the entirety of the E. coli genome is considered. Gene expression analysis alone has difficulty ruling out indirect effects of a given transcription factor on gene expression, and regulation determined by this method may occur outside of the 160 bp mutation window we consider. As our results recover high evidence sites well, the poor recovery of sites based on sequence gazing and gene expression analysis most likely indicates that these methods are unreliable for determining binding locations.

[Supplementary Figure 1 image. Panels: (A) lacZYA, (B) dgoRKADT, (C) relBE, (D) marRAB, each comparing fluorescent sorting with MPRA. Axes: information (bits) and expression shift versus position. Annotated sites: CRP, RelBE, MarR, RBS. Annotated correlations: r = 0.98, r = 0.80, r = 0.78, r = 0.90.]

Supplementary Figure 1: A summary of four direct comparisons of measurements using fluorescence and sorting and using RNA-Seq. (A) CRP binds upstream of RNAP in the lacZYA promoter. Despite the different measurement techniques for the two inferred energy matrices and their corresponding sequence logos, the CRP binding sites have a Pearson correlation coefficient of r = … (B) The dgoRKADT promoter is activated by CRP in the presence of galactonate. The FACS measurements were taken in the JK10 strain in the presence of 500 mM cAMP. In both cases, a type II activator binding site can be identified based on the signals in the information footprint in the area indicated in green. Additionally, the quantitative agreement between the CRP binding preference matrices is strong, with r = … (C) The relBE promoter is repressed by RelBE. The inferred matrices between the two measurement methods have r = … (D) The marRAB promoter is repressed by MarR. The features we can observe in the information footprint reflect this under measurement with both FACS and RNA-Seq. The inferred energy matrices (data not shown) and sequence logos shown have r = …

[Supplementary Figure 2 image. Labels recoverable from the image: promoters arcA, bdcR, araAB, znuA, xylF, znuCB, xylA, dicC, marR, relBE, rspA, ftsK, ompR, lac, uvrD, and araC; binding sites FNR, CRP, Zur, DicA, MarR, AraC, XylR, LexA, IHF, RelBE, YdfH, and RNAP; footprint axis: information (bits); bar charts of the number of promoters with TFs recovered correctly, majority of TFs recovered correctly, and TFs not recovered, for high evidence and low evidence binding sites.]

Supplementary Figure 2: Reg-Seq analysis of “gold standard” promoters. (A) Information footprints for known and properly recovered binding sites. (B) A summary of how well the Reg-Seq results conform to literature results. The sites that are low evidence in the literature are determined by RegulonDB [65]. (C) The information footprint and known binding sites for the araC promoter. Despite all the binding sites present, the only binding signature that appears is for RNAP.

We note that the first aim of our methods is regulatory discovery. We would like to be able to determine how previously uncharacterized promoters are regulated, and ultimately this is a question of binding-site and transcription factor identification. For that task, we do not require perfect correspondence between the two methods. With regulatory sites identified, our next objective is the determination of energy matrices that will allow us to turn binding site strength into a tunable knob that can nearly continuously tune the strength of transcription factor binding, thus altering gene expression in predictable ways, as already shown in our earlier work [69]. The r-values between energy matrices range from 0.78 to 0.96, indicating reasonable to very good agreement. Reg-Seq appears to be, if anything, more accurate than previous methods, as it has higher relative information content in known areas of transcription factor binding and also does not have repressor-like bases on CRP sites as in Figure 1(A) and (B).
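The r-values quoted above are Pearson correlation coefficients between energy matrices inferred by the two methods. A minimal Python sketch of such a comparison is given below, assuming each matrix is stored as a list of positions with four energies (A, C, G, T) per position. Subtracting the per-position mean before correlating is an assumption made here, on the grounds that a constant added to all four bases at a position does not change relative binding preferences; the text does not state which normalization was used. The function names and the example matrices are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per position, four energies (A, C, G, T) per row


def center_positions(matrix: Matrix) -> Matrix:
    """Subtract the mean energy at each position (an assumed normalization)."""
    return [[e - sum(row) / len(row) for e in row] for row in matrix]


def pearson_r(matrix_a: Matrix, matrix_b: Matrix) -> float:
    """Pearson correlation between the flattened entries of two matrices."""
    a = [e for row in center_positions(matrix_a) for e in row]
    b = [e for row in center_positions(matrix_b) for e in row]
    if len(a) != len(b):
        raise ValueError("matrices must have the same number of entries")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant matrix")
    return cov / math.sqrt(var_a * var_b)


# Made-up two-position matrices, for illustration only (not data from this work).
sort_seq_matrix = [[0.2, -0.1, 0.5, -0.6], [1.0, 0.0, -0.5, -0.5]]
reg_seq_matrix = [[0.3, -0.2, 0.4, -0.5], [0.9, 0.1, -0.6, -0.4]]
print(round(pearson_r(sort_seq_matrix, reg_seq_matrix), 2))
```

Because the correlation is insensitive to an overall scale, two matrices that differ only in their energy units would still give r = 1 under this comparison.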

We use information footprints as a tool for hypothesis generation to identify regions which may contain transcription factor binding sites. In general, a mutation within a transcription factor site is likely to severely weaken that site. We look for groups of positions where mutation away from wild type has a large effect on gene expression. Our data sets consist of nucleotide sequences, the number of times we sequenced the construct in the plasmid library, and the number of times we sequenced its corresponding mRNA. A simplified data set on a 4 nucleotide sequence then might look like

Sequence    Library Sequencing Counts    mRNA Counts
ACTA        5                            23
ATTA        5                            3
CCTG        11                           11
TAGA        12                           3
GTGC        2                            0
CACA        8                            7
AGGC        7                            3

One possible calculation to measure the impact of a given mutation on expression is to take all sequences which have base b at position i and determine the number of mRNAs produced per read in the sequencing library. By comparing the values for different bases, we could determine how large an effect mutation has on gene expression. However, in this paper we will use mutual information to quantify the effect of mutation, as [63] demonstrated could be done successfully. In Table 1, the frequency of the different nucleotides in the library at position 2 is 40% A, 32% C, 14% G, and 14% T. Cytosine is enriched in the mRNA transcripts over the original library, as it now composes 68% of all mRNA sequencing reads, while A, G, and T compose only 20%, 6%, and 6% respectively. Large enrichment of some bases over others occurs when base identity is important for gene expression. We can quantify how important using the mutual information between base identity and gene expression level. Mutual information is given at position i by

$$I_b = \sum_{m=0}^{1} \sum_{\mu=0}^{1} p(m, \mu) \log_2 \left( \frac{p(m, \mu)}{p_{mut}(m)\, p_{expr}(\mu)} \right). \tag{5}$$

$p_{mut}(m)$ in equation 5 refers to the probability that a given sequencing read will be from a mutated base.
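To make the enrichment argument concrete, the following minimal Python sketch recomputes the base frequencies at a chosen position of the toy data set and evaluates a mutual information of the same form as equation 5. It is an illustration under stated assumptions, not the analysis code used in this work: because the toy table does not specify a wild-type sequence, the sketch uses the four-letter base identity in place of the binary wild-type/mutated variable m, and the function names (`base_counts`, `frequencies`, `mutual_information`) are hypothetical.

```python
import math
from collections import defaultdict

# Toy data from the table above: (sequence, library counts, mRNA counts).
TOY_DATA = [
    ("ACTA", 5, 23),
    ("ATTA", 5, 3),
    ("CCTG", 11, 11),
    ("TAGA", 12, 3),
    ("GTGC", 2, 0),
    ("CACA", 8, 7),
    ("AGGC", 7, 3),
]


def base_counts(data, position):
    """Return {base: [library_count, mrna_count]} at a 1-indexed position."""
    counts = defaultdict(lambda: [0, 0])
    for seq, n_lib, n_mrna in data:
        base = seq[position - 1]
        counts[base][0] += n_lib
        counts[base][1] += n_mrna
    return dict(counts)


def frequencies(counts, read_type):
    """Base frequencies among library reads (read_type=0) or mRNA reads (read_type=1)."""
    total = sum(c[read_type] for c in counts.values())
    return {base: c[read_type] / total for base, c in counts.items()}


def mutual_information(counts):
    """Mutual information (bits) between base identity and read type."""
    total = sum(sum(c) for c in counts.values())
    # Marginal probability that a read comes from the library or from mRNA.
    p_read = [sum(c[k] for c in counts.values()) / total for k in (0, 1)]
    info = 0.0
    for c in counts.values():
        p_base = sum(c) / total  # marginal probability of this base
        for k in (0, 1):
            p_joint = c[k] / total
            if p_joint > 0:  # terms with zero probability contribute nothing
                info += p_joint * math.log2(p_joint / (p_base * p_read[k]))
    return info


counts = base_counts(TOY_DATA, position=2)
print(frequencies(counts, 0))      # library frequencies at position 2
print(frequencies(counts, 1))      # mRNA frequencies at position 2
print(mutual_information(counts))  # information in bits
```

For position 2, the library and mRNA frequencies returned by `frequencies` reproduce the percentages quoted in the text (40% A and 32% C in the library, 68% C in the mRNA). Collapsing the bases into wild-type and mutated classes before calling `mutual_information` would give the two-state form of equation 5.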
$p_{expr}(\mu)$ is a normalizing factor that gives the ratio of the number of DNA or mRNA sequencing counts to total number of counts.

The mutual information quantifies how much a piece of knowledge reduces the entropy of a distribution. At a position where base identity matters little for expression level, there would be little difference in the frequency distributions for the library and mRNA transcripts. The entropy of the distribution would decrease only by a small amount when considering the two types of sequencing reads separately.

We are interested in quantifying the degree to which mutation away from a wild type sequence affects expression. Although there are obviously 4 possible nucleotides, we can classify each base as either wild-type or mutated so that $m$ in equation 5 represents only these two possibilities.

If mutations at each position are not fully independent, then the information value calculated in equation 5 will also encode the effect of mutation at correlated positions. If having a mutation at position 1 is highly favorable for gene expression and is also correlated with having a mutation at position 2, mutations at position 2 will also be enriched amongst the mRNA transcripts. Position 2 will appear to have high mutual information even if it has minimal effect on gene expression. Due to the DNA synthesis process used in library construction, mutation in one position can make mutation at other positions more likely by up to 10 percent. This is enough to cloud the signature of most transcription factors in an information footprint calculated using equation 5.

We need to determine values for $p_i(m \mid \mu)$ when mutations are independent, and to do this we need to fit these quantities from our data. We assert that

$$\langle mRNA \rangle \propto e^{-\beta E_{eff}} \tag{6}$$

is a reasonable approximation to make.
$\langle mRNA \rangle$ is the average number of mRNAs produced by that sequence for every cell containing the construct and $E_{eff}$ is an effective energy for the sequence that can be determined by summing contributions from each position in the sequence.

There are many possible underlying regulatory architectures, but to demonstrate that our approach is reasonable let us first consider the simple case where there is only a RNAP site in the studied region. We can write down an expression for average gene expression per cell as

$$\langle mRNA \rangle \propto p_{bound} \propto \frac{\frac{p}{N_{NS}} e^{-\beta E_P}}{1 + \frac{p}{N_{NS}} e^{-\beta E_P}} \tag{7}$$

where $p_{bound}$ is the probability that the RNAP is bound to DNA and is known to be proportional to gene expression in E. coli [70], $E_P$ is the energy of RNAP binding, $N_{NS}$ is the number of nonspecific DNA binding sites, and $p$ is the number of RNAP. If RNAP binds weakly then $\frac{p}{N_{NS}} e^{-\beta E_P} \ll 1$. We can simplify equation 7 to

$$\langle mRNA \rangle \propto e^{-\beta E_P}. \tag{8}$$

If we assume that the energy of RNAP binding will be a sum of contributions from each of the positions within its binding site then we can calculate the difference in gene expression between having a mutated base at position i and having a wild type base as

$$\frac{\langle mRNA_{WT_i} \rangle}{\langle mRNA_{Mut_i} \rangle} = \frac{e^{-\beta E_{P_{WT_i}}}}{e^{-\beta E_{P_{Mut_i}}}} \tag{9}$$

$$\frac{\langle mRNA_{WT_i} \rangle}{\langle mRNA_{Mut_i} \rangle} = e^{-\beta \left(E_{P_{WT_i}} - E_{P_{Mut_i}}\right)}. \tag{10}$$

In this example we are only considering a single mutation in the sequence, so we can further simplify the equation to

$$\frac{\langle mRNA_{WT_i} \rangle}{\langle mRNA_{Mut_i} \rangle} = e^{-\beta \Delta E_{P_i}}, \tag{11}$$

where $\Delta E_{P_i} = E_{P_{WT_i}} - E_{P_{Mut_i}}$.

We can now calculate the base probabilities in the expressed sequences. Here $\mu$ labels the type of sequencing read (DNA library or mRNA). If the probability of finding a wild type base at position i in the DNA library is $p_i(m = WT \mid \mu = \mathrm{DNA})$ then

$$p_i(m = WT \mid \mu = \mathrm{mRNA}) = \frac{p_i(m = WT \mid \mu = \mathrm{DNA}) \frac{\langle mRNA_{WT_i} \rangle}{\langle mRNA_{Mut_i} \rangle}}{p_i(m = Mut \mid \mu = \mathrm{DNA}) + p_i(m = WT \mid \mu = \mathrm{DNA}) \frac{\langle mRNA_{WT_i} \rangle}{\langle mRNA_{Mut_i} \rangle}} \tag{12}$$

$$p_i(m = WT \mid \mu = \mathrm{mRNA}) = \frac{p_i(m = WT \mid \mu = \mathrm{DNA})\, e^{-\beta \Delta E_{P_i}}}{p_i(m = Mut \mid \mu = \mathrm{DNA}) + p_i(m = WT \mid \mu = \mathrm{DNA})\, e^{-\beta \Delta E_{P_i}}}. \tag{13}$$

Under certain conditions, we can also infer a value for $p_i(m \mid \mu = \mathrm{mRNA})$ using a linear model when there are any number of activator or repressor binding sites. We will demonstrate this in the case of a single activator and a single repressor, although a similar analysis can be done when there are greater numbers of transcription factors. We will define $P = \frac{p}{N_{NS}} e^{-\beta E_P}$. We will also define $A = \frac{a}{N_{NS}} e^{-\beta E_A}$ where $a$ is the number of activators, and $E_A$ is the binding energy of the activator. We will finally define $R = \frac{r}{N_{NS}} e^{-\beta E_R}$ where $r$ is the number of repressors and $E_R$ is the binding energy of the repressor. We can write

$$\langle mRNA \rangle \propto p_{bound} \propto \frac{P + P A e^{-\beta \epsilon_{AP}}}{1 + A + P + R + P A e^{-\beta \epsilon_{AP}}} \tag{14}$$

where $\epsilon_{AP}$ is the interaction energy between the activator and RNAP. If activators and RNAP bind weakly but interact strongly, and repressors bind very strongly, then we can simplify equation 14. In this case $A \ll 1$, $P \ll 1$, $P A e^{-\beta \epsilon_{AP}} \gg P$, and $R \gg 1$, so that

$$\langle mRNA \rangle \propto \frac{P A e^{-\beta \epsilon_{AP}}}{R} \tag{15}$$

$$\langle mRNA \rangle \propto e^{-\beta (E_P + E_A - E_R)}. \tag{16}$$

The sign of each energy in equation 16 follows from the definitions of $P$, $A$, and $R$: the RNAP and activator energies enter through the numerator and the repressor energy through the denominator. As we typically assume that RNAP binding energy, activator binding energy, and repressor binding energy can all be represented as sums of contributions from their constituent bases, the combination of the energies can be written as a total effective energy $E_{eff}$ which is a sum of contributions from all positions within the binding sites.

We fit the parameters for each base using a Markov Chain Monte Carlo (MCMC) method. Two MCMC runs are conducted using randomly generated initial conditions. We require both chains to reach the same distribution before we consider them converged. We do not wish for mutation rate to affect the information values so we set $p(WT) = p(Mut)$.
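The weak-binding approximation and the reweighting in equation 13 can be checked numerically. The sketch below assumes hypothetical values for the ratio of RNAP copy number to nonspecific sites and for the binding energies (in units of $k_B T$); the function names are illustrative and are not taken from the analysis code.

```python
import math

def p_bound(p_over_nns, energy_kt):
    """Equation 7: RNAP occupancy; energy_kt is beta * E_P (units of k_B T)."""
    weight = p_over_nns * math.exp(-energy_kt)
    return weight / (1.0 + weight)

def expressed_wt_probability(p_wt_library, delta_energy_kt):
    """Equation 13; delta_energy_kt is beta * (E_WT - E_Mut) at one position."""
    boltzmann = math.exp(-delta_energy_kt)
    p_mut_library = 1.0 - p_wt_library
    return p_wt_library * boltzmann / (p_mut_library + p_wt_library * boltzmann)

# Hypothetical numbers, chosen only so that RNAP binds weakly.
P_OVER_NNS = 1e-3
E_WT, E_MUT = -3.0, -1.0  # binding energies in k_B T (assumed)

# Ratio of expression for wild type versus mutant: exact (equation 7)
# against the weak-binding approximation (equation 11).
exact_ratio = p_bound(P_OVER_NNS, E_WT) / p_bound(P_OVER_NNS, E_MUT)
approx_ratio = math.exp(-(E_WT - E_MUT))
print(f"exact ratio {exact_ratio:.3f}, approximate ratio {approx_ratio:.3f}")

# A wild type base present in 90% of the library becomes enriched in mRNA
# reads when the wild type binds RNAP more strongly than the mutant.
print(f"{expressed_wt_probability(0.9, E_WT - E_MUT):.4f}")
```

With these assumed values the two ratios differ by a few percent; the gap grows as the Boltzmann weight in equation 7 approaches 1, which is where the approximation stops being appropriate.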

Mass spectrometry results were processed using MaxQuant [71] [72]. Spectra were searched against the UniProt E. coli K-12 database as well as a contaminant database (256 sequences). LysC was specified as the digestion enzyme. Proteins were considered if they were known to be transcription factors, or were predicted to bind DNA (using gene ontology term GO:0003677, for DNA-binding in BioCyc).

zapAB -10 RNAP region. Each sub-sampling was performed 3 times. The results, as displayed in Figure 3, show that there is only a small effect on the resulting sequence logo until the library has been reduced to approximately 500 promoter variants.
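The sub-sampling procedure can be sketched on simulated data. Everything below is an assumption made for illustration: the 6 bp wild type sequence, the 10% mutation rate, the simulated counts, and in particular the use of a simple log-enrichment matrix as a stand-in for the energy matrices, which in this work are fit by MCMC. Only the library size (2200 variants), the three repeats per sub-sampling, and the Pearson comparison come from the text.

```python
import numpy as np

def make_library(wild_type, n_variants, mut_rate, rng):
    """Integer-coded variants (0-3 = A, C, G, T); each base mutates with prob mut_rate."""
    wt = np.array(["ACGT".index(b) for b in wild_type])
    seqs = np.tile(wt, (n_variants, 1))
    mutate = rng.random(seqs.shape) < mut_rate
    shifts = rng.integers(1, 4, size=seqs.shape)  # 1-3, so a mutated base always changes
    seqs[mutate] = (seqs[mutate] + shifts[mutate]) % 4
    return seqs

def simulate_counts(seqs, true_matrix, rng, mean_dna=10.0):
    """Simulated DNA and mRNA counts with expression proportional to exp(-energy)."""
    energy = true_matrix[np.arange(seqs.shape[1]), seqs].sum(axis=1)
    dna = rng.poisson(mean_dna, size=len(seqs)).astype(float)
    mrna = rng.poisson(dna * np.exp(-energy)).astype(float)
    return dna, mrna

def log_enrichment_matrix(seqs, dna, mrna, pseudocount=1.0):
    """Stand-in for an energy matrix: -log(mRNA frequency / library frequency)."""
    one_hot = np.eye(4)[seqs]  # shape (variants, positions, 4)
    dna_freq = np.einsum("v,vpb->pb", dna, one_hot) + pseudocount
    mrna_freq = np.einsum("v,vpb->pb", mrna, one_hot) + pseudocount
    dna_freq /= dna_freq.sum(axis=1, keepdims=True)
    mrna_freq /= mrna_freq.sum(axis=1, keepdims=True)
    return -np.log(mrna_freq / dna_freq)

def subsample_correlation(seqs, dna, mrna, n_keep, rng):
    """Pearson r between the full-data matrix and one from n_keep random variants."""
    full = log_enrichment_matrix(seqs, dna, mrna)
    keep = rng.choice(len(seqs), size=n_keep, replace=False)
    sub = log_enrichment_matrix(seqs[keep], dna[keep], mrna[keep])
    return float(np.corrcoef(full.ravel(), sub.ravel())[0, 1])

rng = np.random.default_rng(0)
WILD_TYPE = "TATAAT"  # hypothetical 6 bp site, not the measured promoter sequence
true_matrix = rng.uniform(0.5, 2.0, size=(len(WILD_TYPE), 4))
true_matrix[np.arange(len(WILD_TYPE)), ["ACGT".index(b) for b in WILD_TYPE]] = 0.0

seqs = make_library(WILD_TYPE, 2200, 0.1, rng)
dna, mrna = simulate_counts(seqs, true_matrix, rng)
for n_keep in (2000, 1000, 500, 100):
    r_values = [subsample_correlation(seqs, dna, mrna, n_keep, rng) for _ in range(3)]
    print(n_keep, np.round(r_values, 3))
```

Sampling without replacement keeps each sub-sampled library a strict subset of the full one, so the correlation reflects only the loss of variants and not resampling noise.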

Supplementary Figure 3: A comparison of RNAP -10 site sequence logos. (A) The Pearson correlation coefficient (Pearson R) between the energy matrix models inferred from the full dataset (2200 unique promoter variants) and that from a computationally restricted dataset, plotted against the number of unique promoter variants. (B) Sequence logos of the RNAP -10 region from each sub-sampled dataset.

In some cases, we used an alternative approach to mass spectrometry to discover the identity of the TF regulating a given promoter, based on sequence analysis using a motif comparison tool. TOMTOM [73] is a tool that uses a statistical method to infer whether a putative motif resembles any previously discovered motif in a database. Notably, it accounts for all possible offsets between the motifs. Moreover, it offers a suite of metrics for comparing motifs, including Kullback-Leibler divergence, Pearson correlation, and Euclidean distance, among others.

We performed comparisons of the motifs generated from our energy matrices to those generated from all known transcription factor binding sites in RegulonDB. Figure 4 shows a result of TOMTOM, where we compared the motif derived from the -35 region of the ybjX promoter and found a good match with the motif of PhoP from RegulonDB.

The information derived from this approach was then used to guide some of the TF knockout experiments, in order to validate each candidate TF's interaction with its target promoter, as indicated by the loss of the information footprint. Furthermore, we also used TOMTOM to search for similarities within our own database of motifs, in order to generate regulatory hypotheses in tandem. This was particularly useful when looking at the group of GlpR binding sites found in this experiment.
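The idea of scanning every offset between two motifs can be illustrated with a small sketch. This is not TOMTOM's algorithm, which additionally converts scores into p-values and E-values against a null model; it only shows a column-wise Pearson comparison evaluated at each alignment. The function names, the minimum overlap of 3 columns, and the random example motifs are all assumptions.

```python
import numpy as np

def column_pearson(col_a, col_b):
    """Pearson correlation between two motif columns (length-4 probability vectors)."""
    a = col_a - col_a.mean()
    b = col_b - col_b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom > 0 else 0.0

def best_offset(query, target, min_overlap=3):
    """Slide query along target and return (offset, mean column correlation).

    offset is the index of the target column aligned with query column 0;
    negative offsets mean the query overhangs the start of the target.
    """
    best = (None, -np.inf)
    for offset in range(-(len(query) - min_overlap), len(target) - min_overlap + 1):
        scores = [
            column_pearson(col, target[i + offset])
            for i, col in enumerate(query)
            if 0 <= i + offset < len(target)
        ]
        if len(scores) >= min_overlap:
            score = float(np.mean(scores))
            if score > best[1]:
                best = (offset, score)
    return best

# Hypothetical example: a random 8-column target motif and a query that is
# an exact copy of target columns 2 to 6.
rng = np.random.default_rng(1)
target = rng.dirichlet(np.ones(4), size=8)  # each row sums to 1
query = target[2:7].copy()

offset, score = best_offset(query, target)
print(offset, round(score, 3))
```

Averaging over only the overlapping columns lets partially overlapping alignments compete with full ones, which is why a minimum overlap is imposed.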

In addition to discovering new binding sites, we have discovered additional functions of known binding sites. In particular, BdcR, the repressor for the divergently transcribed gene bdcA [74], is also shown to repress bdcR in Figure 5(A). Similarly, in Figure 5(B) IlvY is shown to repress ilvC in the absence of inducer. Divergently transcribed operons that share regulatory regions are plentiful in E. coli, and although there are already many known examples of transcription factor binding sites regulating several different operons, there are almost certainly many examples of this type of transcription that have yet to be discovered.

Multi-purpose binding sites allow for more genes to be regulated with fewer binding sites. However, they can also serve to sharpen the promoter's response to environmental cues. In the case of ilvC, IlvY is known to activate ilvC in the presence of inducer. However, we now see that it also represses the promoter in the absence of that inducer. The production of ilvC is known to increase by approximately a factor of 100 in the presence of inducer [75]. The magnitude of the change is attributed to cooperative binding at two IlvY binding sites, but the lowered expression of the promoter due to IlvY repression in the absence of inducer is also a factor.

Supplementary Figure 4: Motif comparison using TOMTOM. Searching our energy motifs against the RegulonDB database using TOMTOM allowed us to guide our TF knockout experiments. Here we show the sequence logos of the PhoP transcription factor from RegulonDB (top) and the one generated from the ybjX promoter energy matrix. E-value = 0.01 using Euclidean distance as a similarity metric.

Supplementary Figure 5: Two cases in which we see transcription factor binding sites that we have found to regulate both of the two divergently transcribed genes. Information (bits) is plotted against position for (A) bdcR and (B) ilvC; the legend distinguishes mutations that decrease expression from mutations that increase expression.

Supplementary Figure 6: All regulatory cartoons for genes considered in our study.

Comparison of results to RegulonDB

Figure: percentage of promoters (y-axis) with each promoter architecture (x-axis), shown for Reg-Seq and RegulonDB.