How the Avidity of Polymerase Binding to the -35/-10 Promoter Sites Affects Gene Expression

Abstract

Although the key promoter elements necessary to drive transcription in Escherichia coli have long been understood, we still cannot predict the behavior of arbitrary novel promoters, hampering our ability to characterize the myriad sequenced regulatory architectures as well as to design novel synthetic circuits. This work builds upon a beautiful recent experiment by Urtecho et al., who measured the gene expression of over 10,000 promoters spanning all possible combinations of a small set of regulatory elements. Using these data, we demonstrate that a central claim in energy matrix models of gene expression (that each promoter element contributes independently and additively to gene expression) contradicts experimental measurements. We propose that a key missing ingredient from such models is the avidity between the -35 and -10 RNA polymerase binding sites and develop what we call a refined energy matrix model that incorporates this effect. We show that the refined energy matrix model can characterize the full suite of gene expression data and explore several applications of this framework, namely, how multivalent binding at the -35 and -10 sites can buffer RNAP kinetics against mutations and how promoters that bind overly tightly to RNA polymerase can inhibit gene expression. The success of our approach suggests that avidity represents a key physical principle governing the interaction of RNA polymerase with its promoter.


Tal Einav* and Rob Phillips*

Department of Physics, California Institute of Technology, Pasadena, CA 91125, USA
Department of Applied Physics, California Institute of Technology, Pasadena, CA 91125, USA
Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA

*Corresponding author: [email protected]; (512) 468-1994; 1200 E. California Blvd, MC 103-33, Pasadena, CA 91125
*Corresponding author: [email protected]; (626) 395-3374; 1200 E. California Blvd, MC 128-95, Pasadena, CA 91125


Significance Statement

Cellular behavior is ultimately governed by the genetic program encoded in its DNA and through the arsenal of molecular machines that actively transcribe its genes, yet we lack the ability to predict how an arbitrary DNA sequence will perform. To that end, we analyze the performance of over 10,000 regulatory sequences and develop a model that can predict the behavior of any sequence based on its composition. By considering promoters that only vary by one or two elements, we can characterize how different components interact, providing fundamental insights into the mechanisms of transcription.

Introduction

Promoters modulate the complex interplay of RNA polymerase (RNAP) and transcription factor binding that ultimately regulates gene expression. While our knowledge of the molecular players that mediate these processes constantly improves, more than half of all promoters in Escherichia coli still have no annotated transcription factors in RegulonDB (1) and our ability to design novel promoters that elicit a target level of gene expression remains limited.

As a step towards taming the vastness and complexity of sequence space, the recent development of massively parallel reporter assays has enabled entire libraries of promoter mutants to be simultaneously measured (2–4). Given this surge in experimental prowess, the time is ripe to reexamine how well our models of gene expression can extrapolate the response of a general promoter.

A common approach to quantifying gene expression, called the energy matrix model, assumes that every promoter element contributes additively and independently to the total RNAP (or transcription factor) binding energy (3). This model treats all base pairs on an equal footing and does not incorporate mechanistic details of RNAP-promoter interactions such as its strong binding primarily at the -35 and -10 binding motifs (shown in Fig. 1A). A newer method recently took the opposite viewpoint, designing an RNAP energy matrix that only includes the -35 element, -10 element, and the length of the spacer separating them (5), neglecting the sequence composition of the spacer or the surrounding promoter region.

Although these methods have been successfully used to identify important regulatory elements in unannotated promoters (6) and predict evolutionary trajectories (5), it is clear that there is more to the story. Even in the simple case of the highly-studied lac promoter, such energy matrices show systematic deviations from measured levels of gene expression, indicating that some fundamental component of transcriptional regulation is still missing (7).

We propose that one failure of current models lies in their tacit assumption that every promoter element contributes independently to the RNAP binding energy. By naturally relaxing this assumption to include the important effects of avidity, we can push beyond the traditional energy matrix analysis in several key ways including: (i) We can identify which promoter elements contribute independently or cooperatively without recourse to fitting, thereby building an unbiased mechanistic model for systems that bind at multiple sites. (ii) Applying this approach to RNAP-promoter binding reveals that the -35 and -10 motifs bind cooperatively, a feature that we attribute to avidity. Moreover, we show that models that instead assume the -35 and -10 elements contribute additively and independently sharply contradict the available data. (iii) We show that the remaining promoter elements (the spacer, UP, and background shown in Fig. 1A) do contribute independently and additively to the RNAP binding energy and formulate the corresponding model for transcriptional regulation that we call a refined energy matrix model. (iv) We use this model to explore how the interactions between the -35 and -10 elements can buffer RNAP kinetics against mutations. (v) We analyze a surprising feature of the data where overly-tight RNAP-promoter binding can lead to decreased gene expression. (vi) We validate our model by analyzing the gene expression of over 10,000 promoters in E. coli recently published by Urtecho et al. (8) and demonstrate that our framework markedly improves upon the traditional energy matrix analysis.

While this work focuses on RNAP-promoter binding, its implications extend to general regulatory architectures involving multiple tight-binding elements including transcriptional activators that make contact with RNAP (CRP in the lac promoter (9)), transcription factors that oligomerize (as recently identified for the xylE promoter (6)), and transcription factors that bind to multiple sites on the promoter (DNA looping mediated by the Lac repressor (10)). More generally, this approach of categorizing which binding elements behave independently (without resorting to fitting) can be applied to multivalent interactions in other biological contexts including novel materials, scaffolds, and synthetic switches (11, 12).

Results

The -35 and -10 Binding Sites give rise to Gene Expression that Defies Characterization as Independent and Additive Components

Decades of research have shed light upon the exquisite biomolecular details involved in bacterial transcriptional regulation via the family of RNAP σ factors (13). In this work, we restrict our attention to the σ70 holoenzyme (8), the most active form under standard E. coli growth conditions, whose interaction with a promoter includes direct contact with the -35 and -10 motifs (two hexamers centered roughly 10 and 35 bases upstream of the transcription start site), a spacer region separating these two motifs, an UP element just upstream of the -35 motif that anchors the C-terminal domain (αCTD) of RNAP, and the background promoter sequence surrounding these elements.

Urtecho et al. constructed a library of promoters composed of every combination of eight -35 motifs, eight -10 motifs, eight spacers, eight backgrounds (BG), and three UP elements (Fig. 1A) (8). Each sequence was integrated at the same locus within the E. coli genome and transcription was quantified via DNA barcoding and RNA sequencing. One of the three UP elements considered was the absence of an UP binding motif, and this case will serve as the starting point for our analysis.

The traditional energy matrix approach used by Urtecho et al. posits that every base pair of the promoter will contribute additively and independently to the RNAP binding energy (8), which by appropriately grouping base pairs is equivalent to stating that the free energy of RNAP binding will be the sum of its contributions from the background, spacer, -35, and -10 elements (see Appendix A). Hence, the gene expression (GE) is given by the Boltzmann factor

$$\text{GE} \propto e^{-\beta (E_{\text{BG}} + E_{\text{Spacer}} + E_{-35} + E_{-10})}. \tag{1}$$

Note that all $E_j$s represent free energies (with an energetic and entropic component); to see the explicit dependence on RNAP copy number, refer to Appendix A. Fitting the 32 free energies (one for each background, spacer, -35, and -10 element) and the constant of proportionality in Eq. 1 enables us to predict the expression of $8 \times 8 \times 8 \times 8 = 4{,}096$ promoters.

Fig. 2A demonstrates that Eq. 1 leads to a poor characterization of these promoters ($R^2 = 0.57$). One way to extend this model is to let gene expression saturate for very strong promoters at $r_{\max}$ and for very weak promoters at $r_0$ (caused by background noise or serendipitous near-consensus sequences (5)), namely,

$$\text{GE} = \frac{r_0 + r_{\max}\, e^{-\beta (E_{\text{BG}} + E_{\text{Spacer}} + E_{-35} + E_{-10})}}{1 + e^{-\beta (E_{\text{BG}} + E_{\text{Spacer}} + E_{-35} + E_{-10})}}. \tag{2}$$

Since Eq. 2 still assumes that each promoter element contributes additively and independently to the total RNAP binding energy, it also makes sharp predictions that markedly disagree with the data (see Appendix C). Inspired by these inconsistencies, we postulated that certain promoter elements, most likely the -35 and -10 sites, may contribute synergistically rather than independently to RNAP binding.

To that end, we consider a model for gene expression shown in Fig. 1B where RNAP can separately bind to the -35 and -10 sites. RNAP is assumed to elicit a large level of gene expression $r_{\max}$ when fully bound but the smaller level $r_0$ when unbound or partially bound. Importantly, the Boltzmann weight of the fully bound state contains the free energy $E_{\text{int}}$ representing the avidity of RNAP binding to the -35 and -10 sites. Physically, avidity arises because unbound RNAP binding to either the -35 or -10 sites gains energy but loses entropy, while this singly bound RNAP attaching at the other (-10 or -35) site again gains energy but loses much less entropy, as it was tethered in place rather than floating in solution. Hence we expect $e^{-\beta E_{\text{int}}} \gg 1$, and including this avidity term implies that RNAP no longer binds independently to the -35 and -10 sites.

Figure 1. The bivalent nature of RNAP-promoter binding. (A) Gene expression was measured for RNAP promoters comprising any combination of -35, -10, spacer, UP, and background (BG) elements (eight -35, eight -10, eight spacer, eight background, and three UP elements, giving 12,288 sequences). (B) When no UP element is present, RNAP makes contact with the promoter at the -35 and -10 sites giving rise to gene expression $r_0$ when unbound or partially bound and $r_{\max}$ when fully bound. The four RNAP states are unbound ($U$), -35 bound ($B_{-35}$), -10 bound ($B_{-10}$), and -35/-10 bound ($B_{-35,-10}$). (C) Having two binding sites alters the dynamics of RNAP binding. $k_{\text{on}}$ represents the on-rate from unbound to partially bound RNAP and $\tilde{k}_{\text{on}}$ the analogous rate from partially to fully bound RNAP, while $k_{\text{off},j}$ denotes the unbinding rate from site $j$.

Our coarse-grained model of gene expression neglects the kinetic details of transcription whereby RNAP transitions from the closed to open complex before initiating transcription. Instead, we assume that there is a separation of timescales between the fast process of RNAP binding/unbinding to the promoter and the other processes that constitute transcription. In the quasi-equilibrium framework shown in Fig. 1B, gene expression is given by the average occupancy of RNAP in each of its states, namely,

$$\text{GE} = \frac{r_0 + e^{-\beta (E_{\text{BG}} + E_{\text{Spacer}})} \left( r_0\, e^{-\beta E_{-35}} + r_0\, e^{-\beta E_{-10}} + r_{\max}\, e^{-\beta (E_{-35} + E_{-10} + E_{\text{int}})} \right)}{1 + e^{-\beta (E_{\text{BG}} + E_{\text{Spacer}})} \left( e^{-\beta E_{-35}} + e^{-\beta E_{-10}} + e^{-\beta (E_{-35} + E_{-10} + E_{\text{int}})} \right)}. \tag{3}$$

We call this expression a refined energy matrix model since it reduces to the energy matrix Eq. 1 (with constant of proportionality $r_{\max}\, e^{-\beta E_{\text{int}}}$) in the limit where gene expression is negligible when the RNAP is not bound ($r_0 \approx 0$) and the promoter is sufficiently weak or the RNAP concentration is sufficiently small that polymerase is most often in the unbound state (so that the denominator $\approx 1$).

Fig. 2B shows that the refined energy matrix model characterizes these promoters markedly better ($R^2 = 0.91$) while only requiring two more parameters ($r_0$ and $E_{\text{int}}$) than the energy matrix model Eq. 1. The sharp boundaries on the left and right represent the minimum and maximum levels of gene expression, $r_0 = 0.18$ and $r_{\max} = 8.6$, respectively (see Appendix E). The refined energy matrix predicts that the top 5% of promoters will exhibit expression levels of 7.6 (compared to 8.5 measured experimentally) while the weakest 5% of promoters should express at 0.2 (compared to the experimentally measured 0.1). In addition, this model quickly gains predictive power, as its coefficient of determination only slightly diminishes ($R^2 = 0.86$) if the model is trained on only 10% of the data and used to predict the remaining 90%.
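As a concrete illustration of how Eq. 1 and Eq. 3 differ, the following Python sketch evaluates both expressions for a single promoter. It assumes that all free energies are expressed in units of $k_BT$ (so that $\beta = 1$). The function names and the example energies are hypothetical and chosen only for illustration, while the defaults $r_0 = 0.18$ and $r_{\max} = 8.6$ are the fitted values quoted above.

```python
import math


def energy_matrix_ge(e_bg, e_spacer, e_35, e_10, scale=1.0):
    """Eq. 1: expression proportional to the Boltzmann factor of the summed free energies."""
    return scale * math.exp(-(e_bg + e_spacer + e_35 + e_10))


def refined_ge(e_bg, e_spacer, e_35, e_10, e_int, r0=0.18, r_max=8.6):
    """Eq. 3: expression averaged over the unbound, singly bound, and fully bound states."""
    shared = math.exp(-(e_bg + e_spacer))      # factor common to every bound state
    w_35 = math.exp(-e_35)                     # RNAP bound only at the -35 site
    w_10 = math.exp(-e_10)                     # RNAP bound only at the -10 site
    w_both = math.exp(-(e_35 + e_10 + e_int))  # fully bound, including the avidity term
    numerator = r0 + shared * (r0 * w_35 + r0 * w_10 + r_max * w_both)
    denominator = 1.0 + shared * (w_35 + w_10 + w_both)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical energies (k_B T): scan the -10 strength at a fixed -35 strength.
    for e_10 in (6.0, 2.0, -2.0, -6.0):
        ge = refined_ge(e_bg=0.0, e_spacer=0.0, e_35=-1.0, e_10=e_10, e_int=-4.0)
        print(f"E_-10 = {e_10:+.1f} k_BT -> GE = {ge:.2f}")
```

Because Eq. 3 is a weighted average of $r_0$ and $r_{\max}$, the value returned by `refined_ge` always lies between those two levels, whereas `energy_matrix_ge` grows without bound as the energies become more negative. Setting `r0=0` and choosing a weak promoter recovers Eq. 1 with `scale = r_max * exp(-e_int)`, which is the limit described in the text.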

Epistasis-Free Models of Gene Expression Lead to Sharp Predictions that Disagree with the Data

To further validate that the lower coefficient of determination of the energy matrix approach (Eq. 1) was not an artifact of the fitting, we can utilize the epistasis-free nature of this model to predict the gene expression of double mutants from that of single mutants. More precisely, denote by $\text{GE}^{(0,0)}$ the gene expression of a promoter with the consensus -35 and -10 sequences (and any background or spacer sequence). Let $\text{GE}^{(1,0)}$, $\text{GE}^{(0,1)}$, and $\text{GE}^{(1,1)}$ represent promoters (with this same background and spacer) whose -35/-10 sequences are mutated/consensus, consensus/mutated, and mutated/mutated, respectively, where “mutated” stands for any non-consensus sequence. As derived in Appendix D, the gene expression of these three latter sequences can predict the gene expression of the promoter with the consensus -35 and -10 without recourse to fitting, namely,

$$\text{GE}^{(0,0)} = \text{GE}^{(1,1)}\, \frac{\text{GE}^{(0,1)}}{\text{GE}^{(1,1)}}\, \frac{\text{GE}^{(1,0)}}{\text{GE}^{(1,1)}}. \tag{4}$$

The inset in Fig. 2A compares the epistasis-free predictions ($x$-axis, right-hand side of Eq. 4) with the measured gene expression ($y$-axis, left-hand side of Eq. 4). These results demonstrate that the simple energy matrix formulation fails to capture the interaction between the -35 and -10 binding sites. While this calculation cannot readily generalize to the refined energy matrix model since it exhibits epistasis, it is analytically tractable for weak promoters where the refined energy matrix model displays a marked improvement over the traditional energy matrix model (see Appendix C).

Figure 2. Gene expression of promoters with no UP element. Model predictions (predicted versus measured gene expression) using (A) an energy matrix (Eq. 1) where the -35 and -10 elements independently contribute to RNAP binding and (B) a refined energy matrix (Eq. 3) where the two sites contribute cooperatively. Inset: The epistasis-free nature of the energy matrix model makes sharp predictions about the gene expression of the consensus -35 and -10 sequences that markedly disagree with the data. Parameter values given in Appendix B.

RNAP Binding to the UP Element occurs Independently of the Other Promoter Elements

Having seen that the refined energy matrix model (Eq. 3) can outperform the traditional energy matrix analysis on promoters with no UP element, we next extend the former model to promoters containing an UP element. Given the importance of the RNAP interactions with the -35 and -10 sites seen above, Fig. 3A shows three possible mechanisms for how the UP element could mediate RNAP binding. For example, the C-terminal could bind strongly and independently so that RNAP has three distinct binding sites. Another possibility is that the RNAP αCTD binds if and only if the -35 binding site is bound. A third alternative is that the UP element contributes additively and independently to RNAP binding (analogous to the spacer and background).

To distinguish between these possibilities, we analyze the correlations in gene expression between every pair of promoter elements (UP and -35, spacer and background, etc.) to determine the strength of their interaction. Each model in Fig. 3A will have a different signature: The top schematic predicts strong interactions between the -35 and -10, between the UP and -35, and between the UP and -10; the middle schematic would give rise to strong dependence between the -35 and -10 as well as between the UP and -10, while the UP and -35 elements would be perfectly correlated; the bottom schematic suggests that the UP elements will contribute independently of the other promoter elements.

This analysis, which we relegate to Appendix D, demonstrates that the UP element is approximately independent of all other promoter elements ($R^2 \gtrsim 0.6$) as are the background and spacer, indicating that the bottom schematic in Fig. 3A characterizes the binding of the UP element. This leads us to the general form of transcriptional regulation by RNAP, shown in Eq. 5.

$$\text{GE} = \frac{r_0 + e^{-\beta (E_{\text{BG}} + E_{\text{Spacer}} + E_{\text{UP}})} \left( r_0\, e^{-\beta E_{-35}} + r_0\, e^{-\beta E_{-10}} + r_{\max}\, e^{-\beta (E_{-35} + E_{-10} + E_{\text{int}})} \right)}{1 + e^{-\beta (E_{\text{BG}} + E_{\text{Spacer}} + E_{\text{UP}})} \left( e^{-\beta E_{-35}} + e^{-\beta E_{-10}} + e^{-\beta (E_{-35} + E_{-10} + E_{\text{int}})} \right)} \tag{5}$$

Fig. 3B demonstrates how the expression of all promoters containing one of the two UP elements combined with each of the eight background, spacer, -35, and -10 sequences ($2 \times 8^4 = 8{,}192$ promoters) closely matches the model predictions.

Figure 3. The interaction between RNAP and the UP element. (A) Possible mechanisms by which the RNAP C-terminal can bind to the UP element (orange segments represent strong binding comparable to the -35 and -10 motifs; gray segments represent weak binding comparable to the spacer and background). The schematics are described as: UP binds strongly and dependently with -35/-10; UP binds strongly but only with the -35; UP binds weakly and independently. The data supports the bottom schematic (see Appendix D). (B) The corresponding characterization (predicted versus measured gene expression) of 8,192 promoters identical to those shown in Fig. 2 but with one of two UP binding motifs. Red points represent promoters with a consensus -35 and -10. Data was fit using the same parameters as in Fig. 2B and fitting the binding energies of the two UP elements (parameter values in Appendix B).

Sufficiently Strong RNAP-Promoter Binding Energy can Decrease Gene Expression

Although the 12,288 promoters considered above are well characterized by Eq. 5 on average, the data demonstrate that the full mechanistic picture is more nuanced. For example, Urtecho et al. found that gene expression (averaged over all backgrounds and spacers) generally increases for -35/-10 elements closer to the consensus sequences (8). In terms of the gene expression models studied above (Eqs. 1-3), promoters with fewer -35/-10 mutations have more negative free energies $E_{-35}$ and $E_{-10}$ leading to larger expression. Yet the strongest promoters with the consensus -35/-10 violated this trend, exhibiting less expression than promoters one mutation away. Thus, Urtecho et al. postulated that past a certain point, promoters that bind RNAP too tightly may inhibit transcription initiation and lead to decreased gene expression.

Figure 4. Gene expression is reduced when RNAP binds a promoter too tightly. Measured gene expression versus the inferred promoter strength $\Delta E_{\text{RNAP}}$ (in units of $k_BT$) relative to the transcription initiation state $\Delta E_{\text{trans}}$ (stronger promoters on the right). The dashed line shows the prediction of the refined energy matrix model.

The promoters with a consensus -35/-10 are shown as red points in Fig. 3B, and indeed these promoters are all predicted to bind tightly to RNAP and hence express at the maximum level $r_{\max} = 8.6$. To account for their reduced expression, we posit a transcription initiation state with free energy $\Delta E_{\text{trans}}$ relative to unbound RNAP that competes with the free energy $\Delta E_{\text{RNAP}}$ between fully bound and unbound RNAP (see Appendix E).

Assuming the rate of transcription initiation is proportional to the relative Boltzmann weights of these two states, the level of gene expression $r_{\max}$ in Eq. 5 will be modified to

$$\frac{r_{\max} + r_0\, e^{-\beta (\Delta E_{\text{RNAP}} - \Delta E_{\text{trans}})}}{1 + e^{-\beta (\Delta E_{\text{RNAP}} - \Delta E_{\text{trans}})}}. \tag{6}$$

As expected, this expression reduces to $r_{\max}$ for promoters that weakly bind RNAP ($e^{-\beta (\Delta E_{\text{RNAP}} - \Delta E_{\text{trans}})} \ll 1$) but decreases for strong promoters until it reaches the background level $r_0$ when the promoter binds so tightly that RNAP is glued in place and unable to initiate transcription. Upon reanalyzing the gene expression data with the inferred value of $\Delta E_{\text{trans}}$ (see Appendix E), we can plot the measured level of gene expression against the predicted RNAP-promoter free energy $\Delta E_{\text{RNAP}}$ as shown in Fig. 4 (stronger promoters to the right). We find that this revised model captures the downwards trend in gene expression observed for the strongest promoters, most of which contain a consensus -35/-10.
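The two limits of Eq. 6 can be checked numerically with a short Python sketch. Energies are assumed to be in units of $k_BT$ (so that $\beta = 1$), and the function name is hypothetical. The value of $\Delta E_{\text{trans}}$ used in the example loop is a placeholder chosen for illustration, not the value inferred in Appendix E.

```python
import math


def effective_r_max(delta_e_rnap, delta_e_trans, r0=0.18, r_max=8.6):
    """Eq. 6: expression level of the fully bound state when overly tight binding is penalized."""
    exponent = -(delta_e_rnap - delta_e_trans)
    if exponent > 700.0:
        # math.exp would overflow here; in this limit Eq. 6 has converged to r0.
        return r0
    x = math.exp(exponent)
    return (r_max + r0 * x) / (1.0 + x)


if __name__ == "__main__":
    delta_e_trans = -6.0  # placeholder in k_B T, assumed purely for illustration
    for delta_e_rnap in (0.0, -3.0, -6.0, -9.0, -12.0):
        level = effective_r_max(delta_e_rnap, delta_e_trans)
        print(f"dE_RNAP = {delta_e_rnap:+5.1f} k_BT -> level = {level:.2f}")
```

When $\Delta E_{\text{RNAP}} = \Delta E_{\text{trans}}$ the two Boltzmann weights are equal and the function returns the midpoint $(r_0 + r_{\max})/2$; more negative $\Delta E_{\text{RNAP}}$ drives the result toward $r_0$.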

The Bivalent Binding of RNAP Buffers its Binding Behavior against Promoter Mutations

In this final section, we investigate how the avidity between the -35 and -10 sites changes the dynamics of RNAP binding. More specifically, we consider the effective dissociation constant governing RNAP binding when both the -35 and -10 sites are intact and compare it to the case where only one site is capable of binding. To simplify this discussion, we focus exclusively on the case of RNAP binding to the -35 and -10 motifs as shown in the rates diagram Fig. 1C, absorbing the effects of the background, spacer, and UP elements into these rates.

Figure 5. The dissociation between RNAP and the promoter. Effective dissociation constant $K_D^{\text{eff}}$ (M) versus $K_{-10}$ (M) for RNAP binding to a promoter with a strong (solid lines, $K_{-35} = 1\ \mu\text{M}$) or weak (dashed, $K_{-35} \to \infty$) -35 sequence. $c$ represents the local concentration of singly bound RNAP.

At equilibrium, there is no flux between the four RNAP states. We define the effective dissociation constant

$$K_D^{\text{eff}} = \frac{K_{-35}\, K_{-10}}{c + K_{-35} + K_{-10}} \tag{7}$$

which represents the concentration of RNAP at which there is a 50% likelihood that the promoter is bound (see Appendix F). $K_j = k_{\text{off},j} / k_{\text{on}}$ stands for the dissociation constant of free RNAP binding to the site $j$ and $c = \tilde{k}_{\text{on}} / k_{\text{on}} = [\text{RNAP}]\, e^{-\beta E_{\text{int}}}$ represents the increased local concentration of singly bound RNAP transitioning to the fully bound state (i.e., $E_{\text{int}}$ and $c$ are the embodiments of avidity in the language of statistical mechanics and thermodynamics, respectively). Note that $K_D^{\text{eff}}$ is a sigmoidal function of $K_{-10}$ with height $K_{-35}$ and midpoint at $K_{-10} = c + K_{-35}$.

Fig. 5 demonstrates how the effective RNAP dissociation constant $K_D^{\text{eff}}$ changes when mutations to the -10 binding motif alter its dissociation constant $K_{-10}$. When the -35 sequence is weak (dashed lines, $k_{\text{off},-35} \to \infty$), $K_D^{\text{eff}} \approx K_{-10}$, signifying that RNAP binding relies solely on the strength of the -10 site. In the opposite limit where RNAP tightly binds to the -35 sequence (solid lines), the cooperativity $c$ and the dissociation constant $K_{-35}$ shift the curve horizontally and bound the effective dissociation constant to $K_D^{\text{eff}} \le K_{-35}$. This upper bound may buffer promoters against mutations, since achieving a larger effective dissociation constant would require not only wiping out the -35 site but in addition mutating the -10 site.
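The limits of Eq. 7 discussed above can be verified with a minimal Python sketch. The function name is hypothetical, and the value of $c$ in the example loop is an assumed placeholder; only $K_{-35} = 1\ \mu\text{M}$ is taken from the Fig. 5 caption. The code uses the reciprocal form $1/K_D^{\text{eff}} = 1/K_{-35} + 1/K_{-10} + c/(K_{-35} K_{-10})$, which is algebraically identical to Eq. 7 but remains well defined when one dissociation constant is infinite.

```python
import math


def effective_kd(k_35, k_10, c):
    """Eq. 7 in reciprocal form; all arguments are concentrations in molar units.

    Passing math.inf for one dissociation constant models a site that cannot bind.
    """
    inverse = 1.0 / k_35 + 1.0 / k_10 + c / (k_35 * k_10)
    return 1.0 / inverse


if __name__ == "__main__":
    k_35 = 1e-6  # strong -35 site, 1 uM as in the Fig. 5 caption
    c = 1e-4     # assumed local concentration (M), for illustration only
    for k_10 in (1e-8, 1e-6, 1e-4, 1e-2, 1.0):
        strong = effective_kd(k_35, k_10, c)
        weak = effective_kd(math.inf, k_10, c)
        print(f"K_-10 = {k_10:.0e} M: strong -35 -> {strong:.2e} M, weak -35 -> {weak:.2e} M")
```

With a weak -35 site the function simply returns $K_{-10}$, whereas with a strong -35 site the output never exceeds $K_{-35}$ and equals $K_{-35}/2$ at $K_{-10} = c + K_{-35}$.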

Finally, in the case where the cooperativity $c$ is large, $K_D^{\text{eff}} \approx K_{-10} K_{-35} / c$, indicating that as soon as one site of the RNAP binds, the other is very likely to also bind, thereby giving rise to the multiplicative dependence on the two $K_D$s.

To get a sense for how these numbers translate into physiological RNAP dwell times on the promoter, we note that the lifetime of bound RNAP is given by $\tau = 1 / (K_D^{\text{eff}}\, k_{\text{on}})$ (see Appendix F). Using the $K_D^{\text{eff}}$ of the strong T7 promoter (14) and assuming a diffusion-limited on-rate leads to a dwell time of 10 s, comparable to the measured dwell time of RNAP-promoter in the closed complex (15). It would be fascinating if recently developed methods that visualize real-time single-RNAP binding events probed the dwell time of the promoters constructed by Urtecho et al. to see how well the predictions of the refined energy matrix model match experimental measurements (15).

Discussion

While high-throughput methods have enabled us to measure the gene expression of tens of thousands of promoters, they nevertheless only scratch the surface of the full sequence space. A typical promoter composed of 200 bp has $4^{200}$ variants (more than the number of atoms in the universe). Nevertheless, by understanding the principles governing transcriptional regulation, we can begin to cut away at this daunting complexity to design better promoters.

In this work, we analyzed a recent experiment by Urtecho et al. measuring gene expression of over 10,000 promoters in E. coli using the σ70 RNAP holoenzyme (8). These sequences comprised all combinations of a small set of promoter elements, namely, eight -10s, eight -35s, eight spacers, eight backgrounds, and three UPs depicted in Fig. 1A, providing an opportunity to deepen our understanding of how these elements interact and to compare different quantitative models of gene expression.

We first analyzed this data using classic energy matrix models which posit that each promoter element contributes independently to the RNAP-promoter binding energy. As emphasized by Urtecho et al. and other groups, such energy matrices poorly characterize gene expression (Fig. 2A, $R^2 = 0.57$) and offer testable predictions that do not match the data (Appendix C), mandating the need for other approaches (7, 8).

To meet this challenge, we first determined which promoter elements contribute independently to RNAP binding (Appendix D). This process, which was done without recourse to fitting, demonstrated that the -35 and -10 elements bind in a concerted manner that we postulated is caused by avidity. In this context, avidity implies that when RNAP is singly bound to either the -35 or -10 sites, it is much more likely (compared to unbound RNAP) to bind to the other site, similar to the boost in binding seen in bivalent antibodies (16) or multivalent systems (12, 17, 18). Surprisingly, we found that outside the -35/-10 pair, the other components of the promoter contributed independently to RNAP binding.

Using these findings, we developed a refined energy matrix model of gene expression (Eq. 5) that incorporates the avidity between the -35/-10 sites as well as the independence of the UP/spacer/background interactions. This model was able to characterize the 4,096 promoters with no UP element (Fig. 2B, $R^2 = 0.91$) and the 8,192 promoters containing an UP element (Fig. 3B) while requiring only two parameters beyond those of the traditional energy matrix model (namely, the interaction energy $E_{\text{int}}$ arising from the -35/-10 avidity and the level of gene expression $r_0$ of a promoter with a scrambled -10 motif, a scrambled -35 motif, or with both motifs scrambled).

These promising findings suggest that determining which components bind independently is crucial to properly characterize multivalent systems. It would be fascinating to extend this study to RNAP with other σ factors (13) as well as to RNAP mutants with no αCTD or that do not bind at the -35 site (19, 20). Our model would predict that polymerases in this last category with at most one strong binding site should conform to a traditional energy matrix approach.

Quantitative frameworks such as the refined energy matrix model explored here can deepen our understanding of the underlying mechanisms governing a system’s behavior. For example, while searching for systematic discrepancies between our model prediction and the gene expression measurements, we found that promoters predicted to have the strongest RNAP affinity did not exhibit the largest levels of gene expression (thus violating a core assumption of nearly all models of gene expression that we know of). This led us to posit a characteristic energy for transcription initiation that reduces the expression of overly strong promoters (Fig. 4). In addition, we explored how having separate binding sites at the -35 and -10 elements buffers RNAP kinetics against mutations; for example, no single mutation can completely eliminate gene expression of a strong promoter with the consensus -35 and -10 sequence, since at least one mutation in both the -35 and -10 motifs would be needed (Fig. 5).

Finally, we end by zooming out from the particular context of transcription regulation and note that multivalent interactions are prevalent in all fields of biology (21), and our work suggests that differentiating between independent and dependent interactions may be key not only to characterizing overall binding affinities but also to understanding the dynamics of a system (22). Such formulations may be essential when dissecting the much more complicated interactions in eukaryotic transcription where large complexes bind at multiple DNA loci (23, 24) and more broadly in multivalent scaffolds and materials (11, 12).

Methods

We trained both the standard and refined energy matrix models on 75% of the data and characterized the predictive power on the remaining 25%, repeating the procedure 10 times. The coefficient of determination $R^2$ was calculated for $y_{\text{data}} = \log(\text{gene expression})$ to prevent the largest gene expression values from dominating the result. The supplementary Mathematica notebook contains the data analyzed in this work and can recreate all plots.

Acknowledgements

We thank Suzy Beeler, Vahe Galstyan, Peng (Brian) He, and Zofii Kaczmarek for helpful discussions. This work was supported by the Rosen Center at Caltech and the National Institutes of Health through 1R35 GM118043-01 (MIRA).

References

1. Gama-Castro, S., H. Salgado, A. Santos-Zavaleta, D. Ledezma-Tejeida, L. Muniz-Rascado, J. S. Garcia-Sotelo, K. Alquicira-Hernandez, I. Martinez-Flores, L. Pannier, J. A. Castro-Mondragon, A. Medina-Rivera, H. Solano-Lira, C. Bonavides-Martinez, E. Perez-Rueda, S. Alquicira-Hernandez, L. Porron-Sotelo, A. Lopez-Fuentes, A. Hernandez-Koutoucheva, V. Del Moral-Chavez, F. Rinaldi, and J. Collado-Vides. 2016. RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic acids research. 44:D133–D143.
2. Patwardhan, R. P., C. Lee, O. Litvin, D. L. Young, D. Pe’er, and J. Shendure. 2009. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nature Biotechnology. 27:1173–1175.
3. Kinney, J. B., A. Murugan, C. G. Callan, and E. C. Cox. 2010. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proceedings of the National Academy of Sciences. 107:9158–9163.
4. Inoue, F., and N. Ahituv. 2015. Decoding enhancers using massively parallel reporter assays. Genomics. 106:159–164.
5. Yona, A. H., E. J. Alm, and J. Gore. 2018. Random sequences rapidly evolve into de novo promoters. Nature Communications. 9:1530.
6. Belliveau, N. M., S. L. Barnes, W. T. Ireland, D. L. Jones, M. J. Sweredoski, A. Moradian, S. Hess, J. B. Kinney, and R. Phillips. 2018. Systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria. Proceedings of the National Academy of Sciences.
7. Forcier, T., A. Ayaz, M. Gill, and J. B. Kinney. 2018. Precision measurements of regulatory energetics in living cells. bioRxiv.
8. Urtecho, G., A. D. Tripp, K. Insigne, H. Kim, and S. Kosuri. 2018. Systematic Dissection of Sequence Elements Controlling σ70 Promoters Using a Genomically-Encoded Multiplexed Reporter Assay in E. coli. Biochemistry:acs.biochem.7b01069.
9. Kuhlman, T., Z. Zhang, M. H. Saier, and T. Hwa. 2007. Combinatorial transcriptional control of the lactose operon of Escherichia coli. Proceedings of the National Academy of Sciences of the United States of America. 104:6043–6048.
10. Boedicker, J. Q., H. G. Garcia, S. Johnson, and R. Phillips. 2013. DNA sequence-dependent mechanics and protein-assisted bending in repressor-mediated loop formation. Physical Biology. 10:066005.
11. Varner, C. T., T. Rosen, J. T. Martin, and R. S. Kane. 2015. Recent Advances in Engineering Polyvalent Biological Interactions. Biomacromolecules. 16:43–55.
12. Yan, G.-H., K. Wang, Z. Shao, L. Luo, Z.-M. Song, J. Chen, R. Jin, X. Deng, H. Wang, Z. Cao, Y. Liu, and A. Cao. 2018. Artificial antibody created by conformational reconstruction of the complementary-determining region on gold nanoparticles. Proceedings of the National Academy of Sciences. 115:E34–E43.
13. Feklístov, A., B. D. Sharon, S. A. Darst, and C. A. Gross. 2014. Bacterial Sigma Factors: A Historical, Structural, and Genomic Perspective. Annual Review of Microbiology. 68:357–376.
14. Dayton, C. J., D. E. Prosen, K. L. Parker, and C. L. Cech. 1984. Kinetic measurements of Escherichia coli RNA polymerase association with bacteriophage T7 early promoters. The Journal of biological chemistry. 259:1616–21.
15. Wang, F., S. Redding, I. J. Finkelstein, J. Gorman, D. R. Reichman, and E. C. Greene. 2013. The promoter-search mechanism of Escherichia coli RNA polymerase is dominated by three-dimensional diffusion. Nature Structural and Molecular Biology.
16. Klein, J. S., and P. J. Bjorkman. 2010. Few and Far Between: How HIV May Be Evading Antibody Avidity. PLoS Pathogens. 6:e1000908.
17. Banjade, S., and M. K. Rosen. 2014. Phase transitions of multivalent proteins can promote clustering of membrane receptors. eLife. 3.
18. Huang, J., X. Zeng, N. Sigal, P. J. Lund, L. F. Su, H. Huang, Y.-h. Chien, and M. M. Davis. 2016. Detection, phenotyping, and quantification of antigen-specific T cells using a peptide-MHC dodecamer. Proceedings of the National Academy of Sciences of the United States of America. 113:E1890–7.
19. Kumar, A., R. A. Malloch, N. Fujita, D. A. Smillie, A. Ishihama, and R. S. Hayward. 1993. The Minus 35-Recognition Region of Escherichia coli Sigma 70 is Inessential for Initiation of Transcription at an Extended Minus 10 Promoter. Journal of Molecular Biology. 232:406–418.
20. Minakhin, L., and K. Severinov. 2003. On the role of the Escherichia coli RNA polymerase sigma 70 region 4.2 and alpha-subunit C-terminal domains in promoter complex formation on the extended -10 galP1 promoter. The Journal of biological chemistry. 278:29710–8.
21. Gao, A., K. Shrinivas, P. Lepeudry, H. I. Suzuki, P. A. Sharp, and A. K. Chakraborty. 2018. Evolution of weak cooperative interactions for biological specificity. Proceedings of the National Academy of Sciences of the United States of America:201815912.
22. Stone, J. D., M. N. Artyomov, A. S. Chervin, A. K. Chakraborty, H. N. Eisen, and D. M. Kranz. 2011. Interaction of streptavidin-based peptide-MHC oligomers (tetramers) with cell-surface TCRs. Journal of immunology. 187:6281–6290.
23. Goardon, N., J. A. Lambert, P. Rodriguez, P. Nissaire, S. Herblot, P. Thibault, D. Dumenil, J. Strouboulis, P.-H. Romeo, and T. Hoang. 2006. ETO2 coordinates cellular proliferation and differentiation during erythropoiesis. The EMBO journal. 25:357–66.
24. Levine, M., C. Cattoglio, and R. Tjian. 2014. Looping Back to Leap Forward: Transcription Enters a New Era.