Exact solution of stochastic gene expression models with bursting, cell cycle and replication dynamics
EExact solution of stochastic gene expression modelswith bursting, cell cycle and replication dynamics
Casper H. L. Beentjes , Ruben Perez-Carrasco , and Ramon Grima Mathematical Institute, University of Oxford, UK Department of Mathematics, University College London, UK School of Biological Sciences, University of Edinburgh, UKFebruary 17, 2020
Abstract
The bulk of stochastic gene expression models in the literature do not have an explicit descrip-tion of the age of a cell within a generation and hence they cannot capture events such as celldivision and DNA replication. Instead, many models incorporate cell cycle implicitly by assumingthat dilution due to cell division can be described by an effective decay reaction with first-order ki-netics. If it is further assumed that protein production occurs in bursts then the stationary proteindistribution is a negative binomial. Here we seek to understand how accurate these implicit modelsare when compared with more detailed models of stochastic gene expression. We derive the exactstationary solution of the chemical master equation describing bursty protein dynamics, binomialpartitioning at mitosis, age-dependent transcription dynamics including replication, and randominterdivision times sampled from Erlang or more general distributions; the solution is differentfor single lineage and population snapshot settings. We show that protein distributions are wellapproximated by the solution of implicit models (a negative binomial) when the mean numberof mRNAs produced per cycle is low and the cell cycle length variability is large. When theseconditions are not met, the distributions are either almost bimodal or else display very flat regionsnear the mode and cannot be described by implicit models. We also show that for genes with lowtranscription rates, the size of protein noise has a strong dependence on the replication time, it isalmost independent of cell cycle variability for lineage measurements and increases with cell cyclevariability for population snapshot measurements. In contrast for large transcription rates, thesize of protein noise is independent of replication time and increases with cell cycle variability forboth lineage and population measurements.
It is well known that gene expression is stochastic [1]. The randomness in the time at which eachreaction occurs leads to fluctuations in the molecule number of gene products such as mRNA andproteins. Hence over the past two decades there has been considerable effort devoted to constructingand solving stochastic models of gene expression [2, 3]. The exact solution of the chemical masterequation (CME) describing the standard models of stochastic gene expression is currently unknownexcept in certain limiting cases such as when mRNA degrades much faster than protein [4].The majority of gene expression models in the literature do not have a description of cellular ageand hence do not explicitly describe the cell cycle [4, 5, 6, 7, 8, 9, 10, 11, 12]. Rather it is assumed,following [13, 14], that protein dilution effects due to cell division can be implicitly included via aneffective first-order decay reaction. The rate of this reaction is chosen such that the half-life of proteinnumbers corresponds to the mean cell cycle length. This approximation is thought to be reasonable1 a r X i v : . [ q - b i o . S C ] F e b ince active protein degradation timescales are considerably longer than the cell cycle time [15, 16] andhence dilution occurring during cell division is the dominant means of protein removal.Since these effective models do not have a description of the cell age within a cell cycle, they alsocannot take into account events which happen at specific points during the cycle, e.g. the replication ofthe genome which leads to an increase of the transcription rate. The main advantage of these models isthe relative ease with which they can be analytically solved, approximated and simulated. In particular,the chemical master equation of the most commonly used model of this type, which describes proteinsproduced in bursts whose size is sampled from the geometric distribution and protein decay via aneffective first-order reaction (modelling dilution as described above), can be solved exactly leading toa negative binomial distribution (or a gamma distribution, its continuous analog) of protein numbers[4, 14].In contrast to this implicit model, more sophisticated models have been developed during the pastfew years that include an explicit description of the cell cycle. Curiously, results for the case of periodiccell division were first obtained by Berg [17], about twenty years before the explosion of interest instochastic gene expression [1]. More recently, Johnston and Jones obtained the distribution of proteinnumbers assuming non-bursty production, binomial partitioning at cell division and regularly spaced(periodic) cell division events [18]. Since experimental data clearly shows that the time betweentwo successive cell division effects is a random variable [19, 20], models were also devised to studyhow dynamics are influenced by this extra source of randomness. Antunes and Singh [21] obtainedthe moments of mRNA and protein numbers in a simplified model of gene expression which ignoresintrinsic noise (due to the stochastic birth-death of individual molecules) and that due to binomialpartitioning but takes into account noise stemming from the random timing of cell division events.Soltani et al. [22, 23] obtained the mean and variance of protein numbers in a considerably more detailedmodel for stable protein (one that is degraded only by dilution) that includes intrinsic noise, stochasticpartitioning of molecules at cell division, a cell cycle that is divided in a number of phases whoseduration is exponentially distributed and also can include replication. In these studies the formulaeare obtained assuming single lineage measurements, i.e. upon cell division, one of the daughter cells isfollowed (the other discarded) such that one obtains information about the stochastic dynamics of acell’s protein contents along an arbitrarily chosen lineage (measurements done using a mother machinesuch as in Ref. [24]). The two major disadvantages of the latter two papers are that they do not deriveresults for the protein number distributions and also they do not calculate statistics in a growingpopulation of cells, the most common experimental scenario (measurements done using flow cytometrysuch as [25]). Jędrak et al. [26] derived an explicit expression for the protein distribution solution ofa stochastic model where protein fluctuations are treated continuously, there are no gene duplicationeffects, and where the cell cycle is assumed to be exponentially distributed. These results analyzethe behaviour of a single lineage (as previous papers), and also for a whole proliferating population,i.e. both daughter cells are followed upon cell division such that one obtains information about thestochastic dynamics of a cell’s protein contents across a growing population. Note that lineage andpopulation statistics are not generally equivalent, unlike what one may assume based on the ergodichypothesis [27]. The major limitations of the model in Ref. [26] are the large protein approximationimplicit in the continuous approximation, the lack of DNA replication and the assumption that cellcycle duration is exponentially distributed which is contrary to experimental evidence that revealsdistributions comparable to Erlang, gamma or lognormal distributions or variations thereof [20, 28].Given these two different approaches including implicit and explicit description of the cell cycle,a question arises: How well can the negative binomial distribution of implicit models describe theprotein distribution of more detailed models of gene expression? This question remains unansweredbecause as discussed above, none of the current literature derives the protein distribution in a modelthat explicitly includes dilution due to stochastic partitioning of molecules at cell division, randominterdivision times, and age-dependent transcription. In this paper we answer this question by derivingexpressions for the distributions of proteins in models that incorporate explicit descriptions of the cellcycle. For the sake of clarity, instead of starting from the most general model, our presentationconsiders a set of simpler models which gradually build up to it. The three models that we study, in2rder of complexity, have the following properties: (i) no replication and a cell cycle of fixed duration;(ii) no replication and an Erlang distributed cell cycle duration; (iii) age-dependent transcriptionaldynamics including replication and a cell cycle described by a number of phases each of which has anexponentially distributed duration (hypoexponential distribution). All the models consider proteinsthat are produced in bursts [29], degraded only via dilution (stable proteins [14]) and assume binomialpartitioning of proteins at cell division [17]. We study the relationships between the solutions ofall models for both single lineage and population snapshot statistics, and identify conditions underwhich the protein distributions can be well approximated by the negative binomial solution of theconventional model of gene expression with an implicit cell cycle description. It is well known that under the assumptions that mRNA degrades much faster than protein and thatpromoter switching is also much faster than protein decay, the stochastic dynamics of protein P canbe effectively described by the reaction scheme [4] G r −→ G + mP, P d −→ ∅ , (1)where G denotes a single gene copy, r is the effective burst production rate and d is the protein degra-dation rate. Note that protein is produced in bursts of size m , which follows a geometric distribution(in accordance with experiments [29]), i.e. m ∼ Geom ( p ) with a mean burst size α = (1 − p ) /p = h/d m where h is the protein translation rate and d m is the mRNA degradation rate. This burstiness implic-itly describes the mRNA dynamics since a burst in protein expression occurs due to rapid translationof proteins by a single short-lived mRNA. Hence within this context, the parameter r is also the sameas the effective mRNA transcription rate which is given by r = ρ u σ u / ( σ b + σ u ) where ρ u is the mRNAtranscription rate, σ b is the rate of switching from the active state to the inactive state and σ u is therate of switching from the inactive state to the active state.Note that this model has been rigorously derived from a three-stage model of gene expression thatdoes not take the cell cycle explicitly into account [4]. However, it is commonly assumed that theprotein degradation reaction effectively models the dilution which occurs due to binomial partitioningat cell division. The question then is how we should choose the effective protein degradation rate.A simple argument is as follows. Since the protein decays exponentially via an effective first-orderreaction, its half-life is t d = log(2) /d ; we also know that the protein concentration is on average halvedat cell division due to binomial partitioning and hence t d = T , where T is the cell cycle length. Henceit follows that the effective protein degradation rate should be chosen as d = log(2) /T . The effectivemodel given by reaction scheme (1) and d chosen as aforementioned is one of the standard models ofgene expression in the literature [14].In Appendix A.1, we show that the effective model described above provides an accurate descriptionof the mean number of proteins in a three-stage model of gene expression with explicit mRNA andprotein dynamics, binomial partitioning and fixed cell cycle length T , provided (i) the mRNA degradesmuch faster than protein; (ii) promoter switching is much faster than protein decay; (iii) mRNAdegrades on a much shorter timescale than the cell cycle length T ; (iv) the mean number of proteinsis calculated from population measurements. Note that if instead the mean number of proteins iscalculated from single lineage measurements then we arrive at model (1) but with effective proteindegradation rate given by d = (2 / /T (see Appendix A). Note that effective degradation rates derivedfor lineage data are slightly smaller than their population snapshot equivalent, because / < log 2 .This discrepancy stems from the fact that the mean number of proteins calculated from populationmeasurements is smaller than the mean number of proteins calculated in single lineage measurements.This is because in population snapshots, all cells are tracked and hence due to a doubling of thenumber of cells at cell division, there is a bias towards observing young cells with small protein counts.3n contrast, in lineage measurements since only one cell is tracked, the probability of observing acell of any age is the same and hence the protein counts on average are higher than in populationmeasurements. A detailed discussion of the difference between these two types of measurement can befound in [27].The CME of Model I whose reactions are given by (1) is straightforward to solve using the methodof generating functions. In steady-state, its solution reads G ( z ) = (cid:18) − α ( z − (cid:19) β , (2)where G ( z ) = (cid:80) n z n P ( n ) is the probability generating function (PGF), P ( n ) is the steady-stateprotein distribution of protein number n and β = r/d . Note that β is the average number of mRNAmolecules produced in the protein life-time. In the case of stable proteins (which degrade only bydilution) it follows from our previous results for effective degradation rates that β = 3 y/ for lineagemeasurements and β = y/ log 2 for population snapshot measurements, where we have defined y = rT as the average number of mRNA produced in a cell cycle. The distribution can be obtained using P ( n ) = (1 /n !) dG n /dz n | z =0 which leads to a negative binomial NB ( β, α/ (1 + α )) . Next we consider a model where protein production occurs in bursts as in Model I but there is noeffective first-order reaction modelling protein degradation. Instead we explicitly model binomialpartitioning of the proteins at cell division. The major assumption of this model is that cell divisionoccurs at regular time intervals of length T . This is often referred to as a “timer” mechanism and hasbeen found in certain types of cells, e.g. early frog embryos [30]. In what follows we will find an exactsteady-state solution of the CME for this model and compare it with that of Model I.Let t ∈ [0 , T ] be the age of a given cell, namely t = 0 corresponds to its birth and t = T correspondsto the time at which it divides into two. The CME describing bursty protein expression and no activedegradation in a cell is given by dP j ( n, t ) dt = r ∞ (cid:88) m =0 P j ( n − m, t ) Q ( m ) − rP j ( n, t ) ∞ (cid:88) m =0 Q ( m ) , (3)where P j ( n, t ) is the probability that at cell age t in generation j there are n proteins observed. Here Q ( m ) = p (1 − p ) m with α = (1 − p ) /p is the geometric distribution with mean α . Note that each timecell division occurs, the generation number j is increased by one. The PGF equation corresponding tothe CME is ∂G j ( z, t ) ∂t = − rG j ( z, t ) (cid:18) −
11 + α (1 − z ) (cid:19) , (4)which has a time-dependent solution G j ( z, t ) = F j ( z ) exp (cid:18) − αrt (1 − z )1 + α (1 − z ) (cid:19) . (5)Note that F j ( z ) = (cid:80) n z n P j ( n, namely the PGF corresponding to the protein distribution at cellbirth in generation j .Introducing binomial partitioning at mitosis leads to a simple relationship between the proteindistribution at cell division of a cell in generation j and the distribution observed at the birth of thedaughter cell in generation j + 1 P j +1 ( n,
0) = ∞ (cid:88) i =0 (cid:18) in (cid:19) − i P j ( i, T ) , (6)4hich implies for the PGF F j +1 ( z ) = G j +1 ( z,
0) = G j (cid:18) z , T (cid:19) . (7)Note that in Eq. (6) we used the convention that i choose n equals zero when n < i . In this case wecannot impose steady-state as in Model I because cell division occurs at regular time intervals. Ratherwe consider cyclo-stationary conditions which are achieved when the probability that a cell of age t has a given number of proteins is independent of which generation it belongs to, i.e. the superscript j in Eqs. (3)-(7) can be ignored. Hence substituting Eq. (7) in Eq. (5) we obtain G ( z, t ) = G (cid:18) z , T (cid:19) exp (cid:18) − αrt (1 − z )1 + α (1 − z ) (cid:19) . (8)Next we proceed to solve Eq. (8) by substituting t = T in this equation to obtain G ( z, T ) = G (cid:18) z , T (cid:19) f ( z ) , (9)where f ( z ) = exp( − αrT (1 − z ) / (1 + α (1 − z )) . Eq. (9) can be solved by iteration as follows G ( z, T ) = f ( z ) f (cid:18) z (cid:19) G (cid:18) z , T (cid:19) , = f ( z ) f (cid:18) z (cid:19) f (cid:18) z (cid:19) G (cid:18) z , T (cid:19) , = G (1 , T ) ∞ (cid:89) s =0 f (cid:18) s + z − s (cid:19) = ∞ (cid:89) s =0 f (cid:18) s + z − s (cid:19) . (10)Note that here we used G (1 , T ) = 1 which follows from the normalisation of the distribution. Substi-tuting Eq. (10) in Eq. (9) we obtain G ((1 + z ) / , T ) which after substituting in Eq. (8) leads us to anexplicit solution of the generating function for Model II in cyclo-stationary conditions G ( z, t ) = (cid:32) ∞ (cid:89) s =0 f (cid:18) z − s +1 (cid:19)(cid:33) exp (cid:18) − αrt (1 − z )1 + α (1 − z ) (cid:19) , = exp (cid:18) rxt − x + rxT ∞ (cid:88) s =0 s − x (cid:19) , (11)where in the last line we used the definition of the function f and the definition x = α ( z − . Note thatthe sum over s in the argument of the exponent can be written in terms of the q -digamma function.The solution we have computed corresponds to the PGF of the protein distribution computed froman ensemble of identical cells all of which are at the same cell age t . However, distributions are oftencalculated from experimental measurements of time traces of the fluorescent protein molecules along acell lineage or else from population snapshots. Considering the single lineage case, the correspondingPGF is given by a time-average of the generating function calculated earlier G s ( z ) = (cid:90) T T G ( z, t ) dt = x − xy (cid:18) exp (cid:18) xyx − (cid:19) − (cid:19) exp (cid:18) xy ∞ (cid:88) s =0 s − x (cid:19) (12)where we have used y = rT as we did for Model I. Note that the subscript s will henceforth be used todenote single lineage measurement. Note also that the time-average is computed since the probability5f observing cells of any age is uniform in lineage measurements. Comparing this PGF to that ofModel I, i.e. Eq. (2), it is clear that in Model II the protein distribution is generally not equal to thenegative binomial distribution of Model I.To understand the differences between these two distributions, we next use the PGF’s given by Eq.(2) (with β = 3 y/ ) and Eq. (12) to compute the mean (cid:104) n (cid:105) and variance σ in stationary conditions (cid:104) n (cid:105) NB,s = 32 αy, (13) σ NB,s = 32 αy + 32 α y = 1 . αy + 1 . α y, (14) (cid:104) n (cid:105) s = 32 αy, (15) σ s = 32 αy + α (cid:18) y y (cid:19) ≈ . αy + 1 . α y + 0 . α y . (16)It is clear that while the mean of the two distributions is the same, the variances are generally different.In Appendix A, we clarify the origin of this discrepancy. In particular, Model I, under certain conditionsdescribed earlier, can match the mean number of proteins in a three-stage model of gene expressionwith explicit mRNA and protein dynamics, binomial partitioning and fixed cell cycle length T whereasModel II can match the full PGF of the three-stage model under the same conditions (see in particularAppendix A.2). The variance of Model II is always greater than that of Model I ( σ s > σ NB,s ).Furthermore, whilst the two variances are both quadratic in α , σ NB,s is linear in y while σ s is quadraticin y . The relative error between the two variances computed as ( σ s − σ NB,s ) /σ s increases monotonicallywith α and y but is mostly determined by the value of y , i.e. the average number of mRNA producedin a cell cycle, as illustrated in Fig. 1a. Expressions for the skewness squared can also be easily derivedfrom the PGF S NB,s = 2(2 α + 1) α ( α + 1) y , (17) S s = 108 (cid:0) α (7 y + 54) + 7 α ( y + 20) + 42 (cid:1) αy ( α ( y + 20) + 18) . (18)For small y , both of these expressions are proportional to /y while for large y , we have S NB,s ∝ /y while S s ∝ /y , i.e. for large enough y Model II will predict a less skewed distribution than Model I.Generally the skewness of the distribution of Model II can be larger or smaller than that of Model Idepending on the values of α and y (Fig. 1b).To get a fuller picture of the differences between the two models (assuming lineage measurements)we plot in Fig. 2 the distributions for various values of y whilst keeping the value of α fixed. Notethat the distribution of protein numbers for Model II is constructed by first expanding the generatingfunction Eq. (12) as a Taylor series using a symbolic computation software and then P ( n ) is simplygiven by the n th coefficient of this series. The theoretical predictions for Model II are also verifiedby means of the stochastic simulation algorithm (SSA, see Appendix B for a full description of thealgorithm). The negative binomial distribution of Model I (red line) is a good approximation of thedistribution of Model II (black line) for small y but clearly is inappropriate for large y ; it can also beshown that the difference between distributions becomes more pronounced if α is increased.A visual inspection of the distribution of Model II (black line) leads one to believe that one canlikely fit well an effective negative binomial for the cases shown in Fig. 2(a)-(c) but not for (d). Thisintuition is verified in Fig. 2 by plotting an effective negative binomial of the same mean and variance(green open circles) as the distribution of Model II; in Fig. 2d, the distribution is considerably flatternear the mode than the fitted negative binomial. Note that the effective negative binomial is given by6B ( z , z ) where z = 27 yy + 20 , (19) z = α ( y + 20) α ( y + 20) + 18 . (20)Since Model I has a negative binomial solution, it follows that through renormalisation of itsparameters, it can be matched to the effective negative binomial for Model II. If the renormalisedparameters for Model I are y e and α e , then its solution is NB ( z , z ) where z = 3 y e / , z = α e / (1+ α e ) .Equating z , z to those in Eqs. (19)-(20), we obtain α e = α (cid:18) y + 2018 (cid:19) , (21) y e = y (cid:18) y + 20 (cid:19) . (22)Note that renormalised Model I is the same as the green open circles shown in Fig. 2 which is amuch better approximation to Model II (black line) than the original Model I (red line). From theseequations, it is also clear that if parameters had to be estimated from experimental data (with lowcell cycle duration variability) using Model I, then the estimated mean burst size α e overestimates thetrue value α , while the estimated mean number of mRNA per cell cycle y e underestimates the truevalue y .In this section we have so far focused on the distributions for single lineage measurements. Thedistribution of protein numbers for population snapshots can also be derived and instead of Eq. (12)we then have G p ( z ) = (cid:90) T − t/T log(2) T G ( z, t ) dt, = ( x −
1) log 2 xy + ( x −
1) log 2 (cid:18) exp (cid:18) xy + ( x −
1) log 2 x − (cid:19) − (cid:19) exp (cid:32) xy ∞ (cid:88) s =0 s − x (cid:33) , (23)where the subscript p will be used to denote population snapshot measurements from hereon. Notethat we used the fact that when interdivision times are regularly spaced in time, the probability ofobserving a cell of age t ∈ [0 , T ] is − t/T log 2 /T for population measurements [17] (see also AppendixD for a derivation of the latter). The PGF of Model I has already been calculated for the populationscenario in Section 2 and was found to be given by Eq. (2) (with β = y/ log(2) ). For the parametersused in Fig. 2, the protein distributions for population data for Models I, II are found to be closeto those calculated for lineage data and hence we do not show them. The mean and variance of theprotein numbers of Models I and II are now given by (cid:104) n (cid:105) NB,p = αy log(2) , (24) σ NB,p = α (1 + α ) y log(2) ≈ . αy + 1 . α y, (25) (cid:104) n (cid:105) p = αy log(2) , (26) σ p = 1log 2 αy + 6 − α y + 1 − log α y ≈ . αy + 1 . α y + 0 . α y . (27)These equations are the population equivalent of Eqs. (13)-(16) and the same observations we madeearlier regarding the comparison of the moments of Model I and Model II for single lineage data are7lso seen to hold for population data. Note also that generally we can state (cid:104) n (cid:105) p < (cid:104) n (cid:105) s and σ p < σ s ,which can be explained by the enhanced probability of observing younger cells (and hence having asmaller protein content) in population measurements (as mentioned in Section 2).Summarizing, our results in this section imply that the effective degradation reaction in ModelI cannot effectively account for dilution via binomial partitioning. Generally the models agree onthe mean number of proteins in stationary conditions but not on the higher-order moments. Thediscrepancies are particularly obvious whenever y is greater than a few tens. The variance of Model Iis always less than that of Model II but the skewness of Model I can be greater or smaller than that ofModel II. We have also shown that the protein distribution of Model II can be well approximated byan effective negative binomial distribution only if y , the mean number of mRNAs produced in a cellcycle, is small. In this case, it is possible to renormalise the parameters in Model I so that its solutionapproximates that of Model II well. Next we consider a more complex and realistic model of bursty gene expression, namely one thatincludes cell cycle length variability. Such variability could for example originate in cell types wherethe cell growth rate is stochastic and cell division is triggered when the volume of a cell exceeds itsvolume at birth by a certain fixed amount (also called an “adder” mechanism, see for example [31]).As we shall see, this requires a very different master equation description than the previous models.Specifically, the model has the following properties: (i) the cell cycle is divided in N phases where theduration of each phase is exponentially distributed with parameter k . It then follows that the cell cyclelength distribution is Erlang, the mean cell cycle time is T = N/k and the coefficient of variation ofthe cell cycle duration is / √ N . (ii) proteins are produced at a rate r and in geometrically distributedburst sizes with mean α . (iii) cell division occurs instantaneously after the end of the N th phase whichleads to binomial partitioning of proteins between mother and daughter cells. Note that an Erlangdistribution provides a good fit to some of the measured cell cycle length distributions [21, 28].Let P i ( n, t ) be the probability that the cell cycle is in phase i at time t and that there are n proteinmolecules. We shall here ignore the generation number since in steady-state conditions this does notmatter. It then follows that the master equation describing the above model is given by dP ( n, t ) dt = − kP ( n, t ) + kP (cid:48) N ( n, t )+ r ∞ (cid:88) m =0 P ( n − m, t ) Q ( m ) − rP ( n, t ) ∞ (cid:88) m =0 Q ( m ) , (28) dP i ( n, t ) dt = − kP i ( n, t ) + kP i − ( n, t )+ r ∞ (cid:88) m =0 P i ( n − m, t ) Q ( m ) − rP i ( n, t ) ∞ (cid:88) m =0 Q ( m ) , i ∈ [2 , N ] (29)where Q ( m ) = p (1 − p ) m is the geometric distribution with mean α = (1 − p ) /p . The first term inthese equations models exit from the present cell cycle phase into the next phase, the second termmodels the entry into the present cell cycle phase from the previous one, and the third term modelsbursty protein production. Note that binomial partitioning during cell division is explicitly taken intoaccount by the second term of Eq. (28). In particular this process implies P (cid:48) N ( n, t ) = ∞ (cid:88) m =0 (cid:18) mn (cid:19) − m P N ( m, t ) , (30)8here we take the convention m choose n equals zero when n < m . The PGF equations correspondingto the CME Eq. (28)-(29) are given by ∂G ,s ( z ) ∂t = − kG ,s ( z ) + kG N,s (cid:18) z (cid:19) − rG ,s ( z ) (cid:18) −
11 + α (1 − z ) (cid:19) , (31) ∂G i,s ( z ) ∂t = − kG i,s ( z ) + kG i − ,s ( z ) − rG i,s ( z ) (cid:18) −
11 + α (1 − z ) (cid:19) , i ∈ [2 , N ] , (32)where we have suppressed the time dependence for convenience. Note that while for Model II, thecyclo-stationary condition meant that the protein distribution at a given cell age is independent ofgeneration number, for Model III the condition means that the protein number at a given cell cyclephase is independent of the generation number. Hence we can set Eq. (32) to zero and solve recursivelyfor G i,s ( z ) to obtain G i,s ( z ) = (cid:18) k ( x − k ( x −
1) + rx (cid:19) i − G ,s ( z ) , i ∈ [2 , N ] . (33)Substituting Eq. (33) with i = N in Eq. (31) with the left hand side equal to zero, we obtain G N,s (cid:18) z (cid:19) = k ( x −
1) + rxk ( x − G ,s ( z ) , = (cid:18) k ( x − k ( x −
1) + rx (cid:19) − N G N,s ( z ) . (34)Following the same method of solution as used for solving Eq. (9), we obtain G N,s ( z ) = ∞ (cid:89) s =0 (cid:18) − rxx ( k + r ) − s k (cid:19) N , (35)where we used the normalisation condition for the conditional distribution in each phase, i.e. G i (1) = 1 .Using Eq. (33) and Eq. (35) we obtain the PGF for the conditional protein distribution in cell phase iG i,s ( z ) = (cid:18) rxk ( x − (cid:19) N − i ∞ (cid:89) s =0 (cid:18) − rxx ( k + r ) − s k (cid:19) N , i ∈ [1 , N ] . (36)Since we are considering the case of single lineage measurements, we must average the PGF over allcell phases by marginalising out the phase in which a cell is at observation time. Note that in orderto do this we need an expression for Π i,s , the probability that the cell is in phase i at observation in asingle lineage measurement. For our case of N identically exponentially distributed phases it can beeasily shown that Π i,s = N − , reflecting that every phase is equally likely to be observed. Using thiswe derive the PGF for the protein distribution for lineage measurements G Es ( z ) = N (cid:88) i =1 Π i,s G i,s ( z ) = x − xy (cid:32)(cid:18) xyN ( x − (cid:19) N − (cid:33) ∞ (cid:89) s =0 (cid:18) − xyx ( N + y ) x − s N (cid:19) N , (37)where we used that y = rT and the mean cell cycle length T = N/k . Note that the superscript E denotes an Erlang distributed cell cycle length. This solution can be conveniently written in terms ofexponential functions yielding G Es ( z ) = x − xy (cid:18) exp (cid:18) N log (cid:18) xyN ( x − (cid:19)(cid:19) − (cid:19) exp (cid:32) N ∞ (cid:88) s =0 log (cid:18) − xy ( N + y ) x − s N (cid:19)(cid:33) . (38)9ote that the argument of the log is always positive because x = α ( z − ≤ due to z ≤ . Note alsothat in the limit of large N (at constant T ), the Erlang distribution describing the cell cycle lengthtends to a delta function centered on T , i.e. a cell cycle of fixed length. It is straightforward to showby a series expansion in /N that in the limit of large N , Eq. (38) converges to Eq. (12), i.e. in thelimit of small cell cycle length variability, Model III converges to Model II. The mean and varianceof the protein number distribution in steady-state conditions can be straightforwardly computed fromthe PGF (cid:104) n (cid:105) Es = αy (3 N + 1)2 N , (39) ( σ Es ) = αy (3 N + 1)2 N + α (cid:18) ( N ( N + 10) + 5) y N + (5 N + 3) y N (cid:19) . (40)As expected, the variance of Model III is always larger than that of Model II; the mean of ModelIII is slightly larger than that of Model II but the difference can be ignored in most cases of interest.Both the mean and variance are monotonic decreasing functions of N and hence they are boundedfrom above by the moments evaluated for N = 1 , i.e. an exponentially distributed cell cycle length.The differences in the protein number distributions predicted by Models II and III for lineageobservations are illustrated in Fig. 3a-c. There we show the excellent agreement between the theoreticalexpressions and stochastic simulations for both small y (Fig. 3a) and large y (Fig. 3b,c). While thedifferences between Model I and Model II are due to binomial partitioning, the differences betweenModel II and Model III are due to cell cycle length variability. We find that typically for N > the differences between Models II and III are small; the differences are at their largest when the cellcycle length is exponentially distributed, i.e. N = 1 . Previously we saw how for Model II the negativebinomial was not a good fit for the distribution when y was large. However as can be appreciated fromFig. 3b and 3c, the fit becomes better when we consider cell cycle length variability: the deviationsfrom negative binomial, which manifest in the flattish region around the mode, become less visible as N decreases. Note also that the deviations from negative binomial are largely unaffected by the valueof the mean burst size (increasing α by ten times in Fig. 3 causes the protein distributions to moveright and their height to be rescaled but their shape remains practically unaltered).Since the solution of Model I is generally a negative binomial, it follows that we can renormalise theparameters of this model such that its protein distribution provides a good match to the distributionof Model III when the cell cycle length variability is sufficiently high. Equating the mean and varianceof a negative binomial NB ( z , z ) to Eqs. (39) and (40) we find z = 3(3 N + 1) y ( N ( N + 10) + 5) y + 4 N (5 N + 3) , (41) z = (cid:18) N (3 N + 1) α ( N ( N + 10) + 5) y + 4 αN (5 N + 3) + 1 (cid:19) − . (42)Equating these two parameters ( z , z ) to those of Model I with renormalised parameters ( y e / , α e / (1+ α e ) ), we obtain the relationship between the actual and renormalised parameters α e = z − z = α (cid:18) N + 33(3 N + 1) + y (cid:18) N + 10 N + 56 N (3 N + 1) (cid:19)(cid:19) , (43) y e = 2 z y (cid:18) N + 1) ( N + 10 N + 5) y + 4 N (5 N + 3) (cid:19) . (44)It also follows from these formulae that if we had to fit a negative binomial to experimental datafrom cells with an Erlang distributed cell cycle length (data consistent with Model III) and estimatethe parameters using Model I, then this will lead to an over-estimate for the mean burst size and anunder-estimate for the mean number of mRNAs per cycle (and hence for the transcription rate). Theerrors increase with decreasing N and hence with increasing cell cycle duration variability.10e have here focused on the distributions for single lineage measurements. The distributions ofprotein numbers for population snapshots can also be derived. Due to the rather more complex analysisinvolved, the derivation is presented in Appendix D. Here we simply state the equivalent of Eq. (37)for population measurements G Ep ( z ) = (cid:16) N − (cid:17) N ( x − (cid:16) xyN ( x − + 2 N (cid:17) N − (cid:16) N − (cid:17) N ( x −
1) + xy ∞ (cid:89) s =0 (cid:18) − xyx ( N /N + y ) − N /N s (cid:19) N . (45)The protein distributions corresponding to this PGF are shown in Fig. 3(d)-(f) where they are alsocompared with those of Model II. Note that given the same parameters, the protein distributionsfor lineage and populations observations are considerably different. These differences become moreappreciable with increasing y and decreasing N . While the increase in cell cycle length variability(through decreasing N ) results in little changes to the mode of the lineage distribution, it causes themode of the population distribution to shift to the left. However there are also qualitative similarities,namely that in both cases the deviations from negative binomial are maximal for small cell cycle lengthvariability (large N ) and large y . Similar to what we previously did for lineage observations (see Eqs.(43) and (44)), from the equations for the mean and variance for population snapshots (see AppendixD), it is also possible to calculate the renormalised parameters in Model I such that it provides a goodnegative binomial approximation to the population distribution of Model III when there is sufficientcell cycle length variability.Summarizing in this section we have studied a model (Model III) of bursty gene expression with anErlang distributed cell cycle length. This model recovers Model II in the limit of small cell cycle lengthvariability. Also the presence of sufficient cell cycle length variability is found to lift the deviationsfrom negative binomial observed for Model II; in this case by an appropriate renormalisation of theparameters, Model I can describe the distribution predicted by Model III well. The mean and varianceof protein numbers calculated from a similar model (assuming lineage observations) have been reportedpreviously [22]. We next consider a more general version of Model III: (i) the time spent in phase i of the cell cycleis exponentially distributed with parameter k i , which implies that the cell cycle length distributionis hypoexponential. (ii) the transcription rate and burst size are age dependent, i.e. they are r i and α i in phase i respectively. Note that the Erlang distribution is a special case of the hypoexponentialdistribution and hence the use of this distribution, in principle, allows more flexibility in fitting ex-perimental cell cycle distributions. Note also that modelling the transcription rate as age dependentenables us to capture replication, hence considerably extending the realism of our model. The masterequation describing the above model is a generalisation of that for Model III and is given by dP ( n, t ) dt = − k P ( n, t ) + k N P (cid:48) N ( n, t )+ r ∞ (cid:88) m =0 P ( n − m, t ) Q ( m ) − r P ( n, t ) ∞ (cid:88) m =0 Q ( m ) , (46) dP i ( n, t ) dt = − k i P i ( n, t ) + k i − P i − ( n, t )+ ∞ (cid:88) m =0 r i P i ( n − m, t ) Q i ( m ) − r i P i ( n, t ) ∞ (cid:88) m =0 Q i ( m ) , i ∈ [2 , N ] , (47)11here Q i ( m ) = p i (1 − p i ) m is the geometric distribution with mean α i = (1 − p i ) /p i . Note that P (cid:48) N ( n, t ) is defined as before using Eq. (30). The corresponding PGF equations are given by ∂G ,s ( z ) ∂t = − k G ,s ( z ) + k N G N,s (cid:18) z (cid:19) − r G ,s ( z ) (cid:18) −
11 + α (1 − z ) (cid:19) , (48) ∂G i,s ( z ) ∂t = − k i G i,s ( z ) + k i − G i − ,s ( z ) − r i G i,s ( z ) (cid:18) −
11 + α i (1 − z ) (cid:19) , i ∈ [2 , N ] , (49)where we have suppressed the time dependence for convenience. Setting Eq. (49) to zero (steady-stateconditions) and solving recursively for G i,s ( z ) we obtain G i,s ( z ) = w i,s ( z ) G ,s ( z ) , (50)where w i,s ( z ) = (cid:40) i = 1 , (cid:81) ij =2 k j − ( x j − k j ( x j − r j x j i ∈ [2 , N ] . (51)Note that here we defined x j = α j ( z − . Substituting Eq. (50) with i = N in Eq. (48) with the lefthand side equal to zero, we obtain G N,s (cid:18) z (cid:19) = G N,s ( z ) (cid:18) k ( x −
1) + r x k N ( x − w N ( z ) (cid:19) . (52)Following the same method of solution as used for solving Eq. (9), we obtain G N,s ( z ) = ∞ (cid:89) s =0 N (cid:89) j =1 (cid:18) − r j x j x j ( k j + r j ) − s k j (cid:19) , (53)where again we used the normalisation condition for the conditional distribution in each phase, i.e. G i,s (1) = 1 . Hence finally by means of Eq. (50) and Eq. (53), we can write an equation for the PGFof the distribution assuming single lineage measurements G i,s ( z ) = w i,s ( z ) w N,s ( z ) G N,s ( z ) , (54) G Hs ( z ) = N (cid:88) i =1 Π i,s G i,s ( z ) , (55)where again we let Π i,s denote the probability of observing the cell in phase i in a lineage measurement,which can be shown to be given by Π i,s = k − i / ( (cid:80) j k − j ) . Note the superscript H (standing forhypoexponential) is to distinguish this PGF from the one calculated for Model III using the Erlangdistribution. We next use this theory to understand the effects of gene replication on stochastic gene expression.One of the simplest models of this process assumes (i) the transcription rate to be a constant r before replication, doubling to r right after replication (the doubling in transcription rate is due tothe doubling in gene copy number during replication). (ii) the cell cycle length and the replicationtime are Erlang distributed. This implies the special case where k i = k and α i = α for all i , and r i = r, i ∈ [1 , M ] and r i = 2 r, i ∈ [ M + 1 , N ] , where M is the cell cycle phase after which replication12ccurs. Substituting these values in Eqs. (54) and (55), we obtain G Hs ( z ) = ( x − xy (cid:32)(cid:18) xyN ( x −
1) + 1 (cid:19) N − M (cid:32) (cid:18) xyN ( x −
1) + 1 (cid:19) M − (cid:33) − (cid:33) × ∞ (cid:89) s =0 (cid:18) − xyx ( N + 2 y ) − N s (cid:19) N − M (cid:18) − xyx ( N + y ) − N s (cid:19) M . (56)This can also be written in exponential form as we have previously done for other expressions. Themean and variance of protein fluctuations are given by (cid:104) n (cid:105) Hs = αy ( M ( M −
1) + 2 N (3 N − M + 1))2 N = (cid:104) n (cid:105) Es (cid:18) M + 2( N − M ) N (cid:19) − αyM ( N − M )2 N , (57) (cid:0) σ Hs (cid:1) = αy ( M ( M −
1) + 2 N (3 N − M + 1))2 N + α y (cid:0) N + 2 N (3 − M ) + 3 M ( M − (cid:1) N + α y (cid:32) N + 40 N − M ( M + 3) − N + 12 M (cid:0) M − (cid:1) N − M ( M − N (cid:33) . (58)Using the notation ( σ Es ) (cid:12)(cid:12)(cid:12) vy to denote the variance in Model III, i.e. Eq. (40), when the transcriptionrate is vr we find (cid:0) σ Hs (cid:1) = (cid:20) ( σ Es ) (cid:12)(cid:12)(cid:12) y (cid:18) MN (cid:19) + ( σ Es ) (cid:12)(cid:12)(cid:12) y (cid:18) N − MN (cid:19)(cid:21) − αyM ( N − M )2 N (cid:32) α − αy (cid:0) ( M − + N ( N − M − (cid:1) N (cid:33) . (59)Note that for the special cases M = 0 and M = N , these two equations are the same as the meanand variance of Model III (given by Eqs. (39) and (40)) with y replaced by y and y , respectively; thisis since in this case the transcription rate is the same in all phases of the cell cycle and equal to r and r , respectively. It also follows that Model II is obtained by setting M = N and taking the limit N → ∞ . Hence Model IV contains as special cases the previous Models II and III. Note that whilewe have considered the cases M = 0 and M = N to see the relationships between the various models,when we want to explicitly model replication we need < M < N and N ≥ since there is alwaysa pre-replication and post-replication phase of the cell cycle. In Fig. 4 we show that the theoreticalprotein distributions for each phase of the cell cycle accurately match those obtained from stochasticsimulations.The coefficient of variation squared ( CV s = (cid:0) σ Hs / (cid:104) n (cid:105) Hs (cid:1) ) can be shown to decrease monotonicallywith increasing α and y . However, the dependence on N and M (the replication phase of the cellcycle) is non-monotonic (see Appendix C). In Fig. 5 we show plots of the CV s as a function of all fourparameters which numerically verifies the aforementioned properties. We also observe that the size ofnoise as measured by CV s is almost independent of the replication phase M for large y but increasesmonotonically with M for small y .There is an interesting relationship between Model III and Model IV, as follows. If we renormalisethe parameter y in Model III by changing it to Λ s y where Λ s = 2 − MN − M ( N − M ) N (1 + 3 N ) , (60)then (cid:104) n (cid:105) Hs = (cid:104) n (cid:105) Es and (cid:0) σ Hs (cid:1) ≈ (cid:0) σ Es (cid:1) . Note that while the relationship between the means is exactthis is not true for the variances; the accuracy of the latter, however, is shown for six different parameter13ets in Fig. 6a. Note also that since the mean and variance of Model IV and the renormalised Model IIImatch, it follows that if the distribution in both models is well approximated by a negative binomial(a two parameter distribution) then we expect the two distributions to also match. This is indeed thecase for models with small y and including sufficient cell cycle length variability, i.e. moderate N , ascan be seen in Fig. 6b.However, when the cell cycle length variability decreases as N → ∞ we see that the distributionof Model IV starts to deviate from the renormalised Model III when y grows large as we show in Fig.6c. This can be understood from the observation that the pre-replication and post-replication phase ofthe cell for large y have distinct protein distributions, which was visible in Fig. 4 as well, and the factthat the total observation distribution is a sum of both contributions. For moderate N , i.e, significantcell cycle length variability, the pre-replication and post-replication protein distributions overlap sincethey are wide, which yields good correspondence with Model III as can be seen in Fig. 6b (comparesolid and dashed blue lines). However for N large enough (low cell cycle length variability) and large y , the overlap between the two contributions becomes smaller, resulting in the almost bimodal natureof the distribution in Fig. 6c. Hence it follows that replication effects in Model IV can be describedwell by an appropriately scaled Model III provided cell cycle length variability is not too small. It alsofollows that it can also be well described by Model I with renormalised parameters since the latter wehave shown in Section 4 to be in good agreement with Model III provided y and N are not large.Finally we mention that similar conclusions hold for population snapshot observations as we haveseen for lineage data. A derivation of the snapshot distribution for Model IV can be found in AppendixD. The population equivalent of Eq. (56) is given by G Hp ( z ) = G N,p ( z ) W ( z ) , where G N,p ( z ) = ∞ (cid:89) s =0 (cid:18) − xyx ( N /N + 2 y ) − N /N s (cid:19) N − M (cid:18) − xyx ( N /N + y ) − N /N s (cid:19) M , (61) W ( z ) = (cid:16) N − (cid:17) N ( x − × (cid:16) xyN ( x − + 2 N (cid:17) N − M (cid:18)(cid:16) xyN ( x − + 2 N (cid:17) M − (cid:19)(cid:16) N − (cid:17) N ( x −
1) + xy + (cid:16) xyN ( x − + 2 N (cid:17) N − M − (cid:16) N − (cid:17) N ( x −
1) + 2 xy . (62)As shown in Fig. 6b, the difference between the population snapshot distribution and the lineagedistribution can be significant (compare solid red and blue lines). The mean of the former is lessthan that of the latter which is due to a preponderance of young cells in population measurements (asdiscussed in Section 2). Considering the appropriate rescaling of the parameter y by changing it to Λ p y ,where Λ p = 2 − ( M/N ) , we also get good agreement between Model IV and Model III for populationsnapshots when the cell cycle length variability is moderate (small N ) and y is small (compare dashedand solid red lines in Fig. 6b). This however does not carry through for the case of large y and N (Fig. 6d). Hence while the distributions are appreciably different for population and lineage cases,nevertheless qualitatively the results for the two are similar. Plots of the coefficient of variation squared versus mean protein number have been used in the literatureto separate intrinsic noise from extrinsic noise (see in particular Fig. 2B of Ref. [32]). Specifically,intrinsic noise is associated with the term proportional to the inverse mean since its contributiondecreases with the mean protein number, while extrinsic noise is associated with the term which isindependent of the mean. Within this interpretation, using the expressions previously derived for themean and variance for lineage observations, it is clear that Model I predicts no extrinsic noise ( CV is inversely proportional to mean) while Models II, III and IV predict extrinsic noise stemming frombinomial partitioning and cell cycle length variability. Since Models II and III are special cases of14odel IV, we shall only consider the latter. Using Eqs. (57)-(58) for lineage distributions, we canwrite the coefficient of variation squared in terms of the mean protein expression level CV s = E s, int ( M, N, α ) (cid:104) n (cid:105) Hs + E s, ext ( M, N ) , (63)where the positive functions E s, int and E s, ext are given by E s, int ( M, N, α ) = α (cid:20) M ( M −
1) + 4 N (5 N + 3 − M )3 M ( M −
1) + 6 N (3 N + 1 − M ) (cid:21) + 1 , (64) E s, ext ( M, N ) = 4 N + 40 N − M ( M + 3) − N + 12( M − M N − M ( M − M ( M −
1) + 2 N (3 N + 1 − M )) . (65)This shows that in the limit of abundant proteins, the protein noise as measured by the coefficientof variation squared, tends to a constant, E s, ext ( N, M ) which is independent of the intrinsic proteindynamics. This limiting value is only controlled by the cell cycle length variability via M and N . Theintrinsic noise, on the other hand, is governed by both cell cycle length variability and protein burstsize.To study the dependencies on cell cycle length variability we let M = uN , where u ∈ [0 , representsthe fraction of the cell cycle spent in the non-replicated phase. It can be shown that if u and α are fixedthen the internal component of noise given by the first term in Eq. (63) increases with N while theexternal component given by the second term decreases with N (see Appendix C). Hence unexpectedly, CV s can increase or decrease with cell cycle length variability. More specifically in Appendix C (seeEq. (102)) we show that if y is above a threshold then protein noise increases with increasing cellcycle variability. This condition is met in mammalian cells and yeast cells [33, 34] since the range of y and α is greater than one. In contrast in bacteria, due to the very low mean mRNA produced percell cycle (as low as 0.1 [32]), the condition is not necessarily met and hence it is possible to have CV s decrease with increasing cell cycle length variability (the dependence is however very weak as can beseen in Fig. 7a). Note that one can derive population expressions equivalent to Eqs. (64)-(65). Itcan be proved (see Appendix C) that unlike for lineages, CV p computed from population snapshotsalways increases with increasing cell cycle length variability (see Fig. 7).Next we determine bounds on the external noise floor for N ≥ . First of all, we note that since theextrinsic noise floor E s, ext decreases monotonically with N , its upper bound must be given by N = 1 while its lower bound by N → ∞ . For the case of purely exponential cell cycle length variability, i.e. N = 1 , we then find that E s, int ( u, α ) = (cid:18) α (cid:19) − αu (1 − u )3( u − u + 8) , (66) E s, ext ( u ) = 13 + 4 u (1 − u )(2 − u )(4 − u )3( u − u + 8) , (67)which is maximal when u ≈ . at 0.39. On the other hand, in the opposite limit of deterministic cellcycle times, i.e. N → ∞ , the noise contributions become E s, int ( u, α ) = (cid:18) α (cid:19) − αu (1 − u )9( u − u + 6) , (68) E s, ext ( u ) = 127 + 4 u (1 − u )(7 u − u + 12)27( u − u + 6) , (69)which is minimal when u ≈ . at 0.034. Hence it follows that the external noise floor for N ≥ iscontained in the approximate interval (0.034,0.39).Note that for u = 0 or u = 1 , the special case without a replication phase, Eq. (63) togetherwith Eqs. (66)-(67) almost perfectly recovers a recent result in the literature for the CV of protein15uctuations in a model with exponential cell cycle length variability (see Eq. 7 of Ref [26]), CV =(4 / α + 1 / / (cid:104) n (cid:105) + 1 / . The same result is obtained for population snapshot calculations. Theslight discrepancy is likely caused by the continuum approximation for protein levels in [26].An experimental value of approximately 0.1 has been measured for the extrinsic noise floor in thebacterium E. coli (see Fig. 2B of [32]) which falls within the range of our theory (as described above).From the limited lineage data shown in the Taniguchi et al. paper (Fig. 2C shows three lineages with 9cell division events), it is not clear which Erlang distribution would best fit their data. However if weconsider replication to occur in the middle of the cycle ( M = N/ ) and the cell cycle length to be welldescribed by an Erlang distribution with N = 5 phases, then Eq. (65) predicts a value of E s, ext ≈ . which is remarkably close to the experimental value of 0.1. Note that N = 5 is not unrealistic; it wouldimply a maximum difference of up to / √ N ≈ of the cell cycle length from its mean which isconsistent with some recent experiments in E. coli [27, 35]. The noise decomposition above can alsobe done for population snapshots using the equations for the mean and variance derived in AppendixD. It is found that the theory predicts the same value of E p, ext ≈ . for extrinsic noise, and hence thisresult is insensitive to the type of observations made. In this paper we have derived the PGF corresponding to the stable protein number distributions instochastic gene expression models with cell cycle length variability. Our work has the following specialfeatures: (i) the solution method used allows the derivation of distributions rather than the mean andvariance; (ii) the distributions for cell cycle length variability assumed by the model are of a very generalform (hypoexponential) which fit the majority of experimentally measured distributions; (iii) the modelallows the explicit description of the variation of transcription and burstiness with the position of thecell cycle (the cell age); and (iv) the calculations are done for both lineage and population snapshotobservations which enhances the match between theory and experiments. A necessary underlyingassumption of our approach to compute the PGF is that protein is stable, i.e. its decay occurs purelydue to dilution by cell division. This is a good assumption when protein lifetimes are much longerthan the mean cell cycle time such as in
E. coli [15] and yeast [16]. In mammalian cells, about 70percent of proteins are longer lived than the mean cell cycle time, and hence the approximation is alsoreasonable [33].Special cases of our model can be found in the literature: (i) for a cell cycle composed of N phases,each of which is exponentially distributed in length, and assuming lineage observations, expressions forthe mean and variance have been obtained in Ref. [22, 23]; (ii) for a cycle whose length is exponentiallydistributed, an expression for the approximate protein number distribution (assuming large enoughprotein numbers) for both lineages and population snapshots has been derived in Ref. [26].A major contribution of our study is the comparison of different models of gene expression includingthose with an implicit cell cycle description (Model I) via effective protein degradation and modelswith an explicit cell cycle description with either regular (Model II) or random interdivision times(Models III and IV). We found that the protein distributions of Models II-IV are well approximatedby a negative binomial provided cell cycle length variability is large and y (the mean number of mRNAper cycle) is small (we shall henceforth call these the special conditions). In such cases, the implicitcell cycle model (Model I) with renormalised parameters can describe the results of the explicit cellcycle models. When the special conditions are not met, the distributions show either a flat regionnear the mode or else have a right shoulder which in some cases can almost look like bimodality; ofcourse, Model I cannot capture these distributions. Such a case may be common for gene expressionin mammalian cells and yeast where it is estimated that for many genes, y can take values in the rangeof 1 to about 600 [33, 34]; in contrast in bacteria y has the range . − [32] and hence deviationsfrom negative binomial distributions are likely much less common. Also we have shown that when thespecial conditions are not met, the distributions of models including replication or more complex age-dependent transcriptional dynamics cannot be described by models that assume constant transcription16hrough the cell cycle such as those found in [26, 27]. Our analysis shows that in a model assumingErlang distributed cell cycle duration and replication time, for lineages, the coefficient of variationsquared can either increase or decrease with cell cycle variability whereas for population snapshots,the coefficient of variation squared increases with cell cycle variability. We also show that the thecoefficient of variation squared has a complex dependence on the replication time; it is practicallyindependent of the replication time for large y but increases monotonically with the replication timefor small y . Finally we show that given experimental cell cycle length distributions for E. coli andassuming replication occurs halfway through the cell cycle, our theory predicts a value for extrinsicnoise which is within a few percent of that measured in Ref. [32].Despite its generality, our study has a number of limitations: (i) the analytical approach cannot beextended to derive mRNA distributions. This is since mRNA typically degrades faster than the meancell cycle time [32] and hence is not stable, which is a necessary assumption to solve for the PGF;(ii) we have assumed binomial partitioning. While this assumption presents the simplest reasonablemodel of stochastic partitioning, it likely fails for those proteins which are highly localised [36]; (iii)we have assumed that there is no correlation between the cell cycle duration of mother and daughtercells. Experiments show such a correlation exists [37, 38]. Overcoming these limitations are key toexpanding the realism of the model and are the subject of on-going research.
Acknowledgments
R. P-C. acknowledges support from the UCL Mathematics Clifford Fellowship. R. G. acknowledgessupport from the Leverhulme Trust (RPG-2018-423). C. B. acknowledges the Clarendon Fund andNew College, Oxford, for funding.
AppendixA Derivation of Models I and II from a model with an explicitdescription of mRNA and of a cell cycle of fixed length
A.1 Derivation of Model I
We consider a three-stage gene expression model of the type G σ b −→ G ∗ , G ∗ σ u −−→ G, G ρ u −→ G + M, M d m −−→ ∅ , M h −→ M + P, (70)Note that there is no active protein decay and instead we assume protein decay occurs only dueto binomial partitioning at cell division. The latter is assumed to happen at regularly spaced timeintervals of length T . If the rate of promoter switching is very fast compared to the mRNA and proteintimescales then there is no need to explicitly model G ∗ . Rather it is sufficient to model expressionfrom a single promoter state with an effective transcription rate equal to the true transcription ratemultiplied by the fraction of time that the gene is on. In other words the three-stage model (70)reduces to the two-stage model G r −→ G + M, M d m −−→ ∅ , M h −→ M + P, (71)where r = ρ u σ u / ( σ b + σ u ) is the effective mRNA transcription rate. Let the PGF be defined as G ( z (cid:48) , z, t ) = (cid:88) m,n ( z (cid:48) ) m ( z ) n P ( m, n, t ) , (72)17here P ( m, n, t ) is the probability of observing m mRNAs and n proteins at time t . The PGF thensatisfies a PDE which when non-dimensionalised on the cell cycle time scale, t = T τ , is given by ∂G∂τ = rT ( z (cid:48) − G + d m T (1 − z (cid:48) ) ∂G∂z (cid:48) + hT z (cid:48) ( z − ∂G∂z (cid:48) . (73)Furthermore binomial partitioning under cyclo-stationarity conditions [17] leads to the boundary con-dition (in non-dimensional form) G ( z (cid:48) , z,
0) = G (cid:18) z (cid:48) + 12 , z + 12 , (cid:19) . (74)By using the definitions of the mean numbers of proteins and mRNA (cid:104) n (cid:105) ( τ ) = ∂G ( z (cid:48) , z, τ ) ∂z (cid:12)(cid:12)(cid:12)(cid:12) z (cid:48) = z =1 , (cid:104) m (cid:105) ( τ ) = ∂G ( z (cid:48) , z, τ ) ∂z (cid:48) (cid:12)(cid:12)(cid:12)(cid:12) z (cid:48) = z =1 , (75)it is straightforward to show that the time-evolution of the means is given by the coupled ODEs d (cid:104) n (cid:105) ( τ )d τ = hT (cid:104) m (cid:105) ( τ ) , (76) d (cid:104) m (cid:105) ( τ )d τ = rT − d m T (cid:104) m (cid:105) ( τ ) , (77)with the boundary conditions (cid:104) n (cid:105) (0) = (cid:104) n (cid:105) (1) and (cid:104) m (cid:105) (0) = (cid:104) m (cid:105) (1) . Solving these ODEs one obtains (cid:104) n (cid:105) ( τ ) = hre − d m T τ (cid:0) e τT d m (1 − ( τ + 1) T d m ) + 2 e ( τ +1) T d m (( τ + 1) T d m −
1) + e T d m (cid:1) d m (2 e T d m − , (78)where τ is to be understood as the fractional cell age which equals zero when a cell is born and onejust before a cell divides. Note that we do not show the equation for the mRNA since it is not relevantto our analysis. Hence it follows that the mean number of proteins in single lineage and populationmeasurements is given by (cid:104) n (cid:105) s = (cid:90) (cid:104) n (cid:105) ( τ )d τ = αy (cid:16) d m (3 T d m −
2) + T − T e
Tdm + T (cid:17) T d m , (79) (cid:104) n (cid:105) p = log(2) (cid:90) − τ (cid:104) n (cid:105) ( τ )d τ = αT yd m T log(2) d m + log (2) , (80)where we used y = rT and α = h/d m . Note that here we used the fact that when interdivision timesare regularly spaced in time, the probability of observing a cell of age t ∈ [0 , T ] is uniform for singlelineage measurements and equal to − t/T log 2 /T for population measurements [17] (see also AppendixD for a derivation of the latter). Taking the limit that mRNA decays fast compared to the cell cyclelength, i.e. d m T → ∞ , we obtain (cid:104) n (cid:105) s ≈ αy , (81) (cid:104) n (cid:105) p ≈ αy log(2) . (82)Now one may ask what constitutes an effective system of reactions that describes protein dynamicsand which has the same mean number of proteins as above. If we consider this effective system to begiven by Model I, that is the set of reactions G r −→ G + mP, P d (cid:48) −→ ∅ , (83)18here m is a geometrically distributed random number with mean α , it follows that the mean numberof proteins is given by (cid:104) n (cid:105) I = rαd (cid:48) = αyd (cid:48) T . (84)Equating Eq. (84) with Eq. (81), we obtain the effective protein degradation rates for single lineagemeasurements, d (cid:48) = 23 T . (85)Similarly equating Eq. (84) with Eq. (82), we find that the effective protein degradation rates forpopulation measurements is d (cid:48) = log(2) T . (86)
A.2 Derivation of Model II
Starting from the same mRNA model as used in A.1 we now derive Model II. Consider the PDE Eq.(73) with boundary condition Eq. (74). To solve this PDE, we employ the method of characteristicswhich yields the system of equations d τ d s = 1 , (87) d z (cid:48) d s = d m T ( z (cid:48) −
1) + hT z (cid:48) (1 − z ) , (88) d z d s = 0 , (89) d G d s = rT ( z (cid:48) − G, (90)where s is the characteristic variable. In particular we see that τ = s and z = z is constant.To consider the case of unstable mRNA we define ε = 1 / ( d m T ) as the asymptotic variable. Wesee that to achieve a dominant balance as ε → (or d m T → ∞ ) we could choose hT = O ( d m T ) . Asbefore we use, α = lim d m T →∞ ( hT /d m T ) and taking this scaling we see that as ε → we get ε d z (cid:48) d s = z (cid:48) − αz (cid:48) (1 − z ) + O ( ε ) , (91)and therefore have to leading order ( z (cid:48) − − αz (cid:48) ( z −
1) = 0 . This results in z (cid:48) → / (1 − α ( z − quickly in the limit of ε → and therefore the following effective PDE for the PGF ∂G∂τ = rT α ( z − − α ( z − G. (92)Note that we can then define ˜ G ( z, τ ) = lim z (cid:48) → G ( z (cid:48) , z, τ ) = (cid:80) n y n P ( n, τ ) , i.e. the PGF for themarginal protein distribution. It follows now that ˜ G satisfies the following cyclo-stationary system inthe limit of ε → ∂ ˜ G∂τ = rT α ( z − − α ( z −
1) ˜ G, (93a) ˜ G ( z,
0) = ˜ G (cid:18) z + 12 , (cid:19) . (93b)Note that this is the same system satisfied by G ( z, t ) from Model II, Eqs. (4) and (7) in the cyclo-stationary limit where we drop the superscript for the generation. Therefore we have shown that thetwo-stage expression model (71) with binomial partitioning converges to model II in the limit of fastmRNA decay compared to the cell cycle length, i.e. d m T (cid:29) .19 Stochastic simulations
Stochastic simulations in this work were carried out using a method similar to the First DivisionAlgorithm in [27], which effectively is the (modified) next reaction method with the addition of celldivision and observation events. Below follows a description of the exact procedures used to calculatestochastic realisations of Models II, III and IV in this paper. The code used for this work can be foundonline at https://bitbucket.org/CasperBeentjes/protein-cell-cycle-ssa . Lineage simulation
For the simulation of lineage data we consider a single cell in a mother machine that we continuouslytrack. Measurements of the cell’s contents are taken at intervals defined by some observation timedistribution. For this paper we take measurements at regular intervals, i.e. a delta distribution for theobservation interval distribution.1.
Initialisation.
Start a cell in the mother machine at time t = 0 with initial molecule content n .Assign a phase j and age within the phase, a .2. Generate waiting times τ r , τ p and τ o , the time until the next biochemical reaction, the next phasechange and the next observation respectively.3. Until t ≥ t final . Pick ∆ = min( τ r , τ p , τ o ) , update all waiting times ( τ r , τ p , τ o ) by τ → τ − ∆ andupdate phase age a → a + ∆ . Based on the minimum found for the waiting times proceed to:a. Biochemical reaction.
Select the biochemical reaction occurring, e.g. using Gillespie-typealgorithm, and update cell molecule content. Generate a new time until next biochemicalreaction τ r .b. Phase progression.
Based on the current phase of the cell proceed to:i. If phase of the cell is N , reset the new phase of the cell to the first phase, j → .Binomially partition the cell contents across the cell in mother machine and daughtercell, discard daughter cell and keep following cell in mother machine.ii. Otherwise, set the new phase of the cell, j → j + 1 .Set the age of cell in the new phase to a = 0 and update the biochemical rate parametersaccording to the current phase j . Generate a new time until next phase progression τ p anda new time until next biochemical reaction τ r .c. Observation.
Write the current cell contents and other quantities of interest to disk. Gen-erate a new time until next observation τ o .Update simulation time t → t + ∆ and return to the start of step 3. Population snapshot simulation
For the simulation of population snapshot data we consider an initial batch of cells that we let growand divide until some final time t final at which point we measure the contents of all the cells present.1. Initialisation.
Start batch of cells at time t i = 0 with initial molecule contents n i , where i indexrolls over the batch of present cells. Assign a phase j i and age within the phase, a i , to each cellin the batch. Start with the first cell in the batch, l → .2. Until all cells in batch have reached final time t final pick the next cell in the batch and proceedto step 3.3. Generate waiting times τ r and τ p the time until the next biochemical reaction and the next phasechange respectively. Set the time until observation as τ o → t final − t l .20. Until t i = t final . Pick ∆ = min( τ r , τ p , τ o ) , update all waiting times ( τ r , τ p , τ o ) by τ → τ − ∆ andupdate phase age a → a + ∆ . Based on the minimum found for the waiting times proceed to:a. Biochemical reaction.
Select biochemical reaction occurring, e.g. using Gillespie-type al-gorithm, and update cell molecule content. Generate a new time until next biochemicalreaction τ r .b. Phase progression.
Based on the current phase of the cell proceed to:i. If phase of the cell is N , reset the new phase of the cell to the first phase, j → .Binomially partition the cell contents across the current cell and a daughter cell. Addthe daughter cell with its content to the batch of cells with start time t i (cid:48) = t l , a i (cid:48) = 0 and j i (cid:48) = 1 .ii. Otherwise, set the new phase of the cell, j → j + 1 .Set the age of cell in the new phase to a = 0 and update the biochemical rate parametersaccording to the current phase j . Generate a new time until next phase progression τ p anda new time until next biochemical reaction τ r .c. Observation.
Write the current cell contents and other quantities of interest to disk. Let l → l + 1 and return to step 2.Update simulation time t l → t l + ∆ and return to the start of step 4. C Coefficient of variation proofs for Model IV
Effect of burst size α and mRNA production per cycle y First we prove that CV (and CV ) monotonically decrease with increasing α and y . We note usingEqs. (57)-(58) that the mean and variance of protein numbers in Model IV with replication can bewritten as (cid:0) (cid:104) n (cid:105) Hs (cid:1) = α y A, (cid:0) σ Hs (cid:1) = αyB + α yC + α y D, where A, B, C, D are functions of
M, N solely. First we observe that
A > must hold by the observationthat the mean of the Model IV is necessarily positive if α > and y > . This means that we have forthe squared coefficient of variation CV = 1 A (cid:18) Bαy + Cy + D (cid:19) . Note that CV ≥ must hold for all α ≥ and y ≥ . By considering the limit α ↓ we deduce that B ≥ . Similarly, by considering the limit y ↓ we find that for all α ≥ we must have B + Cα ≥ and therefore C ≥ . Finally from the limit y → ∞ we deduce that D ≥ . It then immediately followsthat CV and CV decrease monotonically with α and/or y increasing. Note that this argument appliesequally well to the population snapshot case. Effect of cell cycle variability N for lineages Next we show that the extrinsic noise floor, i.e. the limit of CV for abundant proteins, is a monotonicdecreasing function of N . Recalling Eq. (65) and re-writing the noise floor E s, ext ( M, N ) using M = uN ,where u ∈ [0 , , we have E s, ext ( u, N ) = N (cid:0) − u + 12 u − u + 4 (cid:1) + N (cid:0) u − u + 40 (cid:1) − u − u + 203 ( N ( u − u + 6) − u + 2) . (94)21e then note that the derivative of the noise floor with respect to N is given by d E s, ext d N = − (cid:0) ( − u + 16 u − u + 40) + N (3 u − u + 58 u − u + 56) (cid:1) N ( u − u + 6) + (2 − u )) < . (95)The inequality follows from considering every term in brackets for N > and u ∈ [0 , and showingit is positive. As a result, we see that the external noise floor is a monotonic decreasing function ofincreasing N , which can be interpreted as decreasing cell cycle length variability.Next we show that CV s is not monotonic in N for single cell lineage measurements. Recalling Eqs.(57) and (64), we can re-write the internal noise component and protein mean using u and N as E s, int ( u, N ) = 1 + 2 α − αN (2 − u )3 N (6 − u + u ) + 3(2 − u ) , (96) (cid:104) n (cid:105) Hs = (cid:18) − u + u − u N (cid:19) αy. (97)Note that it then follows that d E s, int d N = − (cid:32) − u ) N ( u − u + 6) + (2 − u )) (cid:33) α < , (98) d (cid:104) n (cid:105) Hs d N = − (cid:18) − u N (cid:19) αy < , (99)which does not allow us to immediately make a conclusion about the behaviour of the ratio of the twoquantities. Therefore we note that the internal component of the coefficient of variation squared infact grows as N increases as shown by d( E s, int / (cid:104) n (cid:105) Hs )d N = 2(2 − u ) (cid:0) α (cid:0) N (cid:0) u − u + 2 (cid:1) + 3(2 − u ) (cid:1) + 3 N (cid:0) u − u + 6 (cid:1) + 3(2 − u ) (cid:1) αy ( N ( u − u + 6) + (2 − u )) > . (100)Since the external component of the coefficient of variation squared has earlier been proved to decreasewith N , it follows that for some positive functions a , a of u, N, α we find d CV s d N = a y − a , (101)which shows that CV s can be both increasing or decreasing as a function of N , depending on the exactvalues of α, y . In fact, by considering a and a and their extreme behaviour (which is for u = 1 ) onecan show that for y > N N + 1 α N N ) , (102)the protein noise for lineages is monotonically decreasing, whereas when this condition is not satisfiedwe can find u = M/N ∈ [0 , so that, perhaps counter-intuitively, decreasing cell cycle variability, i.e.increasing N , leads to an increase in protein noise. On the other hand this shows that if mRNA pro-duction is large enough the protein noise does become monotonic as a function of cell cycle variability. Effect of cell cycle variability N for population snapshots For population snapshots we note from results in Appendix D, Eqs. (130) and (133) in particular, thatthe internal noise component is composed of E p, int ( u, N ) = 1 + 2 α − α (2 − u )2 u N (cid:0) /N − (cid:1) /N , (103) (cid:104) n (cid:105) Hp = 2 − u N (2 /N − αy. (104)22rom here it is easy to show that (cid:104) n (cid:105) Hp and E p, int ( u, N ) monotonically grow and decay respectively as N increases. As a result we note that their ratio, the internal noise component ( E p, int / (cid:104) n (cid:105) Hp ) , decaysmonotonically when N increases, i.e. the opposite scenario to what happens for single cell lineagemeasurements.Finally we will show that E p, ext decays monotonically when N increases and as a result that, incontrast to the lineage framework, the total protein noise CV p is always decreasing as a function ofincreasing N , regardless of the values of α, y . We note that this proof is more cumbersome than forthe lineage measurements and we only sketch the details here. We start by noting that, with u ∈ [0 , as before, E p, ext ( u, N ) = − u N (cid:0) /N − (cid:1) /N (cid:16) u (cid:16) (cid:16) N − (cid:17) N ( u − + 2 N (16 − u ) + 3 u − (cid:17) − · /N (cid:17) + ( − u + 3 · u − (105)To make computations slightly more tractable we then make the transformation to q = 2 /N (whichyields q > for N > and q → when N → ∞ ) and note that d q/ d N < for all N . This leaves usto show that d E p, ext / d q > for all q > in order to show that the external noise component satisfies d E p, ext / d N < . We start with h ( q ) = d E p, ext d q = 2 u log 26( q log q ) (cid:124) (cid:123)(cid:122) (cid:125) > h ( q ) . (106)From here we note that lim q ↓ h ( q ) = lim q ↓ h (cid:48) ( q ) = lim q ↓ h (cid:48)(cid:48) ( q ) = 0 and d h d q = 1 q (cid:124)(cid:123)(cid:122)(cid:125) > h ( q ) . (107)The new function h satisfies for u ∈ [0 , q ↓ h ( q ) = 6 (cid:0) u (cid:0) u log(8) − u log(2) + 2 + log(4096) (cid:1) − (cid:1) > h d q = 1 q (cid:124)(cid:123)(cid:122)(cid:125) > h ( q ) . In a similar fashion we proceed to show that for h and u ∈ [0 , we can write lim q ↓ h ( q ) = 2 (2 u ( u (( u −
4) log(8) −
48) + 78 + log(4096)) − > q ↓ h (cid:48) ( q ) = 2 (cid:0) · u (cid:0) u log(2) − u (11 + log(4)) + 38 + log(16) (cid:1) − (cid:1) > h d q = 8 q (cid:124)(cid:123)(cid:122)(cid:125) > h ( q ) , where h ( q ) = q (2 u (16 − u ) −
12) + ( − · u u + 5 · u − is a positive linear function in q when u ∈ [0 , . By arriving at h > for q > we find cascading back that h (cid:48)(cid:48) , h (cid:48) , h > for q > . Thisthen proves h (cid:48) , h > for q > which in turn shows that h (cid:48)(cid:48)(cid:48) , h (cid:48)(cid:48) , h (cid:48) , h > for q > . Finally havingproven that h > we find d E p, ext / d q = h > for q > , which was what we needed to prove to showthat d E p, ext / d N < . 23 Population snapshots
In this Appendix, we correct for the fact that in a population observation, the cell phases and ob-servation times differ from those in single-cell measurements. In order to construct the PGF for thepopulation observations we start from the expression G p ( z ) = N (cid:88) i =1 Π i,p G i,p ( z ) , (108)where G i,p ( z ) denotes the PGF for phase i in the population measurement case and Π i,p is the proba-bility to observe a cell in phase i . Note that we take the distributions for each phase to be normalisedto unity to reflect that the G i,p are in fact the marginal distributions for protein in each phase i . Wewill next show how to derive expressions for G i ( z ) and Π i,p . Phase distribution
While the probability of observing a cell in cell cycle phase i , denoted as Π i , was obvious whenperforming single-cell measurements, this is not the case for population measurements. This is sinceeach time a cell divides, two cells start in phase and hence we generally expect that the probability ofbeing in phase i decreases with i . We next derive an expression for the probability Π i,p for populationmeasurements.Let the average number of cells under cyclo-stationary assumptions in cell cycle phase i at time t be denoted by (cid:104) C i ( t ) (cid:105) . Then it immediately follows from the model specification that when each phaseof the cell cycle is exponentially distributed with parameter k , the time-evolution equations are givenby d (cid:104) C ( t ) (cid:105) d t = − k (cid:104) C ( t ) (cid:105) + 2 k (cid:104) C N ( t ) (cid:105) , (109) d (cid:104) C i ( t ) (cid:105) d t = − k (cid:104) C i ( t ) (cid:105) + k (cid:104) C i − ( t ) (cid:105) , i = 2 , . . . , N. (110)The factor 2 in the first equation stands for cellular division: every time a cell divides (leaving phase N ), two cells start in phase 1. In the case that we are tracking single cell lineages then the factor 2 isreplaced by 1. More generally, for cases with asymmetric division (after division some cells differentiate)this factor 2 can be replaced by a factor ν ∈ [0 , . We are interested in finding the probability that acell is in phase i , which is given by Π i ( t ) = (cid:104) C i ( t ) (cid:105) (cid:80) Ni =1 (cid:104) C i ( t ) (cid:105) . (111)Note that this expression is valid for both population snapshot and lineage observations. While thisquantity will change with time, it will eventually approach a steady-state value and this is what weare here interested in. We make the ansatz that given long enough time, (cid:104) C i ( t ) (cid:105) = λ i (cid:104) C ( t ) (cid:105) where λ i are some time-independent constants which need to be determined (except for λ = 1 which followsimmediately). Substituting this assumption in Eq. (111) we find Π i,p = λ i (cid:80) Nj =2 λ j . (112)Hence next we determine the values of λ i . Substituting the ansatz in Eqs. (109)-(110), we find therelationship: λ i − λ i = 2 λ N , i = 2 , . . . , N, (113)with λ = 1 . Solving this recurrence relation one finds λ i = 2 (1 − i ) /N . (114)24ubstituting the latter expression in Eq. (112) and simplifying one finally obtains Π i,p = 2 /N − ( i/N ) − , i = 1 , . . . , N, (115)which in the limit of N → ∞ yields the familiar age structure in a population which doubles at mitosis,i.e. f ( t ) = 2 − t/T log 2 /T . Note that this result is in contrast to the phase distribution in the singlecell measurement case, which was given by a uniform distribution, i.e. Π i,s = 1 /N for all i . In apopulation we are more likely to observe cells in an early cell phase compared to lineage data. Notethat incidentally we can derive the population growth rate (see next Section) from this formalism,which is necessary to derive the age distribution in a population for each phase. Population growth rate
To derive the population growth rate, we consider (cid:104) C ( t ) (cid:105) = (cid:80) i (cid:104) C i ( t ) (cid:105) , the expected total number ofcells, i.e. the size of the population. From Eqs. (109) and (110) we then find in the long-time limitwhen the proportion of cells in each phase relative to the total number of cells, i.e. Π i,p , becomes stablethat the growth of the cell population is given by d (cid:104) C ( t ) (cid:105) d t = k (cid:104) C N ( t ) (cid:105) = k Π N,p (cid:104) C ( t ) (cid:105) = k (2 /N − (cid:104) C ( t ) (cid:105) . (116)This shows that the average number of cells in the population grows like (cid:104) C ( t ) (cid:105) ∝ e λt where λ = k (2 /N − . Note that in the limit of N → ∞ and k = N/T this becomes λ = log 2 /T , showing thatthe population doubles in size after every cell cycle when we consider the case of a deterministic cellcycle of length T . Population age distribution
Let’s consider (cid:104) C i ( t, τ ) (cid:105) the average number of cells in a population that are in cell phase i at time t that have been in that cell phase for a duration τ , i.e. are of an age τ . After a small time duration δ ,all the cells will either advance to an age τ + δ or advance to the next cell phase. Therefore we canwrite the conservation equation (cid:104) C i ( t + δ, τ + δ ) (cid:105) = (cid:104) C i ( t, τ ) (cid:105) − (cid:104) C i ( t, τ ) (cid:105) kδ, (117)where k is the rate of advancing to the next cell phase. Assuming that there is a stationary distributionfor the age of the cell population at a phase i , π i ( τ ) , we can write (cid:104) C i ( t, τ ) (cid:105) as (cid:104) C i ( t, τ ) (cid:105) = (cid:104) C i ( t ) (cid:105) π i ( τ ) ,where (cid:104) C i ( t ) (cid:105) as before is the expected number of cells in cell phase i at time t . Introducing thisfactorisation of (cid:104) C i ( t, τ ) (cid:105) in Eq. (117), and taking the limit δ → , we get the differential equation d (cid:104) C i ( t ) (cid:105) d t π i ( τ ) + d π i ( τ )d τ (cid:104) C i ( τ ) (cid:105) = −(cid:104) C i ( t ) (cid:105) π i ( τ ) k, (118)where we have used the chain rule to compute the derivative of d (cid:104) C i ( x, x ) (cid:105) / d x | t,τ . Given that in thelong time limit, the population of cells is growing at an average rate λ , which we derived in the previoussection, and since the probability of finding a cell in a certain phase i , Π i , is constant in time, thenumber of cells in a given phase has to grow with the same rate as the population. This implies thattherefore d (cid:104) C i ( t ) (cid:105) / d t = λ (cid:104) C i ( t ) (cid:105) . Introducing this equality in Eq. (118), we get an equation solely for π i ( τ ) λπ i ( τ ) + d π i ( τ )d τ = − π i ( τ ) k, (119)which can be solved by π i ( τ ) = ( λ + k ) e − ( λ + k ) τ = k /N e − k /N τ . (120)25inally we note that there is a direct link between the age distribution at measurement and thephase length distribution on a population level. In order to achieve a stable distribution for the ageat measurement in the long-time limit we note that the population phase length distribution has tobe consistent with the age distribution. Since the population age distribution at observation in eachphase is still exponential, but now with rate k /N , the distribution for the time it takes to progressto the next cell phase in a population must also be exponentially distributed with the same rate. Thismeans that in a population of cells we observe a quicker progression through the phases (in distributionon a population level) than when following a single cell. This can be understood from the observationthat cells that quickly progress through their phases (and thus divide quickly) will be more abundantin populations than in single cell lineages. Population snapshot observation distribution
By the results of the previous section, we can derive the distribution for population snapshot measure-ments from the single cell measurement statistics via a rescaling of the phase progression rates. Thedistribution for the length of the cell phases is still exponential which means the Markovian natureof the problem remains intact in the population case. This implies that in all the equations for thePGF derived earlier (for single lineages) we replace k (cid:55)→ k /N to find the corresponding populationequations. The recipe for calculating population observations therefore is1 To obtain G i,p ( z ) replace k (cid:55)→ k /N in the expressions for the lineage distributions G i,s ( z ) , e.g.in Eq. (50).2 Average over the different phases via Eq. (108) using Π i,p from Eq. (115).Using these steps we can then solve for the snapshot observation distribution for Model IV withreplication which we will write as G p ( z ) = G N,p ( z ) W ( z ) , where G N,p ( z ) = ∞ (cid:89) s =0 (cid:18) − xyx ( N /N + 2 y ) − N /N s (cid:19) N − M (cid:18) − xyx ( N /N + y ) − N /N s (cid:19) M , (121) W ( z ) = (cid:16) N − (cid:17) N ( x − × (cid:16) xyN ( x − + 2 N (cid:17) N − M (cid:18)(cid:16) xyN ( x − + 2 N (cid:17) M − (cid:19)(cid:16) N − (cid:17) N ( x −
1) + xy + (cid:16) xyN ( x − + 2 N (cid:17) N − M − (cid:16) N − (cid:17) N ( x −
1) + 2 xy . (122)This result then yields the explicit snapshot distribution for all the other models considered in thispaper. We recover the special case of Model III by taking M = N which in turn simplifies theobservation distribution to G N,p ( z ) = ∞ (cid:89) s =0 (cid:18) − xyx ( N /N + y ) − N /N s (cid:19) N , (123) W ( z ) = (cid:16) N − (cid:17) N ( x − (cid:16) xyN ( x − + 2 N (cid:17) N − (cid:16) N − (cid:17) N ( x −
1) + xy . (124)Finally we obtain a generalised version of Model II by taking N → ∞ ; we call this generalisedbecause unlike the one in the main text, here for completeness we return to u = M/N so that we can26odel replication in Model II. This yields G N,p ( z ) = ∞ (cid:89) s =0 exp (cid:18) (1 − u )2 xy s − x (cid:19) exp (cid:18) uxy s − x (cid:19) = exp (cid:32) (2(1 − u ) + u ) xy ∞ (cid:88) s =0 s − x (cid:33) , (125) W ( z ) = (log 2) ( x − exp (cid:16) (1 − u )( x (2 y +log 2) − log 2) x − (cid:17) (cid:16) exp (cid:16) u ( x ( y +log 2) − log 2) x − (cid:17) − (cid:17) x ( y + log 2) − log 2+ exp (cid:16) (1 − u )( x (2 y +log 2) − log 2) x − (cid:17) − x (2 y + log 2) − log 2 . (126)Note that for u = 1 this further simplifies to G p ( z ) = ( x −
1) log 2 xy + ( x −
1) log 2 (cid:18) exp (cid:18) xy + ( x −
1) log 2 x − (cid:19) − (cid:19) exp (cid:32) xy ∞ (cid:88) s =0 s − x (cid:33) , (127)which is the population equivalent of Model II presented in the main text. In particular we can nowwrite the mean of protein numbers for generalised Model II, III and IV respectively (cid:104) n (cid:105) p = 2 − u log 2 αy, (128) (cid:104) n (cid:105) Ep = 1 N (2 /N − αy, (129) (cid:104) n (cid:105) Hp = 2 − ( M/N ) N (2 /N − αy, (130)where we recall u = M/N in the limit of N → ∞ . We can also write expressions for the variances forgeneralised Models II, III and IV σ p = 2 − u log 2 αy + 4 ( u log 2 − − u α y + 2 − u (cid:0) − u (cid:0) u log − u (2) + 4 log 2 (cid:1) + 2 u +1 (3 + 2 log 2)) − (cid:1) log α y , (131) (cid:0) σ Ep (cid:1) = 1 N (2 /N − αy + (cid:16) N − (cid:17) N −
13 2 − N α y + (cid:16) N − (cid:17) N − − N (3 N + 1)3 N α y , (132) (cid:0) σ Hp (cid:1) = 2 − ( M/N ) N (2 /N − αy + 4 (cid:16) · − MN + 2 − /N (2 N − M ) + M − N (cid:17) (cid:16) N − (cid:17) N α y + − − N (cid:16) (cid:16) N − (cid:17) (cid:16) − M + 2 N (cid:0) − M N + ( M − M + 4 N (cid:1) + 4 M N + M − N (cid:17)(cid:17) (cid:16) N − (cid:17) N + (cid:16) MN (cid:16) − N +2 N + 4 N + 2 N +2 N (8 N + 3) (cid:17) + 3 (cid:16) N (2 N + 3) − N (cid:17) M + N +1 N − N +2 N (cid:17) (cid:16) N − (cid:17) N α y . (133)27ote that generalised Model II for u = 1 simplifies to Model II in the main text (cid:104) n (cid:105) p = αy log 2 , σ p = 1log 2 αy + 6 − α y + 1 − log α y (cid:39) . αy + 1 . α y + 0 . α y . References [1] Michael B Elowitz, Arnold J Levine, Eric D Siggia, and Peter S Swain. Stochastic gene expressionin a single cell.
Science , 297(5584):1183–1186, 2002.[2] Mads Kærn, Timothy C Elston, William J Blake, and James J Collins. Stochasticity in geneexpression: from theories to phenotypes.
Nature Reviews Genetics , 6(6):451, 2005.[3] David Schnoerr, Guido Sanguinetti, and Ramon Grima. Approximation and inference methodsfor stochastic biochemical kinetics—a tutorial review.
Journal of Physics A: Mathematical andTheoretical , 50(9):093001, 2017.[4] Vahid Shahrezaei and Peter S Swain. Analytical distributions for stochastic gene expression.
Proceedings of the National Academy of Sciences , 105(45):17256–17261, 2008.[5] Justine Dattani and Mauricio Barahona. Stochastic models of gene transcription with upstreamdrives: exact solution and sample path characterization.
Journal of The Royal Society Interface ,14(126):20160833, 2017.[6] Nikola Popović, Carsten Marr, and Peter S Swain. A geometric analysis of fast-slow models forstochastic gene expression.
Journal of mathematical biology , 72(1-2):87–122, 2016.[7] Dmitri Bratsun, Dmitri Volfson, Lev S Tsimring, and Jeff Hasty. Delay-induced stochastic oscilla-tions in gene regulation.
Proceedings of the National Academy of Sciences , 102(41):14593–14598,2005.[8] Niraj Kumar, Abhyudai Singh, and Rahul V Kulkarni. Transcriptional bursting in gene expression:analytical results for general stochastic models.
PLoS computational biology , 11(10):e1004292,2015.[9] Tianhai Tian and Kevin Burrage. Stochastic models for regulatory networks of the genetic toggleswitch.
Proceedings of the national Academy of Sciences , 103(22):8372–8377, 2006.[10] Ramon Grima, Deena R Schmidt, and Timothy J Newman. Steady-state fluctuations of a geneticfeedback loop: An exact solution.
The Journal of chemical physics , 137(3):035104, 2012.[11] Philipp Thomas, Nikola Popović, and Ramon Grima. Phenotypic switching in gene regulatorynetworks.
Proceedings of the National Academy of Sciences , 111(19):6994–6999, 2014.[12] Zhixing Cao and Ramon Grima. Linear mapping approximation of gene regulatory networks withstochastic dynamics.
Nature communications , 9(1):3305, 2018.[13] Ido Golding, Johan Paulsson, Scott M Zawilski, and Edward C Cox. Real-time kinetics of geneactivity in individual bacteria.
Cell , 123(6):1025–1036, 2005.[14] Nir Friedman, Long Cai, and X Sunney Xie. Linking stochastic dynamics to population dis-tribution: an analytical framework of gene expression.
Physical review letters , 97(16):168302,2006.[15] MR Maurizi. Proteases and protein degradation inescherichia coli.
Experientia , 48(2):178–201,1992. 2816] Archana Belle, Amos Tanay, Ledion Bitincka, Ron Shamir, and Erin K O’Shea. Quantification ofprotein half-lives in the budding yeast proteome.
Proceedings of the National Academy of Sciences ,103(35):13004–13009, 2006.[17] Otto G Berg. A model for the statistical fluctuations of protein numbers in a microbial population.
Journal of theoretical biology , 71(4):587–603, 1978.[18] Iain G Johnston and Nick S Jones. Closed-form stochastic solutions for non-equilibrium dynamicsand inheritance of cellular components over many cell divisions.
Proceedings of the Royal SocietyA: Mathematical, Physical and Engineering Sciences , 471(2180):20150050, 2015.[19] Adrienne HK Roeder, Vijay Chickarmane, Alexandre Cunha, Boguslaw Obara, BS Manjunath,and Elliot M Meyerowitz. Variability in the control of cell division underlies sepal epidermalpatterning in arabidopsis thaliana.
PLoS biology , 8(5):e1000367, 2010.[20] A Golubev. Applications and implications of the exponentially modified gamma distribution as amodel for time variabilities related to cell proliferation and gene expression.
Journal of theoreticalbiology , 393:203–217, 2016.[21] Duarte Antunes and Abhyudai Singh. Quantifying gene expression variability arising from ran-domness in cell division times.
Journal of Mathematical Biology , 71(2):437–463, 2015.[22] Mohammad Soltani and Abhyudai Singh. Effects of cell-cycle-dependent expression on randomfluctuations in protein levels.
Royal Society open science , 3(12):160578, 2016.[23] Mohammad Soltani, Cesar A Vargas-Garcia, Duarte Antunes, and Abhyudai Singh. Intercellu-lar variability in protein levels from stochastic expression and noisy cell cycle processes.
PLoScomputational biology , 12(8):e1004972, 2016.[24] Lydia Robert, Jean Ollion, Jérôme Robert, Xiaohu Song, Ivan Matic, and Marina Elez. Mutationdynamics and fitness effects followed in single cells.
Science , 359(6381):1283–1286, 2018.[25] John RS Newman, Sina Ghaemmaghami, Jan Ihmels, David K Breslow, Matthew Noble, Joseph LDeRisi, and Jonathan S Weissman. Single-cell proteomic analysis of s. cerevisiae reveals thearchitecture of biological noise.
Nature , 441(7095):840, 2006.[26] Jakub Jędrak, Maciej Kwiatkowski, and Anna Ochab-Marcinek. Exactly solvable model of geneexpression in a proliferating bacterial cell population with stochastic protein bursts and proteinpartitioning.
Physical Review E , 99(4):042416, 2019.[27] Philipp Thomas. Making sense of snapshot data: ergodic principle for clonal cell populations.
Journal of The Royal Society Interface , 14(136):20170467, 2017.[28] Christian A Yates, Matthew J Ford, and Richard L Mort. A multi-stage representation of cellproliferation as a markov process.
Bulletin of mathematical biology , 79(12):2905–2928, 2017.[29] Ji Yu, Jie Xiao, Xiaojia Ren, Kaiqin Lao, and X Sunney Xie. Probing gene expression in livecells, one protein molecule at a time.
Science , 311(5767):1600–1603, 2006.[30] Ping Wang, Stacey Hayden, and Yoshio Masui. Transition of the blastomere cell cycle from cellsize-independent to size-dependent control at the midblastula stage in xenopus laevis.
Journal ofExperimental Zoology , 287(2):128–144, 2000.[31] Clotilde Cadart, Sylvain Monnier, Jacopo Grilli, Pablo J Sáez, Nishit Srivastava, Rafaele Attia,Emmanuel Terriac, Buzz Baum, Marco Cosentino-Lagomarsino, and Matthieu Piel. Size controlin mammalian cells involves modulation of both growth rate and cell cycle duration.
Naturecommunications , 9(1):3275, 2018. 2932] Yuichi Taniguchi, Paul J Choi, Gene-Wei Li, Huiyi Chen, Mohan Babu, Jeremy Hearn, AndrewEmili, and X Sunney Xie. Quantifying e. coli proteome and transcriptome with single-moleculesensitivity in single cells.
Science , 329(5991):533–538, 2010.[33] Björn Schwanhäusser, Dorothea Busse, Na Li, Gunnar Dittmar, Johannes Schuchhardt, JanaWolf, Wei Chen, and Matthias Selbach. Global quantification of mammalian gene expressioncontrol.
Nature , 473(7347):337, 2011.[34] Christian Miller, Björn Schwalb, Kerstin Maier, Daniel Schulz, Sebastian Dümcke, BenediktZacher, Andreas Mayer, Jasmin Sydow, Lisa Marcinowski, Lars Dölken, et al. Dynamic transcrip-tome analysis measures rates of mrna synthesis and decay in yeast.
Molecular systems biology ,7(1), 2011.[35] Mikihiro Hashimoto, Takashi Nozoe, Hidenori Nakaoka, Reiko Okura, Sayo Akiyoshi, KunihikoKaneko, Edo Kussell, and Yuichi Wakamoto. Noise-driven growth rate gain in clonal cellularpopulations.
Proceedings of the National Academy of Sciences , 113(12):3251–3256, 2016.[36] Dann Huh and Johan Paulsson. Non-genetic heterogeneity from stochastic partitioning at celldivision.
Nature genetics , 43(2):95, 2011.[37] Dan Siegal-Gaskins and Sean Crosson. Tightly regulated and heritable division control in singlebacterial cells.
Biophysical Journal , 95(4):2063–2072, 2008.[38] Bram Cerulus, Aaron M New, Ksenia Pougach, and Kevin J Verstrepen. Noise and epigeneticinheritance of single-cell division times influence population fitness.
Current Biology , 26(9):1138–1147, 2016. 30 (a) - (b)Figure 1: (a) Plot of the relative error between the steady-state variance of Model I and the cyclo-stationary varianceof Model II as a function of the mean burst size ( α ) and the mean number of mRNAs produced in one cell cycle ( y ),assuming lineage measurements. (b) Plot of the relative error between the skewness of Model I and II as a function of α and y . Note that for burst sizes α (cid:39) , the relative errors are a strong function of y , with only a weak dependenceon α . Model I in all cases underestimates the variance of Model II. In contrast, Model I underestimates the skewness ofModel II for small y and overestimates it for large y . The relative error in the variance is defined as ( σ s − σ NB,s ) /σ s and the relative error in the skewness is defined as ( S s − S NB,s ) /S s . (a) α = 1 , y = 5 (b) α = 1 , y = 20 (c) α = 1 , y = 40
50 100 150 200 2500.0000.0050.0100.0150.0200.025 (d) α = 1 , y = 100 Figure 2:
Comparison of the protein number distributions (assuming lineage measurements) of Model I and Model II insteady-state and cyclo-stationary conditions, respectively. For Model I (red line), the distribution is NB (3 y/ , α/ (1 + α )) and for Model II (black line) it is given by P ( n ) = (1 /n !) dG ns /dz n | z =0 where G s is given by Eq. (12). Note that thenumerical computation of the distribution from Eq. (12) can be greatly accelerated (whilst maintaining accuracy) if thesum over s is truncated to a few tens of terms. The dots show the distributions of Model I and Model II obtained fromstochastic simulations (simulations of Model I are done using the conventional SSA and those of Model II are done usinga modified SSA, see Appendix B and also at the end of this caption). The open green circles show the negative binomialdistribution which has the same first and second moments as the distribution of Model II. The stochastic simulationsof Model II are performed as follows. Initially we have a single cell with zero protein. We measure the protein contentof the cell at intervals T/Z where Z = 10 π . Each time a cell divides, we follow only one of the daughter cells. Thesimulation is run until cycles have passed and a histogram is calculated from this data (we discard the first cellcycles to ignore any possible transients). The cell cycle length is T = 1 in all cases. All lineage simulations in this articleuse this protocol, unless otherwise stated.
20 40 60 80 1000.000.010.020.030.040.05 0 20 40 60 80 1000.000.010.020.030.040.050 100 200 300 4000.0000.0020.0040.0060.0080.0100.012 0 100 200 300 4000.0000.0020.0040.0060.0080.0100.0120 100 200 300 400 500 600 7000.0000.0010.0020.0030.0040.0050.006 0 100 200 300 400 500 600 7000.0000.0010.0020.0030.0040.0050.006
Figure 3:
Plots of the protein number distribution P ( n ) for Model II (blue line) and Model III (black, green, cyanlines) with single lineage (a-c) and population observations (d-f). The solid dots show the distributions obtained fromstochastic simulations using the SSA (see Appendix B) which agree with those from theory in all cases. Note that in thelimit of large number of cycle cycle phases N (cid:29) , Model III approaches Model II since the cell cycle length variabilitytends to zero. The non negative binomial nature of Model II for large y (the flat region near the mode in b and c and theright shoulder in e and f) is washed away as the cell cycle length variability increases, i.e. as N decreases in Model III.Note that the numerical computation of the distribution of Model III from Eqs. (37) and (45) can be greatly accelerated(whilst maintaining accuracy) if the infinite product is truncated to a few tens of terms.
20 40 60 80 100 1200.000.010.020.030.040.05
Figure 4:
Plots of the protein distribution at different points in the cell cycle for Model IV with an Erlang distributedcell cycle (10 phases) and replication occurring at the start of phase 6. The distributions (solid lines) are calculated fromthe theory for lineage observations while the dots show the same calculated from a single trajectory generated by theSSA for cell cycles. Note that the theory is for Model IV Eq. (54) with i = 1 , , , k i = k = N/T and α i = α forall i , and r i = r, i ∈ [1 , M ] and r i = 2 r, i ∈ [ M + 1 , N ] , where M is the cell cycle stage in which replication occurs. Theparameters are N = 10 , M = 5 , α = 1 , y = 20 , T = 1 . .0 0.2 0.4 0.6 0.8 1.00.040.050.060.070.080.090.10 (a) (b)Figure 5: Effects of the position of the replication point in the cell cycle on the size of protein fluctuations for lineageobservations. Plots of the coefficient of variation squared CV s as a function of the mean burst size α , the mean numberof mRNAs produced in a cell cycle y and the fraction of the cell cycle in the pre-replication stage M/N for Model IV(same setup as previous figure). The mean and variance for the computation of the CV s are given by Eqs. (57) and (58).A comparison of (a) and (b) shows that the CV s decreases with increasing α and y . However the CV s has a complexdependence on M : for large y , CV s is roughly independent of M while it increases with M for small y . Solid dots showthe results of the SSA and solid lines show the theory. a ) ( b ) ( c ) ( d ) Figure 6:
Relationship between Model III and Model IV for lineage (denoted by subscript s in figure) and populationobservations (denoted by subscript p in figure). (a) renormalising the parameter y in Model III (lineage) by changing itto Λ s y , we find that the variance agrees to a good degree of approximation with the variance of Model IV (lineage) forall parameter sets tested. The mean of the two models is exactly the same under this renormalisation. The same appliesfor population observations (not shown). (b) The protein distribution of Model IV and of the renormalised Model IIIare also in excellent agreement (these are obtained from the PGF solutions of both models) for small y and N . Note therenormalisation factors are given by Eq. (60) for lineage and Λ p = 2 − M/N for population observations. (c) In contrastthe distributions of Model IV (lineage) and of renormalised Model III (lineage) are very different when y and N arelarge. In (d) we show that the same conclusion holds for population observations. In all cases, dots show the distributionobtained from the SSA is in good agreement with the theory (solid or dashed lines) for Models III and IV. Parametersets are as follows. The 6 parameter sets for Model IV in (a) are: (1) N = 4 , α = 10 , y = 10 , (2) N = 4 , α = 20 , y = 100 ,(3) N = 20 , α = 1 , y = 10 , (4) N = 20 , α = 2 , y = 50 , (5) N = 10 , α = 5 , y = 100 , and (6) N = 10 , α = 2 , y = 10 ;for renormalised Model III, the parameters are the same except that y is renormalised. For each parameter set we take M = 0 , . . . , N . Parameters for (b) are N = 2 , α = 1 . , y = 20 , M = 1 for Model IV and same but with y = 20Λ s,p for renormalised Model III. Parameters for (c, d) are N → ∞ , α = 1 . , y = 100 , M = 0 . N for Model IV and samefor renormalised Model III, but with y = 100Λ s,p . Population snapshot SSA data for Model IV consists of ∼ cellsstarting from a single cell with zero protein content (see Appendix B for details of the simulations). (a) (b) (c)Figure 7: Effects of the cell cycle variability on the size of protein fluctuations showing the difference between lineage(solid) and population snapshot (dotted) observations. Plots of the coefficient of variation squared CV as a functionof the mean burst size α , the mean number of mRNAs produced in a cell cycle y and the fraction of the cell cycle inthe pre-replication stage u = M/N . The mean and variance for the computation of the CV s are given by Eqs. (57)and (58), whereas for CV p we have used Eqs. (130) and (133). A comparison of (a), (b) and (c) shows that the CV s can be both increasing and decreasing with increasing N , but CV p is always decreasing with increasing N , confirmingtheoretical predictions. For large values of y , we see that the protein noise only weakly depends on u , which can beunderstood from noting that this is the regime in which protein noise is dominated by the extrinsic noise floor., which can beunderstood from noting that this is the regime in which protein noise is dominated by the extrinsic noise floor.