The Estimation of the Effective Reproductive Number from Disease Outbreak Data
Ariel Cintrón-Arias, Carlos Castillo-Chávez, Luís M. A. Bettencourt, Alun L. Lloyd, H. T. Banks
TThe Estimation of the Effective Reproductive Number from DiseaseOutbreak Data
Ariel Cintr´on-Arias , ∗ , Carlos Castillo-Ch´avez , Lu´ıs M. A. Bettencourt ,Alun L. Lloyd , and H. T. Banks Statistical and Applied Mathematical Sciences Institute,19 T. W. Alexander Drive, P.O. Box 14006,Research Triangle Park, NC 27709-4006 Department of Mathematics and Statistics,Arizona State University, P.O. Box 871804,Tempe, AZ 85287 - 1804 Theoretical Division, Mathematical Modeling and Analysis (T-7),Los Alamos National Laboratory, Mail Stop B284,Los Alamos, NM 87545 Biomathematics Graduate Program and Department of Mathematics,North Carolina State University, Raleigh, NC 27695 Center for Research in Scientific Computation,North Carolina State University, P.O. Box 8205,Raleigh, NC 27695April 22, 2008 ∗ corresponding author: [email protected] a r X i v : . [ q - b i o . P E ] A p r bstract We consider a single outbreak susceptible-infected-recovered (SIR) model and correspondingestimation procedures for the effective reproductive number R ( t ). We discuss the estimationof the underlying SIR parameters with a generalized least squares (GLS) estimation technique.We do this in the context of appropriate statistical models for the measurement process. We useasymptotic statistical theories to derive the mean and variance of the limiting (Gaussian) sam-pling distribution and to perform post statistical analysis of the inverse problems. We illustratethe ideas and pitfalls (e.g., large condition numbers on the corresponding Fisher informationmatrix) with both synthetic and influenza incidence data sets. Keywords
Effective reproductive number, basic reproduction ratio, reproduction number, R , R ( t ), R ,parameter estimation, ordinary least squares, generalized least squares. The transmissibility of an infection can be quantified by its basic reproductive number, R ,defined as the mean number of secondary infections seeded by a typical infective into a com-pletely susceptible (naive) host population [1, 14, 20]. For many simple epidemic processes, thisparameter determines a threshold: whenever R >
1, a typical infective gives rise to more thanone secondary infection, leading to an epidemic. In contrast, when R <
1, infectives typicallygive rise to less than one secondary infection and the prevalence of infection cannot increase.Due to the natural history of some infections, transmissibility is better quantified by theeffective—rather than the basic—reproductive number. For instance, exposure to influenza inprevious years confers some cross-immunity [11, 17, 25]; the strength of this protection de-pends on the antigenic similarity between the current year’s strain of influenza and earlier ones.Consequently, the population is non-naive and so it is more appropriate to consider the effec-tive reproductive number, R ( t ), a time-dependent quantity that accounts for the population’sreduced susceptibility.Our goal is to develop a methodology for the estimation of R ( t ) that also provides a measureof the uncertainty in the estimates. We apply the proposed methodology in the context ofannual influenza outbreaks, focusing on data for influenza A (H3N2) viruses, which were, with he exception of the influenza seasons 2000-1 and 2002-3, the dominant flu subtype in the USover the period from 1997 to 2005 [9, 29].The estimation of reproductive numbers is typically an indirect process because some of theparameters on which these numbers depend are difficult, if not impossible, to quantify directly.A commonly used indirect approach involves fitting a model to some epidemiological data,providing estimates of the required parameters.In this study we estimate the effective reproductive number by fitting a deterministic epi-demiological model employing either an Ordinary Least-Squares (OLS) or a Generalized Least-Squares (GLS) estimation scheme to obtain estimates of model parameters. Statistical asymp-totic theory [13, 27] and sensitivity analysis [12, 26] are then applied to give approximate sam-pling distributions for the estimated parameters. Uncertainty in the estimates of R ( t ) is thenquantified by drawing parameters from these sampling distributions, simulating the correspond-ing deterministic model and then calculating effective reproductive numbers. In this way, thesampling distribution of the effective reproductive number is constructed at any desired timepoint.The statistical methodology provides a framework within which the adequacy of the param-eter estimates can be formally assessed for a given data set. We shall present instances in whichthe fitted model appears to provide an adequate fit to a given data set but where the statis-tics reveal that the parameter estimates have very high levels of uncertainty. We also discusssituations in which the fitted model appears, at least visually, to provide an adequate fit andwhere the statistics suggest that the uncertainty in the parameters is not so large but that, inreality, a poor fit has been achieved. We discuss the use of residuals plots as a diagnostic forthis outcome, which highlights the problems that arise when the assumptions of the statisticalmodel underlying the estimation framework are violated.This manuscript is organized as follows: In Section 2 the data sets are introduced. A single-outbreak deterministic model is introduced in Section 3. Section 4 introduces the least squaresestimation methodology used to estimate values for the parameters and quantify the uncertaintyin these estimates. Our methodology for obtaining estimates of R ( t ) and its uncertainty is alsodescribed. Use of these schemes is illustrated in Section 5, in which they are applied to syntheticdata sets. Section 6 applies the estimation machinery to the influenza incidence data sets. Weconclude with a discussion of the methodologies and their application to the data sets. N u m b e r o f H N i s o l a t es Figure 1: Influenza isolates reported by the CDC in the US during the 1999-2000 season [9]. Thenumber of H3N2 cases (isolates) is displayed as a function of time. Time is measured as the numberof weeks since the start of the year’s flu season. For the 1999-2000 flu season, week number onecorresponds to the fortieth week of the year, falling in October.4
Longitudinal Incidence Data
Influenza is one of the most significant infectious diseases of humans, as witnessed by the 1918“Spanish Flu” pandemic, during which 20 to 40 percent of the worldwide population becameinfected. At least 50 million deaths resulted, with 675,000 of these occurring in the US [30]. Theimpact of flu is still significant during inter-pandemic periods: the Centers for Disease Controland Prevention (CDC) estimate that between 5 and 20 percent of the US population becomesinfected annually [9]. These annual flu outbreaks lead to an average of 200,000 hospitalizations(mostly involving young children and the elderly) and mortality that ranges between about 900and 13,000 deaths per year [29].The Influenza Division of the CDC reports weekly information on influenza activity in theUS from calendar week 40 in October through week 20 in May [9], the period referred to asthe influenza season. Because the influenza virus exhibits a high degree of genetic variability,data is not only collected on the number of cases but also on the types of influenza viruses thatare circulating. A sample of viruses isolated from patients undergoes antigenic characterization,with the type, subtype and, in some instances, the strain of the virus being reported [9].The CDC acknowledges that, while these reports may help in mapping influenza activity(whether or not it is increasing or decreasing) throughout the US, they do not provide enoughinformation to calculate how many people became ill with influenza during a given season. TheCDC’s caution likely reflects the uncertainty associated with the sampling process that givesrise to the tested isolates, in particular that this process is not sufficiently standardized acrossspace and time. We return to discuss this point later in this paper.Despite the cautionary remarks by the CDC we use these isolate reports as illustrativedata sets to which we can apply our proposed estimation methodologies. Interpretation of theresults, however, should be mindful of the issues associated with the data. The total number oftested specimens and isolates through various seasons are summarized in Table 1. It is observedthat H3N2 viruses predominated in most seasons with the exception of 2000-1 and 2002-3.Consequently, we focus our attention on the H3N2 subtype. Figure 1 depicts the number ofH3N2 isolates reported over the 1999-2000 influenza season.
The model that we use is the standard Susceptible-Infective-Recovered (SIR) model (see, forexample, [1, 6]). The state variables S ( t ), I ( t ), and R ( t ) denote the number of people who are usceptible, infective, and recovered, respectively, at time t . It is assumed that newly infectedindividuals immediately become infectious and that recovered individuals acquire permanentimmunity. The influenza season, lasting nearly 32 weeks [9], is short compared to the averagelifespan, so we ignore demographic processes (births and deaths) as well as disease-inducedfatalities and assume that the total population size remains constant. The model is given bythe following set of nonlinear differential equations dSdt = − βS IN (1) dIdt = βS IN − γI (2) dRdt = γI. (3)Here, β is the transmission parameter and γ is the (per-capita) rate of recovery, the reciprocalof which gives the average duration of infection. Notice that one of the differential equations isredundant because the three compartments sum to the constant population size: S ( t ) + I ( t ) + R ( t ) = N . We choose to track S ( t ) and I ( t ). The initial conditions of these state variables aredenoted by S ( t ) = S and I ( t ) = I .The equation for the infective population (2) can be rewritten as dIdt = γ ( R ( t ) − I, (4)where R ( t ) = S ( t ) N R and R = β/γ . R ( t ) is known as the effective reproductive number,while R is known as the basic reproductive number. We have that R ( t ) ≤ R , with theupper bound—the basic reproductive number—only being achieved when the entire populationis susceptible.We note that R ( t ) is the product of the per-infective rate at which new infections arise andthe average duration of infection, and so the effective reproductive number gives the averagenumber of secondary infections caused by a single infective, at a given susceptible fraction. Theprevalence of infection increases or decreases according to whether R ( t ) is greater than or lessthan one, respectively. Because there is no replenishment of the susceptible pool in this SIRmodel, R ( t ) decreases over the course of an outbreak as susceptible individuals become infected. Estimation Schemes
In order to calculate R ( t ), it is necessary to know the two epidemiological parameters β and γ ,as well as the number of susceptibles, S ( t ), and the population size, N . As mentioned before,difficulties in the direct estimation of β —whose value reflects the rate at which contacts occurin the population and the probability of transmission occurring when a susceptible and infectivemeet—and direct estimation of S ( t ) preclude direct estimation of R ( t ). As a result, we adoptan indirect approach, which proceeds by first finding the parameter set for which the model hasthe best agreement with the data and then calculating R ( t ) by using these parameters and themodel-predicted time course of S ( t ). Simulation of the model also requires knowledge of theinitial values, S and I , which must also be estimated.Although the model is framed in terms of the prevalence of infection I ( t ), the time-seriesdata provides information on the weekly incidence of infection, which, in terms of the model, isgiven by the integral of the rate at which new infections arise over the week: (cid:82) βS ( t ) I ( t ) /N dt .We notice that the parameters β and N only appear (both in the model and in the expressionfor incidence) as the ratio β/N , precluding their separate estimation. Consequently we needonly estimate the value of this ratio, which we call ˜ β = β/N .We employ inverse problem methodology to obtain estimates of the vector θ = ( S , I , ˜ β, γ ) ∈ R p = R by minimizing the difference between the model predictions and the observed data,according to two related but distinct least squares criteria, ordinary least squares (OLS) andgeneralized least squares (GLS). In what follows, we refer to θ as the parameter vector, orsimply the parameter, in the inverse problem, even though some of its components are initialconditions, rather than parameters, of the underlying dynamic model. The least squares estimation methodology is based on the mathematical model as well as a statistical model for the observation process (referred to as the case counting process). It isassumed that our known model, together with a particular choice of parameters— the “true”parameter vector, written as θ —exactly describes the epidemic process, but that the n obser-vations, Y j , are affected by random deviations (e.g., measurement errors) from this underlyingprocess. More precisely, it is assumed that Y j = z ( t j ; θ ) + (cid:15) j for j = 1 , . . . , n (5) here z ( t j ; θ ) denotes the weekly incidence given by the model under the true parameter, θ ,and is defined by the following integral: z ( t j ; θ ) = (cid:90) t j t j − ˜ βS ( t ; θ ) I ( t ; θ ) dt. (6)Here, t denotes the time at which the epidemic observation process started and the weeklyobservation time points are written as t < . . . < t n .The errors, (cid:15) j , are assumed to be independent and identically distributed ( i.i.d. ) randomvariables with zero mean ( E [ (cid:15) j ] = 0), representing measurement error as well as other phe-nomena that cause the observations to deviate from the model predictions z ( t j ; θ ). The i.i.d. assumption means that the errors are uncorrelated across time and that each has identical vari-ance, which we write as var ( (cid:15) j ) = σ . It is assumed that σ is finite. We make no furtherassumptions about the distribution of the errors: in particular, we do not assume that they arenormally distributed. It is immediately clear that we have E [ Y j ] = z ( t j ; θ ) and var ( Y j ) = σ :in particular, this variance is longitudinally constant (i.e, across the time points).For a given set of observations Y = ( Y , . . . , Y n ), we define the estimator θ OLS as follows: θ OLS ( Y ) = θ nOLS ( Y ) = arg min θ ∈ Θ n (cid:88) j =1 [ Y j − z ( t j ; θ )] . (7)Here Θ represents the feasible region for the parameter values. (We discuss this region in moredetail later.) This estimator is a random variable (note that (cid:15) j = Y j − z ( t j ; θ ) is a randomvariable) that involves minimizing the distance between the data and the model prediction. Wenote that all of the observations are treated as having equal importance in the OLS formulation.If { y j } nj =1 is a realization of the case counting (random) process { Y j } nj =1 , we define the costfunction by J ( θ ) = n (cid:88) j =1 [ y j − z ( t j ; θ )] (8)and observe that the solution of ˆ θ OLS = ˆ θ nOLS = arg min θ ∈ Θ J ( θ ) (9)provides a realization of the random variable θ OLS .The optimization problem in Equation (9) can, in principle, be solved by a wide variety of al-gorithms. The results discussed in this paper were obtained by using a direct search method, theNelder-Mead simplex algorithm, as discussed by [22], employing the implementation provided y the MATLAB (The Mathworks, Inc.) routine fminsearch .Because var ( (cid:15) j ) = E ( (cid:15) j ) = σ , the true variance satisfies σ = 1 n E n (cid:88) j =1 [ Y j − z ( t j ; θ )] . (10)Because we do not know θ , we base our estimate of the error variance on an equation of thisform, but instead of using θ we use the estimated parameter vector, ˆ θ OLS . The right side ofEquation (10) is then equal to J (ˆ θ OLS ) /n . This estimate, however, is biased and so instead thefollowing bias-adjusted estimate is usedˆ σ OLS = 1 n − J (ˆ θ OLS ) . (11)Here the n − p = 4 parameters have been estimated from the data.Even though the distribution of the errors is not specified, asymptotic theory can be usedto describe the distribution of the random variable θ OLS [3, 27]. Provided that a number ofregularity conditions as well as sampling conditions are met (see [27] for details), it can beshown that, asymptotically (i.e., as n → ∞ ), θ OLS is distributed according to the followingmultivariate normal distribution: θ OLS = θ nOLS ∼ N ( θ , Σ n ) , (12)where Σ n = n − σ Ω − and Ω = lim n →∞ n χ ( θ , n ) T χ ( θ , n ) . (13)We remark that the theory requires that this limit exists and that the matrix Ω be non-singular.The matrix Σ n is the 4 × cov (( θ OLS ) i , ( θ OLS ) j ), andthe n × χ ( θ , n ) is the sensitivity matrix of the system, as defined and discussed below.In general, θ , σ , and Σ n are unknown, so these quantities are approximated by the estimatesˆ θ OLS and ˆ σ OLS , and the following matrixΣ n ≈ ˆΣ nOLS = ˆ σ OLS (cid:104) χ (ˆ θ OLS , n ) T χ (ˆ θ OLS , n ) (cid:105) − . (14)Consequently, for large n , we have approximately that θ OLS = θ nOLS ∼ N (cid:18) ˆ θ OLS , ˆ σ OLS (cid:104) χ (ˆ θ OLS , n ) T χ (ˆ θ OLS , n ) (cid:105) − (cid:19) . (15) e obtain the standard error for the i -th element of ˆ θ OLS by calculating (cid:114)(cid:16) ˆΣ nOLS (cid:17) ii .The n × χ ( θ, n ) that appear in the above formulae are called sensitivity matrices and are defined by χ ji ( θ, n ) = ∂z ( t j ; θ ) ∂θ i , ≤ j ≤ n, ≤ i ≤ . (16)The sensitivity matrix denotes the variation of the model output with respect to the parameter,and, for this model-based dynamical system, can be obtained using standard theory [2, 12, 16,19, 21, 26].The entries of the j -th row of χ ( θ, n ) denote how the weekly incidence at time t j changes inresponse to changes in the parameter (i.e., in either S , I , ˜ β , or γ ) and these can be calculatedby ∂z∂S ( t j ; θ ) = ˜ β (cid:90) t j t j − (cid:20) I ( t ; θ ) ∂S∂S ( t ; θ ) + S ( t ; θ ) ∂I∂S ( t ; θ ) (cid:21) dt (17) ∂z∂I ( t j ; θ ) = ˜ β (cid:90) t j t j − (cid:20) I ( t ; θ ) ∂S∂I ( t ; θ ) + S ( t ; θ ) ∂I∂I ( t ; θ ) (cid:21) dt (18) ∂z∂ ˜ β ( t j ; θ ) = (cid:90) t j t j − (cid:20) S ( t ; θ ) I ( t ; θ ) + ˜ β (cid:18) I ( t ; θ ) ∂S∂ ˜ β ( t ; θ ) + S ( t ; θ ) ∂I∂ ˜ β ( t ; θ ) (cid:19)(cid:21) dt (19) ∂z∂γ ( t j ; θ ) = ˜ β (cid:90) t j t j − (cid:20) I ( t ; θ ) ∂S∂γ ( t ; θ ) + S ( t ; θ ) ∂I∂γ ( t ; θ ) (cid:21) dt. (20)We see that these expressions involve the partial derivatives of the state variables, S ( t ; θ ) and I ( t ; θ ), with respect to the parameters. Analytic forms of the sensitivities are not availablebecause the state variables are the solutions of a nonlinear system; instead, they are calculatednumerically.In order to outline how these numerical sensitivities may be found, we introduce the notation x ( t ; θ ) = ( S ( t ; θ ) , I ( t ; θ )) and denote by g = ( g , g ) the vector function whose entries are givenby the expressions on the right sides of Equations (1) and (2). Then we can write the single-outbreak SIR model in the general vector form dxdt ( t ; θ ) = g ( x ( t ; θ ); θ ) (21) x (0; θ ) = ( θ , θ ) . (22)Because the function g is differentiable (in both t and θ ), taking the partial derivatives ∂/∂θ ofboth sides of Equation (21) we obtain the differential equation ddt ∂x∂θ = ∂g∂x ∂x∂θ + ∂g∂θ . (23) ere ∂g/∂x is a 2-by-2 matrix, ∂g/∂θ is a 2-by-4 matrix, and ∂x/∂θ is the 2-by-4 matrix ∂x∂θ = ∂S∂S ∂S∂I ∂S∂ ˜ β ∂S∂γ∂I∂S ∂I∂I ∂I∂ ˜ β ∂I∂γ . (24)Numerical values of the sensitivities are calculated by solving (21) and (23) simultaneously.We define φ ( t ) = ∂x∂θ ( t ; θ ), let the parameter be evaluated at the estimate, θ = ˆ θ , and solve thefollowing differential equations from t = 0 to t = t n ddt x ( t ) = g ( x ( t ; ˆ θ ); ˆ θ ) (25) ddt φ ( t ) = ∂g∂x φ ( t ) + ∂g∂θ (26) x (0) = (ˆ θ , ˆ θ ) (27) φ (0) = . (28) The errors in the statistical model defined by Equation (5) were assumed to have constant vari-ance, which may not be an appropriate assumption for all data sets. One alternative statisticalmodel that can account for more complex error structure in the case counting process is thefollowing Y j = z ( t j ; θ ) (1 + (cid:15) j ) . (29)As before, it is assumed the (cid:15) j are i.i.d. random variables with E ( (cid:15) j ) = 0 and var ( (cid:15) j ) = σ < ∞ ,but no further assumptions are made. Under these assumptions, the observation mean is againequal to the model prediction, E [ Y j ] = z ( t j ; θ ), while the variance in the observations is afunction of the time point, with var ( Y j ) = σ z ( t j ; θ ). In particular, this variance is non-constant and model-dependent. One situation in which this error structure may be appropriateis when observation errors scale with the size of the measurement (so-called relative noise ).Given a set of observations Y = ( Y , . . . , Y n ), the estimator θ GLS = θ GLS ( Y ) is defined asthe solution of the normal equations n (cid:88) j =1 w j [ Y j − z ( t j ; θ )] ∇ θ z ( t j ; θ ) = 0 , (30)where the w j are a set of non-negative weights [13]. Unlike the ordinary least squares for- ulation, this definition assigns different levels of influence, described by the weights, to thedifferent observations in the cost function. For the error structure described above in Equation(29), the weights are taken to be inversely proportional to the square of the predicted incidence: w j = 1 / [ z ( t j ; θ )] . We shall also consider weights taken to be proportional to the reciprocal ofthe predicted incidence; these correspond to assuming that the variance in the observations isproportional to the value of the model (as opposed to its square).Suppose { y j } nj =1 is a realization of the case counting process { Y j } nj =1 and define the function L ( θ ) as L ( θ ) = n (cid:88) j =1 w j [ y j − z ( t j ; θ )] (31)The quantity θ GLS is a random variable and a realization of it, denoted by ˆ θ GLS , is obtainedby solving n (cid:88) j =1 w j [ y j − z ( t j ; θ )] ∇ θ z ( t j ; θ ) = 0 , (32)In the limit as n → ∞ , the GLS estimator θ GLS has the following asymptotic properties [3, 13]: θ GLS = θ nGLS ∼ N ( θ , Σ n ) (33)where Σ n ≈ σ (cid:2) χ ( θ , n ) T W ( θ ) χ ( θ , n ) (cid:3) − . (34)Here W ( θ ) = diag ( w ( θ ) , . . . , w n ( θ )) with w j ( θ ) = 1 / [ z ( t j ; θ )] . The sensitivity matrix χ ( θ , n ) is as defined in Section 4.1.Because θ and σ are unknown, the estimate ˆ θ GLS is used to calculate approximations of σ and the covariance matrix Σ n by σ ≈ ˆ σ GLS = 1 n − L (ˆ θ GLS ) (35)Σ n ≈ ˆΣ nGLS = ˆ σ GLS (cid:104) χ (ˆ θ GLS , n ) T W (ˆ θ GLS ) χ (ˆ θ GLS , n ) (cid:105) − . (36)As before, the standard errors for ˆ θ GLS can be approximated by taking the square roots of thediagonal elements of the covariance matrix ˆΣ nGLS .The cost function used in GLS estimation involves weights whose values depend on thevalues of the fitted model. These values are not known before carrying out the estimationprocedure and consequently GLS estimation is implemented as an iterative process. An OLSis first performed on the data, and the resulting model values provide an initial set of weights. weighted least squares fit is then performed using these weights, obtaining updated modelvalues and hence an updated set of weights. The weighted least squares process is repeated untilsome convergence criterion is satisfied, such as successive values of the estimates being deemedto be sufficiently close to each other. The process can be summarized as follows1. Estimate ˆ θ GLS by ˆ θ (0) using the OLS Equation (9). Set k = 0;2. Form the weights ˆ w j = 1 / [ z ( t j ; ˆ θ ( k ) )] ;3. Define L ( θ ) = (cid:80) nj =1 ˆ w j [ y j − z ( t j ; θ )] . Re-estimate ˆ θ GLS by solvingˆ θ ( k +1) = arg min θ ∈ Θ L ( θ )to obtain the k + 1 estimate ˆ θ ( k +1) for ˆ θ GLS ;4. Set k = k + 1 and return to 2. Terminate the procedure when successive estimates forˆ θ GLS are sufficiently close to each other.The convergence of this procedure is discussed in [7, 13].
Let the pair (ˆ θ, ˆΣ) denote the parameter estimate and covariance matrix obtained with eitherthe OLS or GLS methodology from a given realization { y j } nj =1 of the case counting process.Simulation of the SIR model then allows the time course of the susceptible population, S ( t ; ˆ θ ),to be generated. The time course of the effective reproductive number can then be calculatedas R ( t ; ˆ θ ) = S ( t ; ˆ θ )ˆ˜ β/ ˆ γ . This trajectory is our central estimate of R ( t ).The uncertainty in the resulting estimate of R ( t ) can be assessed by repeated sampling ofparameter vectors from the corresponding sampling distribution obtained from the asymptotictheory, and applying the above methodology to calculate the R ( t ) trajectory that results eachtime. To generate m such sample trajectories, we sample m parameter vectors, θ ( k ) , from the4-multivariate normal distribution N (ˆ θ, ˆΣ). We require that each θ ( k ) lie within our feasibleregion, Θ. If this is not the case, then we resample until θ ( k ) ∈ Θ. Numerical solution of the SIRmodel using θ ( k ) allows the sample trajectory R ( t ; θ ( k ) ) to be calculated. Below, we summarizethese steps involved in the construction of the sampling distribution of the effective reproductivenumber:1. Set k = 1; . Obtain the k -th parameter sample from the 4-multivariate normal distribution: θ ( k ) ∼ N (ˆ θ, ˆΣ);3. If θ ( k ) / ∈ Θ (constraints are not satisfied) return to 2. Otherwise go to 4;4. Using θ = θ ( k ) find numerical solutions, denoted by (cid:0) S ( t ; θ ( k ) ) , I ( t ; θ ( k ) ) (cid:1) , to the nonlinearsystem defined by Equations (1) and (2). Construct the effective reproductive number asfollows: R ( t ; θ ( k ) ) = S ( t ; θ ( k ) ) ˜ β ( k ) γ ( k ) where θ ( k ) = (cid:16) S ( k )0 , I ( k )0 , ˜ β ( k ) , γ ( k ) (cid:17) ;5. Set k = k + 1. If k > m then terminate, otherwise return to 2.Uncertainty estimates for R ( t ) are calculated by finding appropriate percentiles of the dis-tribution of the R ( t ) samples. We illustrate the OLS methodology and investigate its performance using synthetic data. Atrue parameter θ is chosen and a set of synthetic data is constructed by adding random noiseto the model prediction of incidence (for every time point t j ) in the following manner: Y j = z ( t j ; θ ) + cU j . (37)Here, U j is a standard normal random variable ( U j ∼ N (0 , c is the productof a pre-selected value, α , and the minimum value of the simulated incidence: c = α (cid:20) min ≤ i ≤ n z ( t i ; θ ) (cid:21) . (38)The multiplier α allows us to control the variance of the noise, while the use of the minimumincidence is an attempt to reduce the occurrence of negative values in the synthetic data set. Itis clear from Equation (37) that the noise added to the synthetic data has constant variance,given by var ( cU j ) = c . A realization of the case counting process is denoted by { y j } nj =1 with y j = z ( t j ; θ ) + cu j , where all the u j ’s are independently drawn from a standard normal istribution.The optimization routine requires an initial estimated value of the parameter; this is takento be θ = (1 + a ) θ , where a also denotes a pre-selected multiplier. Selecting different valuesfor α and a allows us to investigate the performance of the estimation process in the face ofdifferent levels of noise and differing levels of information as to the approximate location of thebest fitting model parameter (i.e., the “true” parameter).A synthetic data set with n = 1 ,
000 observations was constructed by setting α = 0 .
50. Theinitial guess was set using a = 0 .
25. Then m = 10 ,
000 sample trajectories of R ( t ) were generatedusing the procedure discussed above. The resulting estimates of the parameters and effectivereproductive numbers, together with measures of uncertainty, are given in Table 2. Also listedare the initial parameter guesses given to the optimization routine and the minimized value ofthe cost function, J (ˆ θ OLS ). Figure 2(a) depicts the synthetic data (squares), together with thebest fitting model (solid curve). We remark that the observation noise, which is on the order of α = 0 .
50 times the smallest incidence value, represents a very small error over the major partof the synthetic data set. As such, it is almost impossible to distinguish between the data andfitted model in this figure.The trajectories of the effective reproductive number are shown as grey solid curves in Panels(a) and (b) of Figure 2, in which the trajectory R ( t ; ˆ θ OLS ) appears as a solid black curve. Again,the small errors make it difficult to distinguish between the central trajectory and the ensembleof R ( t ) trajectories.Figure 3 contains box plots of the R ( t ) samples at two fixed times: (a) t = 2 .
01, and (b) t = 11 .
2. We use the 2.5 and 97.5 percentiles of the R ( t ) sample distribution at time t toquantify uncertainty in the central estimate of R ( t ). At the bottom of Table 2 the estimatesof the effective reproductive number are summarized by showing the minimum and maximum(over time) of the central estimate of R ( t ) together with the uncertainty bounds obtained atthese two time points.A simple residuals analysis is illustrated in Figure 4. A residual at time t j is definedas y j − z ( t j ; ˆ θ OLS ). In Figure 4(a), these residuals are plotted against the predicted values, z ( t j ; ˆ θ OLS ). Figure 4(b) displays a plot of the residuals against time. No patterns or trendsare seen in these residual plots (for example, the magnitudes of the residuals show no trendsin either plot and the residuals do not exhibit any temporal patterns or correlations). This isto be expected because the u j are realizations of independent (uncorrelated) and identicallydistributed (standard normal) random variables. Time N u m b e r o f ca s e s Observed(synthetic data with noise)Predicted(best fit solution)
Time R ( t ) s a m p l e s Time R ( t ) s a m p l e s (a)(b) Figure 2: Results obtained by applying the OLS methodology to synthetic data with n = 1 , ,
000 of the m = 10 ,
000 effective reproductivenumber curves (solid gray) constructed using parameters drawn from the 4-multivariate normaldistribution N (ˆ θ OLS , ˆΣ nOLS ). The curve R ( t ; ˆ θ OLS ) is shown in solid black. The inset depicts aclose-up view of the curves for t in a small interval about t = 6 . .443.453.463.473.483.493.53.51 R ( t ) s a m p l e s R ( t ) s a m p l e s (a) (b) Figure 3: Variability in the samples of R ( t ) for two fixed values of t : (a) t=2 . t = 11 . m = 10 , R ( t ) samples(lower edge, middle and upper edge, respectively, of the solid box), together with the 2.5 and 97.5percentiles (lower and upper whiskers). Samples in the lower and upper 2.5 percentiles are shownas crosses. Arrows depict the locations of the corresponding central estimates R ( t ; ˆ θ OLS ).17able 2: Estimates obtained using synthetic data with constant variance noise ( α = 0 .
50, seetext for further details). The number of observations is n = 1 , R ( t ) sample sizeis m = 10 , θ = 1 . θ ,where θ denotes the true parameter. The true value, initial guess, estimate, and standard error,are given for all parameters, along with the value of the cost function evaluated at the parameterestimate. The minimum (Min.) and maximum (Max.) of the central estimate of the effectivereproductive number ( R ( t ; ˆ θ OLS )) are given with the accompanying 2.5 and 97.5 percentiles (insquare brackets).Parameter True value Initial guess Estimate Standard error S × × × × I × × × × − ˜ β × − × − × − . × − γ × − × − × − × − J (ˆ θ OLS ) = 1 . × σ = 1 . × ˆ σ OLS = 1 . × Min. R ( t ; ˆ θ OLS ) 0.138 [0.137,0.140]Max. R ( t ; ˆ θ OLS ) 3.494 [3.478,3.509] R e s i du a l R e s i du a l (a) (b) Figure 4: Analysis of the residuals from the OLS estimation applied to the synthetic data. Panel(a) depicts the residuals y j − z ( t j ; ˆ θ OLS ) versus the model values z ( t j ; ˆ θ OLS ) for j = 1 , . . . , n . Panel(b) displays the residuals versus time t j for j = 1 , . . . , n .18 n this example, the OLS methodology performs well, yielding excellent estimates of thetrue parameter value. This should not be surprising because the noise level was chosen to beextremely small and we provided the optimization routine with an initial parameter value thatwas close to the true value.The application of the OLS methodology does not always go so smoothly as in the previousexample: parameter estimates can be obtained that are far from the true values. The costfunction, J ( θ ), typically possesses multiple minima and the simple-minded use of fminsearch can yield a parameter estimate located at one of the other (local) minima, particularly whenthe initial parameter estimate is some distance away from the true value.Table 3 presents estimates for the same synthetic data set, but for which the initial parameterestimate was taken to be one hundred and seventy five percent ( θ = 2 . θ ) away from the trueparameter value. This results in poor estimates of the parameters: the values of S , ˜ β and γ areoverestimated by 317, 42, and 1730 percent, respectively, while the value of I is underestimatedby 78 percent. Worryingly, the values of the standard errors give no warning that the parameterestimates are quite so poor: the largest standard error, relative to the parameter estimate, isobtained for γ and equals 14%, while these figures fall to 12%, 10% and less than 1% for theremaining parameters.The true maximum value of R ( t ) is 3.500, yet the effective reproductive number has anestimated upper bound of 1.131; clearly the misleading estimates of S and γ cause R ( t ) to beunderestimated. If we did not know the true value of the effective reproductive number, wewould be unlikely to anticipate this underestimation, because the estimate and percentiles inthis case, [1 . , . R ( t ) samples are no longer normally distributed about the central estimate(Figure 6).Because we know the true parameter value and the outcome of a successful model fit to thisdata set, it was easy for us to identify the problems that arose here. We note, for instance,that the value of the cost function for the estimated parameter value is two orders of magnitudelarger than in the previous case, quantifying that the model fit is much worse. We would nothave the luxury of these pieces of information if this estimation arose in the consideration of areal-world data set. The residuals plots, however, clearly suggest that there are serious problemswith the model fit. In particular, there are obvious temporal trends in the residuals, indicatingsystematic deviations between the fitted model and data. Even though the observation noise is n = 1 ,
000 observations) with constant variancenoise ( α = 0 .
50) using θ = 2 . θ as the initial guess in the optimization algorithm. The samplesize for R ( t ) is m = 10 , S × × × × I × × × × ˜ β × − × − × − × − γ × − × × × J (ˆ θ OLS ) = 5 . × σ = 1 . × ˆ σ OLS = 5 . × Min. R ( t ; ˆ θ OLS ) 0.879 [0.837,0.904]Max. R ( t ; ˆ θ OLS ) 1.131 [1.102,1.182] small, it is just possible to see these deviations in Figure 5(a), but they are considerably easierto spot in the residuals plots of Figures 5(b) and (c). Time N u m b e r o f ca s e s Observed(synthetic data with noise)Predicted(best fit solution) (a) R e s i du a l R e s i du a l (b) (c) Figure 5: Results obtained using θ = 2 . θ as the initial guess in the optimization algorithm,applying the OLS methodology to synthetic data with n = 1 ,
000 and α = 0 .
50. Panel (a) displaysthe model prediction (solid curve), and the observations (solid squares), respectively. Panel (b)displays the residuals against the model values and panel (c) displays the residuals versus time.21
Time R ( t ) s a m p l e s Figure 6: One thousand of the m = 10 ,
000 effective reproductive number curves (solid gray)constructed using parameters drawn from the 4-multivariate normal distribution N (ˆ θ OLS , ˆΣ nOLS ).The curve R ( t ; ˆ θ OLS ) is shown in solid black. The curve of the median value of the R ( t ) samples,at each t , is also shown as a dashed black curve, but is indistinguishable from the curve R ( t ; ˆ θ OLS ).22 .2 Synthetic Data with Non-constant Variance Noise
We generated a second synthetic data set with non-constant variance noise. The true value θ was fixed, and was used to calculate the numerical solution z ( t j ; θ ). Observations werecomputed in the following fashion: Y j = z ( t j ; θ ) (1 + αV j ) , (39)where the V j are independent random variables with standard normal distribution, and 0 < α < var ( Y j ) = [ z ( t j ; θ ) α ] which is non-constant acrossthe time points t j . If the terms { v j } nj =1 denote a realization of { V j } nj =1 , then a realization ofthe observation process is denoted by y j = z ( t j ; θ )(1 + αv j ).An n = 1 ,
000 point synthetic data set was constructed with α = 0 . θ = 1 . θ . The weights in the normal equationsdefined by Equation (30), were chosen as w j = 1 /z ( t j ; θ ) .Table 4 lists estimates of the parameters and R ( t ), together with uncertainty estimates. Inthe case of R ( t ), uncertainty was assessed based on the simulation approach using m = 10 , N (ˆ θ GLS , ˆΣ nGLS ). Figure 7(a) depicts both dataand fitted model points, z ( t j ; ˆ θ GLS ), plotted versus t j . Figure 7(b) depicts 1 ,
000 of the 10 , R ( t ) curves. Table 4: Estimates from a synthetic data of size n = 1 , α = 0 . R ( t ) sample size is m = 10 , θ = 1 . θ . Each weight in the cost function L ( θ ) (see Equation (31)) was equal to 1 /z ( t j ; θ ) for j = 1 , . . . , n .Parameter True value Initial guess Estimate Standard error S × × × × I × × × × ˜ β × − × − × − × − γ × − × − × − × − L (ˆ θ GLS ) = 5 . × σ = 5 . × − ˆ σ GLS = 5 . × − Min. R ( t ; ˆ θ GLS ) 0.132 [0.120,0.146]Max. R ( t ; ˆ θ GLS ) 3.576 [3.420,3.753]23 esiduals plots are displayed in Figures 7(c) and (d). Because αv j = ( y j − z ( t j ; θ )) /z ( t j ; θ ),by construction of the synthetic data, the residuals analysis focuses on the ratios y j − z ( t j ; ˆ θ GLS ) z ( t j ; ˆ θ GLS )which in the labels of Figures 7(c) and (d) are referred to as “Modified residuals”. In Figure7(c) these ratios are plotted against z ( t j ; ˆ θ GLS ), while Panel (d) displays them versus the timepoints t j . The lack of any discernable patterns or trends in Figure 7(c) and (d) confirms thatthe errors in the synthetic data set conform to the assumptions made in the formulation of thestatistical model of Equation (39). In particular, the errors are uncorrelated and have variancethat scales according to the relationship stated above. Time N u m b e r o f ca s e s Observed(synthetic data with noise)Predicted(best fit solution)
Time R ( t ) s a m p l e s (a)(b) M od i f i e d r e s i du a l M od i f i e d r e s i du a l (c) (d) Figure 7: Results from applying the GLS methodology to synthetic data with non-constant variancenoise ( α = 0 . n = 1 ,
000 observations. The initial guess for the optimization routine was θ = 1 . θ . The weights in the cost function were equal to 1 /z ( t j ; θ ) , for j = 1 , . . . , n . Panel (a)depicts the observed and fitted values and panel (b) displays 1 ,
000 of the m = 10 , R ( t ) sampletrajectories. Residuals plots are presented in panels (c) and (d): modified residuals versus fittedvalues in (c) and modified residuals versus time in (d).25 Analysis of Influenza Outbreak Data
The OLS and GLS methodologies were applied to longitudinal observations of six influenzaoutbreaks (see Section 2), giving estimates of the parameters and the reproductive number foreach season. The number of observations n varies from season to season. The R ( t ) sample sizewas m = 10 ,
000 in each case. The set of admissible parameters Θ is defined by the lower andupper bounds listed in Table 5 along with the inequality constraint S ˜ β/γ >
1. The boundsin Table 5 were obtained or based on [8, 23, 25] and references therein. For brevity, we onlypresent here the results obtained using data from the 1989-1999 season.
Table 5: Lower and upper bounds on the initial conditions and parameters.Suitable Range Unit1.00 × < S < × people0.00 < I < × people7.00 × − < ˜ β < × − weeks − people − / < /γ < In most cases, visual comparison of the trajectory of the best fitting model obtained using OLSand the data points suggests that a good fit has been achieved (Figure 8(a)). The statisticsthat quantify the uncertainty in the estimated values of the parameters, however, indicate thatthis may not always be the case. In many cases, the standard errors are of the same order ofmagnitude as the parameter values themselves, indicating wide error bounds (see Table 6) andsuggesting a lack of confidence in the estimates.We should, however, interpret the statistical results with some caution because the residualsplots (Figure 8(c) and (d)) show clear patterns, indicating that the assumptions of the statisticalmodel may have been violated. For instance, the variance of the residuals appears to increasewith the predicted value. There are definite patterns visible in the residuals versus time plot.The temporal correlation of the errors could represent some inadequacy in the way that thedynamic model describes the epidemic process, or an inadequacy in the data set itself.Another indication of problems in the estimation process comes from the matrix χ (ˆ θ OLS , n ) T χ (ˆ θ OLS , n ). The condition number of this matrix is 9 . × , indicating that S × × I × − × − ˜ β × − × − γ × × J (ˆ θ OLS ) = 6 . × ˆ σ OLS = 2 . × Min. R ( t ; ˆ θ OLS ) 0.752 [0.752,0.824]Max. R ( t ; ˆ θ OLS ) 1.299 [1.202,1.300] the matrix is close to singular. Calculation of the covariance matrix requires the inversion ofthis matrix, and, as mentioned above, the asymptotic theory requires that the matrix Ω , de-fined as a limit of matrix products of this form, has a non singular limit as n → ∞ . A nearlysingular matrix can arise when there is redundancy in the data or when there are problems withparameter identifiability [3, 4].It is interesting to observe that the estimated values of γ frequently fall on the boundaryof the feasible region. This may impact the uncertainty analysis, given that the conditions ofthe asymptotic theory require that the true parameter value lies in the interior of the feasibleregion. If our estimates commonly fall on the boundary, this could be an indication that thetrue parameter value may not lie within our feasible region. We return to this issue below, inSection 6.4. Time N u m b e r o f ca s e s Time R ( t ) s a m p l e s (a)(b) R e s i du a l R e s i du a l (c) (d) Figure 8: OLS model fits to influenza data from season 1998-1999. Panel (a) depicts the observa-tions (solid squares) and the model prediction (solid curve), respectively. In Panel (b) the samplesof the effective reproductive number R ( t ) are displayed (grey curves) together with the centralestimate R ( t ; ˆ θ OLS ) (solid black curve). The dashed black curve depicts the median, at each timepoint, of the distribution of the R ( t ) samples. Panel (c) contains the residuals plotted versus themodel prediction. In Panel (d) residuals are plotted against time.28 .2 GLS Estimation Visual inspection suggests that the model fits obtained using the GLS approach (Figure 9) areeven worse than those obtained using OLS. This is somewhat misleading, however, becausethe weights, defined as w j = 1 / [ z ( t j ; θ )] , mean that the GLS fitting procedure (unlike visualinspection of the figures) places increased emphasis on datapoints whose model value is smalland decreased emphasis on datapoints where the model value is large. If these graphs are,instead, plotted with a logarithmic scale on the vertical axis, an accurate visualization is obtained(Figure 10): multiplicative observation noise on a linear scale becomes constant variance additiveobservation noise on a logarithmic scale.As before, however, the parameter estimates have standard errors that are often of thesame order of magnitude as the estimates themselves (Table 7). The residuals plots revealclear patterns and trends (Figure 9(c) and (d)). Temporal trends in the residuals (and visualinspection of the plots depicting the best fitting model and the datapoints) indicate that thereare systematic differences between the fitted model and the data. For instance, it appearsthat the fitted model peaks slightly earlier than the observed outbreak, and, as a result, thereare numbers of sequential points where the data lies above or below the model. The modifiedresiduals versus model plot suggests that the variation of the residuals may be decreasing as themodel value increases.The condition number of the matrix χ (ˆ θ GLS , n ) T W (ˆ θ GLS ) χ (ˆ θ GLS , n ) is 9 . × . Thisis very similar to that for the OLS estimation, again suggesting caution in interpreting thestandard errors.The modified residuals versus model plot indicates that the 1 /z ( t j ; ˆ θ GLS ) weights may haveovercompensated for the non-constant variance seen in the OLS residuals. This suggests thatit may be appropriate to use weights that vary in a milder fashion, such as 1 /z ( t j ; ˆ θ GLS ).The model fits that result with these new weights appear, by visual inspection, to provide amore satisfactory fit to the data (Figure 11). Standard errors for the parameter estimates arestill large, however (Table 8). Our earlier comment concerning the difficulty of assessing theadequacy of GLS model fits by visual inspection should be borne in mind— see Figure 12 for amore accurate depiction in which square roots of the quantities are plotted so as to transform theerrors in which variance scales with the model value into additive errors with constant variance.The new modified residuals plots, which now focus on the quantities y j − z ( t j ; ˆ θ GLS ) z ( t j ; ˆ θ GLS ) / , /z ( t j ; θ ) . Parameter Estimate Standard error S × × I × − × − ˜ β × − × − γ × × L (ˆ θ GLS ) =1.754 × ˆ σ GLS = 6 . × − Min. R ( t ; ˆ θ GLS ) 0.843 [0.784,1.018]Max. R ( t ; ˆ θ GLS ) 1.177 [1.052,1.252]Table 8: Results of GLS estimation applied to the 1998-1999 season influenza data. Weights takento equal 1 /z ( t j ; θ ). Parameter Estimate Standard error S × × I × − × − ˜ β × − × − γ × × L (ˆ θ GLS ) =2.335 × ˆ σ GLS = 8 . × Min. R ( t ; ˆ θ GLS ) 0.810 [0.754,0.828]Max. R ( t ; ˆ θ GLS ) 1.218 [1.200,1.297] also appear to exhibit less marked patterns than they did for either OLS or GLS with the 1 /z weights (Figures 11(c) and (d)). The condition number for the matrix χ (ˆ θ GLS , n ) T W (ˆ θ GLS ) χ (ˆ θ GLS , n )is 1 . × , again suggesting ill-posedness and that the standard errors should be interpretedwith caution.A handful of surprisingly large modified residuals are seen on many occasions, although theseoften do not appear on our plots because we choose the range on the residuals axis so that themajority of the points can be seen most clearly. The locations of these residuals is noted onthe figure caption; we see that they occur during the initial part of the time series, when thenumbers of cases are low. Time N u m b e r o f ca s e s Time R ( t ) s a m p l e s (a)(b) M od i f i e d r e s i du a l M od i f i e d r e s i du a l (c) (d) Figure 9: GLS applied to influenza data from season 1998-1999. The weights were taken equalto 1 /z ( t j ; θ ) . Panel (a) depicts the observations (solid squares) as well as the model prediction(solid curve). In Panel (b) 1 ,
000 of the m = 10 ,
000 samples of the effective reproductive number R ( t ) are displayed. The solid curve depicts the central estimate R ( t ; ˆ θ GLS ) and the dashed curvethe median of the R ( t ) samples at each point in time. Panel (c) exhibits the modified residuals( y j − z ( t j ; ˆ θ GLS )) /z ( t j ; ˆ θ GLS ) plotted versus the model predictions, z ( t j ; ˆ θ GLS ). Panel (d) displaysthe modified residuals plotted against time. 31
Time N u m b e r o f ca s e s Figure 10: Best fitting model for the 1998-1999 season, obtained using GLS with 1 /z ( t j ; θ ) weights.Observations (solid squares) and the model prediction (solid curve) are plotted on a logarithmicscale. 32 Time N u m b e r o f ca s e s Time R ( t ) s a m p l e s (a)(b) M od i f i e d r e s i du a l M od i f i e d r e s i du a l (c) (d) Figure 11: GLS estimation for influenza data from season 1998-1999. Weights equal 1 /z ( t j ; θ ).Panel (a) depicts the observations (solid squares) as well as the model prediction (solid curve).In Panel (b) 1 ,
000 of the m = 10 ,
000 samples of the effective reproductive number R ( t ) aredisplayed, together with the central estimate R ( t ; ˆ θ GLS (solid curve) and, at each time point,the median of the R ( t ) samples (dashed curve). Panel (c) presents the modified residuals ( y j − z ( t j ; ˆ θ GLS )) /z ( t j ; ˆ θ GLS ) / versus the model predictions. Panel (d) displays the modified residualsplotted against time. Three modified residuals fall outside the range shown on this graph: theirvalues (and the timepoints at which they arise) are -4.91 (at t = 1), -7.72 (at t = 2), and -9.50 (at t = 5). 33 Time ( N u m b e r o f ca s e s ) / Figure 12: Best fitting model for the 1998-1999 season, obtained using GLS with 1 /z ( t j ; θ ) weights.Square roots of observations (solid squares) and square roots of the model prediction (solid curve)are plotted. 34 .3 GLS Estimation Using Truncated Data Sets It is quite plausible that our description of the error structure of the data is inadequate whenthe numbers of cases are at low levels. For instance, the reporting process might change as theoutbreak starts to take hold (e.g., doctors become more alert to possible flu cases) or comes closeto ending. Also, our model is deterministic whereas a real-world epidemic contains stochasticity.Stochastic effects may exhibit a relatively large impact at the start or end of an epidemic, whenthe numbers of cases are low. It is possible for the infection to undergo extinction, a phenomenonwhich cannot be captured by the deterministic model. Spatial clustering of cases is also a distinctpossibility, particularly during the early stages of an outbreak. This will affect the time course ofan outbreak as well as the reporting process: clustering of cases may well increase the reportingnoise if cases in a cluster tend to get reported together (e.g., a cluster occurs within an areawhere many isolates are sent to the CDC) or not reported together (e.g., a cluster occurs in anarea that has poorer coverage in the reporting process).Indeed, examination of one of the influenza time series plotted on either a logarithmic orsquare root scale (Figures 10 or 12) indicates that both the start and end of the time series areproblematic. The fit of the model is clearly poorer over these parts of the time series, whichcorrespond to the times when the observed values are small.Both forms of the weights (inversely proportional to the square of the predicted incidence orinversely proportional to the predicted incidence) mean that errors at these small values haveconsiderable impact on the cost function, and hence on the GLS estimation process, althoughthis is less of a concern for the 1 /z weights.Another issue that has been raised by studies of parameter estimation in biological situationsconcerns redundancy in information measured when a system is close to its equilibrium [4]. Thismight be a relevant issue for the final part of the outbreak data as there is often a period lastingten or more weeks when there are few cases.We investigated whether the removal of the lowest valued points from the data sets wouldimprove the fitting process. We constructed truncated data sets by considering only the periodbetween the time when the number of isolates first reached ten at the start of the outbreak andfirst fell below ten at the end of the outbreak. As a notational convenience, we refer to thenumbers of susceptibles and infectives at the start of the first week of the truncated data set as S and I , even though these times no longer correspond to the start of the influenza season.(For example, in Figures 13 and 14, S and I refer to the state of the system at t = 8.)Comparing Tables 7 and 9, which arise from GLS estimation with 1 /z weights, we see /z ( t j ; θ ) , applied to truncated influenza dataset for season 1998-1999. Parameter Estimate Standard error S × × I × × ˜ β × − × − γ × × L (ˆ θ GLS ) =9.475 × − ˆ σ GLS = 5 . × − Min. R ( t ; ˆ θ GLS ) 0.808 [0.745,0.820]Max. R ( t ; ˆ θ GLS ) 1.223 [1.211,1.311] that the standard errors for the parameter estimates have decreased. This decrease occurs eventhough the number of points in the data set has fallen from 35 to 23, causing the factor 1 / ( n − χ (ˆ θ GLS , n ) T W (ˆ θ GLS ) χ (ˆ θ GLS , n ) is 2 . × .A similar result is seen in the 1 /z weights case. We remark that we no longer have theextreme outlier residuals. The condition number of the matrix χ (ˆ θ GLS , n ) T W (ˆ θ GLS ) χ (ˆ θ GLS , n )is 9 . × .Truncating the data sets has helped considerably with the GLS estimation process, althoughthe large condition numbers still are cause for caution with the standard errors.Truncation of the data set had little effect on the parameter estimates obtained using OLS(results not shown), except that the values of S and I were changed because they refer to alater initial time, as discussed above. Standard errors for the OLS estimates were higher thanfor the full data set, as should be expected given the reduced number of data points. Time N u m b e r o f ca s e s
10 15 20 25 30
Time R ( t ) s a m p l e s (a)(b) M od i f i e d r e s i du a l
10 15 20 25 30Time-0.6-0.4-0.200.20.4 M od i f i e d r e s i du a l (c) (d) Figure 13: Model fits obtained using GLS on truncated influenza data from season 1998-1999,weights equal to 1 /z ( t j ; θ ) . Panel (a) depicts the observations (solid squares) as well as the modelprediction (solid curve). Panel (b) displays 1 ,
000 of the m = 10 ,
000 samples of the effectivereproductive number, together with the central estimate R ( t ; ˆ θ GLS ) (solid curve), and the medianof the R ( t ) samples at each point in time (dashed curve). Panel (c) displays the modified residualsversus the model predictions. In Panel (d) modified residuals are plotted against time.37 Time N u m b e r o f ca s e s
10 15 20 25 30
Time R ( t ) s a m p l e s (a)(b) M od i f i e d r e s i du a l
10 15 20 25 30Time-3-2-10123 M od i f i e d r e s i du a l (c) (d) Figure 14: Model fits obtained using GLS on truncated influenza data from season 1998-1999,weights equal to 1 /z ( t j ; θ ). Panel (a) shows the observations (solid squares) as well as the modelprediction (solid curve). In Panel (b) 1 ,
000 of the m = 10 ,
000 samples of the effective reproductivenumber R ( t ) are displayed together with the central estimate R ( t ; ˆ θ GLS ) (solid curve) and themedian of the R ( t ) samples at each time point (dashed curve). Panel (c) shows the modifiedresiduals versus the model prediction. In panel (d), each modified residual is displayed versus theobservation time point. 38able 10: Estimation results from GLS, with weights 1 /z ( t j ; θ ), applied to truncated influenza dataset for season 1998-1999. Parameter Estimate Standard error S × × I × × − ˜ β × − × − γ × × L (ˆ θ GLS ) =3.872 × ˆ σ GLS = 2 . × Min. R ( t ; ˆ θ GLS ) 0.750 [0.748,0.819]Max. R ( t ; ˆ θ GLS ) 1.306 [1.212,1.308]39able 11: Estimation of three epidemiological parameters. Results obtained by applying OLS totruncated influenza data set from season 1998-1999.Parameter Estimate Standard error S × × I × × − ˜ β × − × − J (ˆ θ OLS ) =6.131 × ˆ σ OLS = 3 . × Min. R ( t ; ˆ θ OLS ) 0.754 [0.744,0.787]Max. R ( t ; ˆ θ OLS ) 1.299 [1.258,1.314]
The preceding results indicate that there are difficulties in estimating the parameter γ , aswitnessed by the number of situations in which the estimate lies on the boundary of the feasibleparameter region. Because γ is the one parameter for which we can obtain reasonably reliableestimates without the need to fit a model to an incidence time series [8, 23], we fix its value andinvestigate estimation for a reduced three parameter problem. In all of what follows, we applythe estimation methodology to the truncated data sets, as discussed in the previous section.We use a fixed infectious period of four days, i.e., 1 /γ = 4 days = 4 / θ = ( S , I , ˜ β ) using the OLS approach and the GLS approach with weights w j = 1 / [ z ( t j ; θ )] or w j = 1 / [ z ( t j ; θ )].Estimation for the reduced parameter set leads to model fits that are not so different fromthose obtained using the full ( p = 4) set of parameters in θ . For example, for the truncateddata set from the 98-99 season, with weights equal to 1 /z ( t j ; θ ), we have L (ˆ θ GLS ) = 38 .
72 forthe full parameter set (Table 10) while L (ˆ θ GLS ) = 38 .
72 for the reduced parameter set (Table13). The standard errors of the parameters, however, are smaller for the reduced parameter set:the three standard errors for the estimates of S , I and ˜ β are 3 . × , 9 . × − and1 . × − , respectively, for the full parameter set, while they are 2 . × , 3 . × − and 1 . × − for the reduced set.The condition numbers of the matrices χ (ˆ θ GLS , n ) T W (ˆ θ GLS ) χ (ˆ θ GLS , n ) are 4 . × and4 . × for the 1 /z and 1 /z weights, respectively. The increased precision of the estimateshere likely results from identifiability issues in the estimation problem for the full set of modelparameters. Time N u m b e r o f ca s e s
10 15 20 25 30
Time R ( t ) s a m p l e s (a)(b) R e s i du a l
10 15 20 25 30Time-50-2502550 R e s i du a l (c) (d) Figure 15: Estimation of three epidemiological parameters using OLS on a truncated data setfrom the 1998-1999 influenza season. Panel (a) depicts the observations (solid squares) as wellas the model prediction (solid curve). In Panel (b) 1 ,
000 of the m = 10 ,
000 samples of theeffective reproductive number R ( t ) are displayed, together with the central estimate R ( t ; ˆ θ OLS )(solid curve) and the median of the R ( t ) samples at each time point (dashed curve). Panel (c)exhibits the residuals y j − z ( t j ; ˆ θ OLS ) versus the model predictions z ( t j ; ˆ θ OLS ). In Panel (d) eachresidual is displayed versus the observation time point t j = j , for j = 1 . . . , n .41 Time N u m b e r o f ca s e s
10 15 20 25 30
Time R ( t ) s a m p l e s (a)(b) M od i f i e d r e s i du a l
10 15 20 25 30Time-0.75-0.5-0.2500.250.5 M od i f i e d r e s i du a l (c) (d) Figure 16: Estimation of three epidemiological parameters, using truncated data from influenzaseason 1998-1999 and GLS with each weight equal to 1 /z ( t j ; θ ) for j = 1 , . . . , n . Panel (a) depictsthe observations (solid squares) as well as the model prediction (solid curve). In Panel (b) 1 , m = 10 ,
000 samples of the effective reproductive number R ( t ) are displayed. The solidcurve depicts the central estimate R ( t ; ˆ θ GLS ) and, at each time point, the dashed curve depicts themedian of the R ( t ) samples. Panel (c) exhibits the modified residuals ( y j − z ( t j ; ˆ θ GLS )) /z ( t j ; ˆ θ GLS )versus the model predictions z ( t j ; ˆ θ GLS ). In Panel (d) each modified residual is displayed versusthe observation time point t j = j , for j = 1 . . . , n .42 Time N u m b e r o f ca s e s
10 15 20 25 30
Time R ( t ) s a m p l e s (a)(b) M od i f i e d r e s i du a l
10 15 20 25 30Time-3-2-10123 M od i f i e d r e s i du a l (c) (d) Figure 17: Estimation of three epidemiological parameters using truncated influenza data fromseason 1998-1999. GLS was used, with each weight equal to 1 /z ( t j ; θ ) for j = 1 , . . . , n . Panel (a)depicts the observations (solid squares) as well as the model prediction (solid curve). In Panel(b) 1 ,
000 of the m = 10 ,
000 samples of the effective reproductive number R ( t ) are displayed.Also shown are the central estimate R ( t ; ˆ θ GLS ) (solid curve) together with the median of the R ( t )samples (dashed curve). Panel (c) exhibits the modified residuals ( y j − z ( t j ; ˆ θ GLS )) /z ( t j ; ˆ θ GLS ) versus the model predictions z ( t j ; ˆ θ GLS ). In Panel (d) each modified residual is displayed versusthe observation time point t j = j , for j = 1 . . . , n .43able 12: Estimation of three epidemiological parameters. Results obtained by applying GLS, withweights equal to 1 /z ( t j ; θ ) , to truncated influenza data from season 1998-1999.Parameter Estimate Standard error S . × . × I . × × − ˜ β × − . × − L (ˆ θ GLS ) = 9 . × − ˆ σ GLS = 5 . × − Min. R ( t ; ˆ θ GLS ) 0.752 [0.740,0.774]Max. R ( t ; ˆ θ GLS ) 1.302 [1.274,1.320]Table 13: Estimation of three epidemiological parameters. Results obtained by applying GLS, withweights equal to 1 /z ( t j ; θ ), to the truncated influenza data set from season 1998-1999.Parameter Estimate Standard error S × × I . × . × − ˜ β × − × − L (ˆ θ GLS ) =3.872 × ˆ σ GLS = 2 . × Min. R ( t ; ˆ θ GLS ) 0.750 [0.739,0.767]Max. R ( t ; ˆ θ GLS ) 1.306 [1.283,1.321]44
Discussion
We have presented parameter estimation methodologies that, using sensitivity analysis andasymptotic statistical theory, also provide measures of uncertainty for the estimated parameters.The techniques were illustrated using synthetic data sets, and it was seen that they can performvery well with reasonable data sets. Even within the ideal situation provided by syntheticdata, potential problems of the approach were identified. Worringly, these problems were notapparent from inspection of the uncertainty estimates (standard errors) alone. However, theseproblems were revealed by examination of model fit diagnostic plots, constructed in terms ofthe residuals of the fitted model. These results argue strongly for the routine use of uncertaintyestimation, together with careful examination of residuals plots when using SIR-type modelswith surveillance data.The statistical methodology presented here only addresses the impact of observation erroron parameter estimation. While the approach can handle different statistical models for theobservation process, it does assume that we have a model that correctly describes the behaviorof the system, albeit for an unknown value of the parameter vector. The methodology does notexamine the effect of mis-specification of the model. It is well-known that this effect can dwarfthe uncertainty that arises from observation error [24]. Examination of residuals plots, however,can identify systematic deviations between the behavior of the model and the data.Application of the least squares approaches to the influenza isolate data gave mixed results.Estimates of the effective reproductive number were in broad agreement with results obtained inother studies (see Table 14). While apparently reasonable fits were obtained in some instances,the uncertainty analyses highlighted situations in which visual inspection suggested that a goodfit had been obtained but for which estimated parameters had large uncertainties. Residualsplots showed that error variance may not have been constant (i.e., observation noise was notsimply additive), but more likely scaled according to either the square of the fitted value (i.e.,relative measurement error) or the fitted value itself. The potentially large impact of errors atlow numbers of cases on the GLS estimation process was clearly observed.Temporal trends were observed in the residuals plots, indicative of systematic differencesbetween the behavior of the SIR model and the data. Potential sources of these differencesinclude inadequacies of the model to describe the process underlying the data and issues withthe reliability of the data itself, particularly in the light of the health warning attached to thedata by the CDC. (We emphasize, however, that our use of these data sets should be seen asonly an illustration of the approach.) R stands for the basic reproductive number (naive population), whilemax( R ( t )) denotes the initial effective reproductive number in a non-naive population.Studies of interpandemic influenza EstimatesBonabeau et al. [5] 1 . ≤ R ≤ . . ≤ max( R ( t )) ≤ . . ≤ R ≤ . R = 1 . . ≤ R ≤ . . ≤ max( R ( t )) ≤ . Sophisticated mathematical and statistical algorithms and analyses can be utilized to fit SIR-type epidemiological models to surveillance data. Good quality data is required if this approachis to be successful. In many instances, however, the available surveillance data is most likelyinadequate to validate the SIR model with any degree of confidence. This is likely to be true inmuch of the modeling efforts for epidemics where the data collection process has inadequacies.
This material was based upon work supported by the Statistical and Applied MathematicalSciences Institute (SAMSI postdoctoral fellowship to A. C.-A.) which is funded by the Na-tional Science Foundation under Agreement No. DMS-0112069. Any opinions, findings, andconclusions or recommendations expressed in this material are those of the authors and do notnecessarily reflect the views of the National Science Foundation.
References [1] R. Anderson and R. May,
Infectious Diseases of Humans: Dynamics and Control , OxfordUniversity Press, 1991.[2] P. Bai, H. T. Banks, S. Dediu, A. Y. Govan, M. Last, A. L. Lloyd, H. K. Nguyen, M. S.Olufsen, G. Rempala, and B. D. Slenning, Stochastic and deterministic models for agricul-tural production networks,
Math. Biosci. Eng. (2007), 373-402.[3] H. T. Banks, M. Davidian, J. R. Samuels, Jr., and K. L. Sutton An inverse problem statisti-cal methodology summary. Center for Research in Scientific Computation Technical Report RSC-TR08-1, NCSU, Raleigh NC. 2008; in
Mathematical and Statistical Estimation Ap-proaches in Epidemiology , (G. Chowell, et. al., eds.), Springer, New York, to appear.[4] H. T. Banks, S. L. Ernstberger and S. L. Grove, Standard errors and confidence intervalsin inverse problems: Sensitivity and associated pitfalls. Center for Research in ScientificComputation. Technical Report CRSC-TR06-10, NCSU, Raleigh NC. 2006;
J. Inv. Ill-posedProblems (2006), 1-18.[5] E. Bonabeau, L. Toubiana, and A. Flahault, The geographical spread of influenza, Proc. R.Soc. Lond. B (1998), 2421-2425.[6] F. Brauer and C. Castillo-Ch´avez,
Mathematical Models in Population Biology and Epidemi-ology , Springer, New York, 2001.[7] R. J. Carroll, C. F. Wu, and D. Ruppert, The effect of estimating weights in weighted leastsquares,
J. Am. Stat. Assoc. (1988), 1045-1054.[8] S. Cauchemez, F. Carrat, C. Viboud, A. J. Valleron, and P. Y. Boelle, A Bayesian MCMCapproach to study transmission of influenza: application to household longitudinal data, Stat. Med. Epidemiol. Infect. (2008), 852-864.[11] R. B. Couch and J. A. Kasel, Immunity to influenza in man,
Ann. Rev. Microbiol. (1983), 529-549.[12] J. B. Cruz, Jr., ed., System Sensitivity Analysis , Dowden, Hutchinson & Ross, Inc., Strouds-berg, PA, 1973.[13] M. Davidian and D. M. Giltinan,
Nonlinear Models for Repeated Measurement Data , Chap-man & Hall, Boca Raton, 1995.[14] K. Dietz, The estimation of the basic reproduction number for infectious diseases,
Stat.Methods Med. Res. (1993), 23-41.[15] J. Dushoff, J. B. Plotkin, S. A. Levin, and D. J. D. Earn, Dynamical resonance can accountfor seasonality of influenza epidemics, Proc. Natl. Acad. Sci. USA (2004), 16915-16916.
16] M. Eslami,
Theory of Sensitivity in Dynamic Systems: An Introduction , Springer-Verlag,New York, NY, 1994.[17] N. M. Ferguson, A. P. Galvani, and R. M. Bush, Ecological and immunological determinantsof influenza evolution,
Nature (2003), 428-433.[18] A. Flahault, S. Letrait, P. Blin, S. Hazout, J. Menares, and A. J. Valleron, Modeling the1985 influenza epidemic in France,
Stat. Med. (1988), 1147-1155.[19] P. M. Frank, Introduction to System Sensitivity Theory , Academic Press, New York, NY,1978.[20] H. Hethcote, The mathematics of infectious diseases,
SIAM Rev. (2000), 599-653.[21] M. Kleiber, H. Antunez, T. D. Hien, and P. Kowalczyk, Parameter Sensitivity in NonlinearMechanics: Theory and Finite Element Computations , John Wiley & Sons, Chichester, 1997.[22] J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright, Convergence properties ofthe Nelder-Mead simplex method in low dimensions,
SIAM J. Optimiz. (1999), 112-147.[23] I. M. Longini, J. S. Koopman, A. S. Monto, and J. P. Fox, Estimating household andcommunity transmission parameters for influenza, Am. J. Epidemiol. (1982), 736-751.[24] A. L. Lloyd, The dependence of viral parameter estimates on the assumed viral life cycle:limitations of studies of viral load data,
Proc. R. Soc. Lond. B. (2001), 847-854.[25] M. Nuno, G. Chowell, X. Wang, C. Castillo-Ch´avez, On the role of cross-immunity andvaccination in the survival of less-fit flu strains,
Theor. Pop. Biol. (2007), 20-29.[26] A. Saltelli, K. Chan, and E. M. Scott, eds., Sensitivity Analysis , John Wiley & Sons,Chichester, 2000.[27] G. A. F. Seber and C. J. Wild,
Nonlinear Regression , John Wiley & Sons, Chichester, 2003.[28] C. C. Spicer and C. J. Lawrence, Epidemic influenza in Greater London,
J. Hyg. Camb. (1984), 105-112.[29] W. Thompson, D. Shay, E. Weintraub, L. Brammer, N. Cox, L. Anderson, and K. Fukuda,Mortality associated with influenza and respiratory syncytial virus in the United States, JAMA
31] C. Viboud, T. Tam, D. Fleming, A. Handel, M. Miller, and L. Simonsen, Transmissibilityand mortality impact of epidemic and pandemic influenza, with emphasis on the unusuallydeadly 1951 epidemic,
Vaccine (2006), 6701-6707.(2006), 6701-6707.