A robust nonlinear mixed-effects model for COVID-19 deaths data
Fernanda L. Schumacher, Clecio S. Ferreira, Marcos O. Prates, Alberto Lachos, Victor H. Lachos
AA ROBUST NONLINEAR MIXED - EFFECTS MODEL FOR
COVID-19
DEATHS DATA
Fernanda L. Schumacher
Department of StatisticsUniversidade Estadual de CampinasCampinas, SP Brazil [email protected]
Clecio S. Ferreira
Department of StatisticsUniversidade Federal de Juiz de ForaJuiz de Fora, MG Brazil [email protected]
Marcos O. Prates
Department of StatisticsUniversidad Federal de Minas GeraisMinas Gerai, MG, Brazil [email protected]
Alberto Lachos
National Institute of Neoplastic DiseasesINENLima, Peru [email protected]
Victor H. Lachos
Department of StatisticsUniversity of ConnecticutStorrs, CT 06250 [email protected]
August 4, 2020 A BSTRACT
The analysis of complex longitudinal data such as COVID-19 deaths is challenging due to sev-eral inherent features: (i) Similarly-shaped profiles with different decay patterns; (ii) Unexplainedvariation among repeated measurements within each country, these repeated measurements may beviewed as clustered data since they are taken on the same country at roughly the same time; (iii)Skewness, outliers or skew-heavy-tailed noises are possibly embodied within response variables.This article formulates a robust nonlinear mixed-effects model based in the class of scale mixturesof skew-normal distributions for modeling COVID-19 deaths, which allows the analysts to modelsuch data in the presence of the above described features simultaneously. An efficient EM-typealgorithm is proposed to carry out maximum likelihood estimation of model parameters. The boot-strap method is used to determine inherent characteristics of the nonlinear individual profiles suchas confidence interval of the predicted deaths and fitted curves. The target is to model COVID-19deaths curves from some Latin American countries since this region is the new epicenter of the dis-ease. Moreover, since a mixed-effect framework borrows information from the population-averageeffects, in our analysis we include some countries from Europe and North America that are in a moreadvanced stage of their COVID-19 deaths curve. K eywords COVID-19 deaths data · Nonlinear mixed-effects models · Scale mixtures of skew-normal distributions.
The world is facing a global pandemic of coronavirus disease (COVID-19), caused by severe acute respiratory syn-drome coronavirus 2 (SARS-CoV-2). Initiated at the end of 2019 in Wuhan, China, we are coming, on the 26th ofJune of 2020, to more than 9 millions cases and 500 thousands deaths spread in 216 countries. The World HealthOrganization (WHO) has made a joint effort with nations to tackle the disease. Some countries as Germany, Italy,Spain and United Kingdom faced the peak of the disease in March 2020, and now appear to have the disease un-der control. In other regions such as South-East Asia and Africa it seems to be at the beginning (panel at WHO, https://covid19.who.int/ ). Currently, Latin American countries are the new epicenter of the disease.Some institutions around the world have dedicated to data collection and graphical analysis. For example, the panelof the WHO presents graphs of the number of cases and deaths over time for regions and countries. The Johns a r X i v : . [ s t a t . A P ] A ug PREPRINT - A
UGUST
4, 2020Hopkins University collects and makes available the daily data, besides producing maps of the occurrence by countries( https://coronavirus.jhu.edu/map.html ). Another available repository is the website Worldometers( ).There are many institutions in the world modeling and performing forecast of the number of cases and deaths ofCOVID-19. For example, at the beginning of March 2020, an study of the Oxford University projected more than thousand of deaths in Brazil ( ). Another study fromthe University of Washington’s Institute for Health Metrics and Evaluation (IHME) projects a total of thousand ofdeaths ( https://covid19.healthdata.org/projections ). The Imperial College presented an specificstudy to Brazil that estimates with high confidence that the reproduction number remained above , indicating that theepidemic was not yet controlled and would continue to grow (Mellan et al., 08-05-2020). In Brazil, there are someorganizations studying the evolution of the disease over time allowing online apps with daily updates, for instance,the Department of Statistics of the Federal University of Minas Gerais (CovidLPTeam, 2020) and the Department ofStatistics of the Federal University of Juiz de Fora (JF Salvando Todos Team, 2020).The Americas have become the epicenter of the global coronavirus outbreak, logging nearly million infections and , deaths. Brazil, Chile, Colombia, Mexico, and Peru have been particularly hard hit in recent weeks. To pointout the possible impact of the disease in Latin America, the CovidLPTeam (2020), in June th , estimates a total ofmore than thousand deaths only for these countries until the end of the pandemic.Some studies for modeling the COVID-19 data are based on nonlinear models. Tsallis and Tirnakli (2020) proposed anonlinear model based on volume of stock-markets to predict the number of active cases: N t = C ( t − t ) α [1 + ( q − β ( t − t ) γ ] / ( q − , where C > , α > , β > , γ > , q > and t ≥ . The constant t indicates the first day of appearance of theepidemic in that particular region; it is conventionally chosen to be zero for China; for other countries, it is the numberof days elapsed between the appearance of the first case in China and the first case in that country. The normalizingconstant C reflects the total population of that particular country. We refer to Tsallis and Tirnakli (2020), for moredetails about the interpretation of the parameters α, β, γ, q .The CovidLPTeam (2020) proposed an hierarchical Bayesian non-Gaussian and non-linear model to capture the dy-namics of the epidemic (last accessed on the th of June on 2020). The daily counts are assumed to come from aPoisson distribution and the daily pandemic evolving by a generalized logistic dynamic: N t ∼ P oisson ( µ t ) ,µ t = d f cae − ct ( b + e − ct ) f +1 , (1)where d is responsible to capture subnotification on specific day(s), c control the infection rate, f is an asymmetryparameter, a , b and f control the asymptote of the curve, given by ab f , with the peak occurring at time t = − ln ( b/f ) c .However, to the best of our knowledge, most models developed up to this date are univariate, not taking into accounta possible structure of dependence on the disease among countries or regions. Also, as the onset of the disease occursat different periods between countries, those at a more advanced stage of the disease can provide valuable informationto countries at an early stage.In recent years, nonlinear mixed-effects (NLME) models have been proposed for modeling many complex longitudinaldata (Lindstrom and Bates, 1990; Wu, 2010). However, one often assumes that both random error and random effectsare normally distributed, which may not always give reliable results if the data exhibit excessive skewness and heavytailedness, as is the case of COVID-19 deaths data. In this paper, we present a novel class of asymmetric NLMEmodels that provides an efficient estimation of the parameters in the analysis of longitudinal data. We assume that,marginally, the random effects follow a scale mixtures of skew-normal distribution (SMSN, Branco and Dey, 2001)and that the random errors follow a symmetric scale mixtures of normal distribution (SMN, Lange and Sinsheimer,1993) providing an appealing robust alternative to the usual normal distribution in NLME models. We propose anapproximate likelihood analysis for maximum likelihood (ML) estimation via an EM-type algorithm that producesaccurate ML estimates and significantly reduces the numerical difficulty associated with the exact ML estimation.The newly approximate procedures are applied to COVID-19 deaths data and it is showed that models with skew-heavy-tailed assumption may provide more reasonable results and these may be important for COVID-19 research inproviding quantitative guidance to better understand the stages and future development of the disease.The purpose of this study is to predict the number of deaths caused by COVID-19 to short and long term, in countriesat early stage such as Peru, Mexico, Chile, Brazil and Colombia along with some countries at a more advanced stage2 PREPRINT - A
UGUST
4, 2020of the disease such as Belgium, Italy, the USA and the United Kingdom. Furthermore, a general bootstrap method isused for constructing confidence intervals for the fitted COVID-19 deaths curves and its mode (date of the peak) forthe selected countries. The method proposed in this paper is implemented in R (R Core Team, 2019), and the codes areavailable for download from Github ( https://github.com/fernandalschumacher/NLMECOVID19 ).The rest of the paper is organized as follows. In Section 2, we describe the motivating COVID-19 deaths datasetsobtained from the John Hopkins repository. In Section 3, we present the methodology and associated ML estimationprocedure via the EM-algorithm. In Section 4, we present our result, where we forecast the total number of deaths inthe selected countries for the short term and days and long term and days with starting point at June 25 th The Johns Hopkins University through the Center for Systems Science and Engineering created a COVID-19 reposi-tory ( https://github.com/CSSEGISandData/COVID-19/ ) that is updated daily with data from many lo-cations of the world. The repository combine data from a variety of official agencies and others reliable sources andunify them.Due to different testing capacity between countries, subnotification is a challenge to understand the true contaminationnumbers of COVID-19 in the population. Despite that subnotification might also happen in number of deaths, they areless likely to be affected by detection biases. In fact, recent studies use the number of deaths as a proxy measure forCOVID-19 cases (see, for example, Maugeri et al., 2020; Ribeiro and Bernardes, 2020; Amaro et al., 2020). Figure 1shows the reported number of daily deaths in different countries. lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l llllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l llllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l lll Peru Chile ColombiaUS Brazil MexicoBelgium Italy United Kingdom0 50 100 150 0 50 100 150 0 50 100 15004008001200050010001500050100150200−25002505007500500100015000200400600−10001002003004005000100020000100200300 days since first death dea t h s Figure 1: Number of daily reported deaths since first death for the nine countries considered in this study, until 24 th of June 2020, with a LOESS non-parametric estimated curve superimposed.As it can be seen the selected countries are in different stages of the COVID-19 pandemic. Belgium, Italy, the UnitedKingdom (UK), and the United States of America (US) present a controlled number of deaths by the disease. Othercountries like Brazil, Mexico, and Peru seem to be around their peak, while Chile and Colombia apparently are in theincreasing part of the curve. Figure 1 also presents a LOESS (Cleveland, 1979) fit and a days prediction for each3 PREPRINT - A
UGUST
4, 2020series. Non-parametric curves are alternatives for model fitting because of their flexibility. As it can be seen, the curveshave difficulties to capture the rapid increase in deaths in some countries but overall provide a reasonable fit. However,when using this model to make prediction we can clearly see that their extrapolation capacities are not reliable, makingnecessary the usage of a more adequate statistical parametric model to capture the nature of the pandemic and providesensible fit and prediction.Although that Peru moved quickly to lock down its citizens as the pandemic took hold in early March and has extendedthe lockdown until the end of June 2020, cases nonetheless exploded in May, reaching a peak of more than , casesper day late in May, in part explained because the informal employment reaches of the Peruvian labor market.Peru has now reported , cases and , deaths of COVID-19 (Worldometers.info, 2020), the second-highestnumber of confirmed cases of the disease in Latin America, behind Brazil, and the seventh-highest globally.In Mexico, the accumulated COVID-19 cases reached , and a death total of , until June th . Moreover,the statistics show that the curve is ascending (Worldometers.info, 2020). Torrealba-Rodriguez et al. (2020) presenteda modeling and prediction of accumulated cases of COVID-19 infection in Mexico using the models of Gompertz andLogistic and a framework of Artificial Neural Network. These models predict the peak between May th and June th , which might be unrealistic as presented in Section 4.Colombia is in an ascending stage of the curve of cases and deaths, having accumulated , cases and , deathsuntil June, th according to Worldometers.info (2020). Rivera-Rodriguez and Urdinola (2020) used a SEIR modelto estimate the number of patients that would required Intensive Care Units (ICU) care (critical), and only hospitalcare (severe) in order to manage their limited resources. De Castro (2020) used a SIR model to estimate the numberof cases and deaths in Colombia using data of Italy and South Korea where the peak of infections would be by June th . As shown in Figure 1, Colombia presents an increasing pattern and the peak should be observed somewhere inthe future.For Chile, the statistics are of , accumulated cases and a total of , deaths until June th (Worldome-ters.info, 2020). It is important to observe that there are two days with observations far above the standard: number ofdeaths of (June th ) and number of cases of , (June th ).Lastly, Brazil presented , , cases and , deaths until June th according to Worldometers.info (2020)and is the country with second-highest number of confirmed cases and deaths in the world. CovidLPTeam (2020)estimates that the Brazilian peak of deaths should happen in June th and a estimated total number of death of , which is closely related to the predictions presented in Section 4. From Figure 1 we can see that Brazil is thecountry in more advanced stage of the evolution of the disease in the Latin American region.The different stages of spread of the disease are an important information that should be used into modeling. In thenext section we introduce our methodology that jointly accommodates the different stages of the diseases and borrowinformation of the different time series to provide a more robust and reliable fit and prediction. The idea of the SMSN distributions originated from an early work by Branco and Dey (2001), which included theskew-normal (SN) distribution as a special case. We say that a p × random vector Y follows a SN distribution with p × location vector µ , p × p positive definite dispersion matrix Σ and p × skewness parameter vector λ , (oftenknown as the shape parameter) and write Y ∼ SN p ( µ , Σ , λ ) , if its probability density function (pdf) is given by f ( y ) = 2 φ p ( y ; µ , Σ )Φ( λ (cid:62) Σ − / ( y − µ )) , (2)where φ p ( . ; µ , Σ ) stands for the pdf of the p –variate normal distribution with mean vector µ and dispersion matrix Σ , N p ( µ , Σ ) say, and Φ( . ) is the cumulative distribution function (cdf) of the standard univariate normal. Notefor λ = , (2) reduces to the symmetric N p ( µ , Σ ) -pdf, while for non-zero values of λ , it produces a perturbed(asymmetric) family of N p ( µ , Σ ) -pdf’s. Let Z = Y − µ . Since a Z ∼ SN p ( , a Σ , λ ) , for all scalar a > , theSMSN family can be defined as follows: a SMSN distribution is that of a p − dimensional random vector Y = µ + U − / Z , (3)where U is a positive random variable with the cdf H ( u ; ν ) and pdf h ( u ; ν ) , and independent of the SN p ( , Σ , λ ) –random vector Z . Here ν is a scalar or vector parameter indexing the distribution of the mixing scale factor U . Given U = u , Y follows a multivariate skew–normal distribution with location vector µ , scale matrix u − Σ and skewness4 PREPRINT - A
UGUST
4, 2020parameter vector λ , i.e., Y | U = u ∼ SN p ( µ , u − Σ , λ ) . Thus, by (2), the marginal pdf of Y is f ( y ) = 2 (cid:90) ∞ φ p ( y ; µ , u − Σ )Φ( u / λ (cid:62) Σ − / ( y − µ )) dH ( u ; ν ) . (4)The notation Y ∼ SMSN p ( µ , Σ , λ ; H ) will be used when Y has pdf (4). The asymmetrical class of SMSN dis-tributions includes the skew– t (ST), the skew–slash (SSL), and the skew–contaminated normal (SCN). All thesedistributions have heavier tails than the skew-normal and can be used for robust inferences. When λ = , theSMSN class reduces to the scale mixtures of normal (SMN) class, which is represented by the pdf f ( y ) = (cid:82) ∞ φ p ( y ; µ , u − Σ ) dH ( u ; ν ) . We use the notation Y ∼ SMN p ( µ , Σ ; H ) when Y has distribution in the SMNclass. We refer to Schumacher et al. (2020b) for details and additional properties related to this class of distributions. In this section, we present the models and methods in general forms, illustrating that our methods may be applicablein other applications as well. Denote the number of subjects by n and the number of measurements on the i th subjectby n i . For notational convenience, let x ij ( i = 1 , , . . . , n ; j = 1 , , . . . , n i ) be a vector incorporating independentvariables such as number of ICU beds, β ij = ( β ij , . . . , β sij ) (cid:62) , β = ( β , . . . , β r ) (cid:62) ( r > s ) . The NLME model canbe written as y i = η i ( t ij , β ij ) + (cid:15) i , β ij = d ( x ij , β , b i ) , (5)where the subscript i is the subject index, y i = ( y i , . . . , y in i ) (cid:62) , with y ij being the response value for individual i at time t ij , η i ( t ij , β ij ) = ( η ( t i , β i ) (cid:62) , . . . , η ( t in i , β in i )) (cid:62) , with η ( · ) being a nonlinear known function, (cid:15) =( (cid:15) i , . . . , (cid:15) in i ) (cid:62) is random error vector, d ( . ) is an s -dimensional linear function, b i = ( b i , . . . , b qi ) (cid:62) is the vector ofrandom effects ( q ≤ s ) .Following Schumacher et al. (2020a), we assume that (cid:18) b i (cid:15) i (cid:19) ind ∼ SMSN q + n i (cid:18)(cid:18) c ∆0 (cid:19) , (cid:18) D 00 σ I n i (cid:19) , (cid:18) λ (cid:19) ; H (cid:19) , i ∈ { , . . . , n } , (6)where c = − (cid:113) π k , with k = E { U − / } , ∆ = D / δ , with δ = λ / (cid:112) λ (cid:62) λ , σ is the unknown within-subject variance, D = D ( α ) is the q × q variance-covariance matrix of b i , which depends on unknown and reducedparameters α of dimension v × and λ = ( λ , . . . , λ q ) (cid:62) is a q × vector of skewness parameters for the randomeffects. Using the definition of a SMSN random vector and (6), it follows that marginally b i iid ∼ SMSN q ( c ∆ , D , λ ; H ) and (cid:15) i ind ∼ SMN n i ( , σ I n i ; H ) , i ∈ { , . . . , n } , (7)so that E { b i } = E { (cid:15) i } = . Thus this model considers that the (cid:15) i ’s, related to within-subject errors are symmetricallydistributed, while the distribution of random effects is assumed to be asymmetric and with mean zero. In addition,under this consideration the regression parameters are all comparable.Let θ = ( β (cid:62) , σ , α (cid:62) , λ (cid:62) , ν (cid:62) ) (cid:62) , then classical inference on the parameter vector θ is based on the marginal distribu-tion for Y = ( y (cid:62) , . . . , y (cid:62) n ) . Thus, the integrated likelihood of (5)-(6) for θ in this case is given by L ( θ ) = 2 n (cid:89) i =1 (cid:90) ∞ (cid:90) R q φ n i ( y i ; η i ( t ij , β ij ) , u − i σ I n i ) φ q ( b i ; c ∆ , u − i D ) × Φ( u / i λ (cid:62) D − / ( b i − c ∆ )) d b i dH ( u i ; ν ) , (8)which generally does not have a closed form expression because the model function is not linear in the random effects.In the normal case, to make the numerical optimization of the likelihood function in a tractable problem, differentapproximations to (8) have been proposed. Some of these methods consist of taking a first-order Taylor expansion ofthe model function around the conditional models of the random effects b (Lindstrom and Bates, 1990). Followingthis idea, the marginal distribution of Y i , for i ∈ { , . . . , n } , can be approximated as Y i . ∼ SMSN n i ( η i ( t ij , d ( x ij , β , (cid:101) b i )) − (cid:101) H i ( (cid:101) b i − c ∆ ) , (cid:101) Ψ i , (cid:101) ¯ λ ; H ) , (9)where (cid:101) Ψ i = (cid:101) H i D (cid:101) H (cid:62) i + σ I n i , (cid:101) H i = ∂η i ( t ij , d ( x ij , β , b i )) ∂ b (cid:62) i | b i = (cid:101) b i , (cid:101) b i is an expansion point in a neighborhood of b i , (cid:101) ¯ λ i = (cid:101) Ψ − / i (cid:101) H i D ζ (cid:113) ζ (cid:62) (cid:101) Λ i ζ , with ζ = D − / λ , (cid:101) Λ i = ( D − + σ − (cid:101) H (cid:62) i (cid:101) H i ) − , and “ . ∼ ” denotes approximated in5 PREPRINT - A
UGUST
4, 2020distribution.The approximated empirical Bayes estimator of b i , denoted by (cid:101) b i , obtained by the conditional mean of b i given Y i ,is (cid:101) b ( k ) i ( θ ) ≈ E { b i | Y i = y i , θ } ≈ (cid:101) µ bi + (cid:101) τ − i (cid:113) ζ (cid:62) (cid:101) Λ i ζ (cid:101) Λ i ζ , (10)where (cid:101) µ bi = c ∆ + D (cid:101) H (cid:62) i (cid:101) Ψ − / i (cid:101) y i and (cid:101) τ − i = E { U − / W Φ ( U / (cid:101) A i ) | Y i } , with W Φ ( x ) = φ ( x ) / Φ( x ) , x ∈ R , (cid:101) y i = Ψ − / i ( y i − η i ( t ij , d ( x ij , β , (cid:101) b ( k − i )) + (cid:101) H i (cid:101) b ( k − i − c (cid:101) H i ∆ ) and (cid:101) A i = (cid:101) ¯ λ (cid:62) i (cid:101) y i . We refer to Lachos et al.(2010), Schumacher et al. (2020b) and Schumacher et al. (2020a), for further details.
In this section, we demonstrate how to use the EM algorithm (Dempster et al., 1977) to obtain approximate maximumlikelihood (ML) estimator of a SMSN–NLME model. We denote the current estimates of ( β , b i ) by ( (cid:101) β , (cid:101) b i ) . In thiscase, the linearization procedure (Wu, 2010) consists of taking the first-order Taylor expansion of η i around the currentparameter estimate (cid:101) β and the random effect estimates (cid:101) b i , which is equivalent to iteratively solving the following linearmixed-effect (LME) model (cid:101) Y i = (cid:102) W i β + (cid:101) H i b i + (cid:15) i , (11)for i ∈ { , . . . , n } , where (cid:101) Y i = Y i − (cid:101) η i ( t ij , d ( x ij , (cid:101) β , (cid:101) b i )) , (cid:101) η i ( t ij , d ( x ij , (cid:101) β , (cid:101) b i )) = η i ( t ij , d ( x ij , β , (cid:101) b i )) − (cid:101) H i (cid:101) b i − (cid:102) W i (cid:101) β , (cid:101) Ψ i = (cid:101) H i D (cid:101) H (cid:62) i + σ I n i , (cid:101) H i = ∂η i ( t ij , d ( x ij , (cid:101) β , b i )) ∂ b (cid:62) i | b i = (cid:101) b i , (cid:102) W i = ∂η i ( t ij , d ( x ij , β , (cid:101) b i )) ∂ β (cid:62) | β = (cid:101) β , (cid:101) ¯ λ i = (cid:101) Ψ − / i (cid:101) H i D ζ (cid:113) ζ (cid:62) (cid:101) Λ i ζ , with ζ = D − / λ , (cid:101) Λ i = ( D − + σ − (cid:101) H (cid:62) i (cid:101) H i ) − , b i iid ∼ SMSN q ( c ∆ , D , λ , H ) and (cid:15) i ind. ∼ SMN n i ( , σ I n i H ) . The model defined in (11) can be seen as a slight modification of the general SMSN-LME model proposed by Lachos et al. (2010) and Schumacher et al. (2020b), where a simple and efficient EM-typealgorithm for iteratively computing ML estimates of the parameters in the SMSN-LME model has been proposed andfor which we use the skewlmm package in R (Schumacher et al., 2020c; R Core Team, 2019). The approximatedlikelihood function, derived from (9), can be easily computed as a byproduct of the EM-algorithm and is used formonitoring convergence and for model selection, such as, the Akaike (AIC) and Bayesian Information Criterion (BIC)(Wit et al., 2012). Suppose now that we are interested in the prediction of y + i , a future υ × vector of measurement of Y i , given theobserved measurement Y = ( Y (cid:62) ( i ) , Y (cid:62) i ) (cid:62) , where Y ( i ) = ( Y (cid:62) , . . . , Y (cid:62) i − , Y (cid:62) i +1 , . . . , Y (cid:62) n ) . The minimum meansquare error (MSE) predictor of y + i , defined as the conditional expectation y + i given Y i and θ , is given next.Let (cid:101) b i be an expansion point in a neighborhood of b i , y + i be an υ × vector of future measurement of Y i (or possiblymissing), x + i and t + i be an υ × r matrix of known prediction regressors. Then, under the SMSN–NLME model thepredictor (or minimum MSE predictor) of y + i can be approximated as (cid:101) y + i ( θ ) = E { y + i | Y i , θ } ≈ (cid:101) µ . + (cid:101) Ψ i . υ (2) i (cid:113) υ (2) (cid:62) i (cid:101) Ψ i . υ (2) i τ − i , (12)where (cid:101) µ . = η ( t + ij , d ( x + ij , β , (cid:101) b i )) − (cid:101) H + i ( (cid:101) b i − c ∆ ) + (cid:101) Ψ ∗ i (cid:101) Ψ ∗− i (cid:16) Y i − η ( t ij , d ( x ij , β , (cid:101) b i )) + (cid:101) H i ( (cid:101) b i − c ∆ ) (cid:17) , (cid:101) Ψ i . = (cid:101) Ψ ∗ i − (cid:101) Ψ ∗ i (cid:101) Ψ ∗− i (cid:101) Ψ ∗ i , (cid:101) Ψ ∗ i = (cid:101) Ψ i = (cid:101) H i D (cid:101) H (cid:62) i + σ I n i , (cid:101) Ψ ∗ i = (cid:101) Ψ ∗ i = (cid:101) H + i D (cid:101) H + (cid:62) i , (cid:101) Ψ ∗ i = (cid:101) H + i D (cid:101) H + (cid:62) i + σ I υ , (cid:101) Ψ ∗− / i (cid:101) ¯ λ ∗ i = ( υ (1) (cid:62) i , υ (2) (cid:62) i ) (cid:62) and τ − i = E (cid:110) U − / i W Φ (cid:16) U / i (cid:101) υ (cid:62) i ( Y i − η ( t ij , d ( x ij , β , (cid:101) b i )) + (cid:101) H i ( (cid:101) b i − c ∆ )) (cid:17) | Y i (cid:111) , PREPRINT - A
UGUST
4, 2020with (cid:101) υ i = υ (1) i + (cid:101) Ψ ∗− i (cid:101) Ψ ∗ i υ (2) i (cid:113) υ (2) (cid:62) i (cid:101) Ψ ∗ i . υ (2) i . The expression for τ − i are given in Schumacher et al. (2020b). Bootstrap methods are statistical tools used in many statistical problems such as calculus of bias, variance, confidenceintervals, sampling distributions of the estimators, among others. For technical details, we refer to Efron and Hastie(2016). In this paper, we use the parametric Bootstrap to calculate a confidence interval for the mode (date of the peak)of COVID-19 deaths by country. Also, we construct confidence bands to the estimated COVID-19 deaths curve alongthe time. We generate M random samples of length n of the model (eq. 5–6) using the estimated parameters and foreach random sample, we fit the proposed model and calculate the statistical mode, the fitted and predicted curve ( (cid:98) Y t ).So, for each country, we will have M o , ..., M o M estimates of the mode and (cid:98) Y t , ..., (cid:98) Y t M . We calculate the percentiles α/ and − α/ to construct a confidence interval of level α for the mode and for the fitted curve of the COVID-19deaths. In this notes, we use α = 5% .To propose the confidence interval for the peak of deaths, we select the two dates such that the . percentile curvecoincides with the highest value of the . percentile curve. Thus, we guarantee that all curves belonging to theconfidence band will peak within this interval. In order to analyse the COVID-19 data described in Section 2, we propose to fit the SMSN-NLME model given in (5)with the derivative of the generalized logistic model as nonlinear function, which can be written as follows: η ( t ij , β ij ) = α α i α exp {− α i t ij } ( α i + exp {− α i t ij } ) α +1 , (13)where α i = exp { β + b i } , α i = exp { β + b i } , and α k = exp { β k } , for k = 1 , , with the exponential transfor-mation being used to ensure positiveness of the parameters. Note that this function is similar to the one considered inthe univariate approach of CovidLPTeam (2020), as presented in (1), except that in (13) random effects are includedto enable a multivariate approach and borrow information between the different time series.For numerical stability, we use the linear transformation y ij = z ij /k z , where z ij is the number of registered deathsat the i th country and j th day since first death, and k z = 33 . is chosen to be the sample standard deviation fromthe Colombia data, which is the smaller one in the observed data. For model selection, we consider AIC and BIC, asdescribed in Section 3.3, for distributions normal (N), skew-normal (SN), student’s-t ( t ), and skew-t (ST). As shownin Table 1, the distribution with lowest AIC and BIC is the ST, which will be used in the analysis hereafter. It is worthnoting that the fitted ST distribution has ν = 1 . (Table 2), evidencing the need for heavy tail models.Table 1: Selection criteria for fitting SMSN-NLME models. Bold values indicate the smallest value from each crite-rion. Distribution Loglik AIC BICN -2824.9 5665.7 5704.5SN -2824.4 5668.8 5717.3 t -2360.0 4738.1 4781.7ST -2343.3 4708.6 4762.0 To obtain long term evolution estimates, we computed the 300-days-ahead prediction for all countries considered inthis study. Figure 2 presents the fitted model, along with the prediction, the real data and confidence intervals informa-tion. To preserve the individual properties observed for each country, the random error was generated conditioned onthe mixing scale factor (cid:98) u i = E { U i | (cid:98) θ , y i } , i = 1 , . . . , , which is estimated as a byproduct of the EM algorithm andis responsible to preserve the observed characteristics of the series for each country. Additionally, the 95% confidenceintervals were obtained as described in Subsection 3.5, where M = 600 samples were generated, from which themodel estimate or prediction for samples resulted in numerical error and/or non-convergence. For the remain-ing samples we performed a 15% trim of the series to remove those noise series that were not able to replicate thecharacteristics of the data. To identify such series we used as metric the mean square error (MSE) between the fittedvalues and observed data and removed those such that the MSE have the highest values. This procedure was performedto prevent convergence problems to affect the interval estimates, resulting in a final of samples.7 PREPRINT - A
UGUST
4, 2020Table 2: ST-NMLE - ML estimates for parameters, where F is such that FF = D .Parameter Estimate Parameter Estimate β F β -0.598 F -0.097 β -3.127 F β λ -39.985 σ λ ν llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll Peak: Apr 12 Peak 95% CI: (Apr 05, Apr 21) lllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
Peak: Apr 22 Peak 95% CI: (Apr 14, May 02) lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
Peak: Jun 16 Peak 95% CI: (May 29, Jul 27) lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
Peak: Apr 01 Peak 95% CI: (Mar 27, Apr 08) lllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
Peak: Jun 08 Peak 95% CI: (May 24, Jun 27) lllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllll
Peak: Aug 17 Peak 95% CI: (Jun 10, Apr 20) lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
Peak: Apr 16 Peak 95% CI: (Apr 09, Apr 26) lllllllllllllllllll llllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllll
Peak: Jul 15 Peak 95% CI: (Jun 13, Oct 05) lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
Peak: Aug 10 Peak 95% CI: (Jun 09, Jan 29)
Peru Chile ColombiaUS Brazil MexicoBelgium Italy United KingdomApr 2020 Jul 2020 Oct 2020 Jan 2021 Apr 2021 Apr 2020 Jul 2020 Oct 2020 Jan 2021 Apr 2021 Apr 2020 Jul 2020 Oct 2020 Jan 2021 Apr 202103006009001200030060090005010015020002505007500500100015000300600900120001002003004005000100020000100200300 date dea t h s Figure 2: Fitted and predicted curve for the ST-NLME model (blue line), along with real data (black), and bootstrap95% interval estimates for the curve (shaded blue area) and for the peak (shaded grey area), for each country.From Figure 2, we can see that for countries in more advanced stage of the evolution of COVID-19, the confidenceintervals are narrow. On the other hand, for countries in early stage such as Chile and Colombia, the uncertaintyregarding the disease evolution is much bigger, which is reflected by the wider confidence intervals and very vagueinformation about the peak.Table 3 reports predictions for the total number of deaths for each country and up to , , , and -days-ahead ofthe date from the latest observation considered (2020-06-24), using the fitted ST-NLME model (our selected model),along with its 95% confidence interval, obtained from the bootstrap results. As can be seen, based on the observeddata, the European countries and the US have the deaths by COVID-19 practically controlled, since the death estimatesare stable in all future prediction times. Brazil, Mexico and Peru are around its peak (Figure 3). Their prediction mildlyincrease with time and they present a little more imprecision as observed by their 95% confidence interval estimates ifcompared to the controlled countries. Finally, Chile and Colombia seem to be in the increasing phase. The predictednumber of deaths changes drastically between the time intervals with a wide uncertainty.Table 4 shows the total estimated number of deaths at the end of the pandemic and the estimated peak date by eachcountry. This results corroborates and reinforce the analysis presented in the previous paragraph. Moreover, Table 4presents the estimates of the parameters of the logistic dynamic of the model. As it can be seen, ˆ α (cid:29) showinga large right skewness for the pandemic dynamics. In other words, the increase phase occurs much faster than thedecrease phase as observed in many places. This is a clear effect of borrowing strength from curves in different stages.For example, for the curves from the Latin American countries, where observations are mainly on the left side of the8 PREPRINT - A
UGUST
4, 2020Table 3: Prediction for , , , and -days-ahead of the total number of deaths using the fitted ST-NLME model.Values inside parenthesis are the 95% confidence intervals. Country Total expected by 2020-07-25 Total expected by 2020-08-24Belgium 10 296 (9 675; 11 206) 10 301 (9 678; 11 212)Italy 34 100 (32 779; 35 204) 34 128 (32 798; 35 230)United Kingdom 42 181 (39 420; 44 465) 42 279 (39 495; 44 562)US 125 165 (117 506; 128 895) 126 340 (118 297; 129 911)Brazil 75 275 (70 409; 79 020) 86 786 (78 978; 91 442)Mexico 47 037 (42 284; 53 337) 68 476 (56 279; 87 550)Peru 12 664 (11 795; 14 170) 15 384 (13 937; 18 089)Chile 12 386 (9 477; 16 034) 21 640 (13 024; 35 669)Colombia 5 597 (4 306; 6 449) 9 359 (5 852; 12 001)Country Total expected by 2020-09-23 Total expected by 2020-11-22Belgium 10 302 (9 679; 11 213) 10 302 (9 679; 11 213)Italy 34 132 (32 800; 35 234) 34 133 (32 801; 35 235)United Kingdom 42 295 (39 506; 44 577) 42 299 (39 509; 44 580)US 126 627 (118 467; 130 158) 126 713 (118 512; 130 230)Brazil 91 894 (82 288; 96 861) 94 889 (83 967; 100 163)Mexico 83 630 (64 215; 113 683) 98 263 (69 998; 143 861)Peru 16 761 (14 942; 20 432) 17 692 (15 551; 22 243)Chile 30 517 (15 382; 64 188) 42 754 (17 081; 134 072)Colombia 12 782 (6 944; 18 186) 17 242 (7 783; 29 461) peak, estimation of the skewness parameter is unstable due to the lack of information after the peak. Although insimilar scale, we can see some variation between α i and α i , i = 1 , . . . , , which are important to capture the uniquecharacteristic of each time series and provide a good fit and meaningful predictions.Table 4: ST-NMLE - fitted parameters for the generalized logistic curve for the mean. The total estimated numberof death (Tot. Est. Death) at the end of the pandemic and the estimated date of the peak (Est. Peak Date) are alsopresented. Country α α i α i α Tot. Est. Death Est. Peak DateBelgium 78,771,346 1.619 0.073 18.55 10,304 2020-04-12Italy 1.518 0.061 341,37 2020-04-01United Kingdom 1.501 0.059 42,303 2020-04-16US 1.414 0.047 126,726 2020-04-22Brazil 1.436 0.031 95,476 2020-06-08Mexico 1.429 0.022 104,497 2020-07-15Peru 1.572 0.028 17,922 2020-06-16Chile 1.484 0.017 51,891 2020-08-17Colombia 1.561 0.017 20,312 2020-08-10
This article proposes a robust modeling of COVID-19 deaths based on a NLME model, where the Gaussian distributionof the random terms are replaced by the SMSN class of distributions. We believe that the proposed methods mayhave a significant impact on COVID-19 research because, in the presence of skewness and heavy tails in the data,appropriate statistical inference for COVID-19 research can lead to more accurate and reliable clinical decisions,in addition, our model can be useful to guide some policies that need to be taken by the selected Latin Americangovernments in order to overcome the COVID-19 pandemic. To the best of our knowledge, this is the first attemptin working on such general distributional structure related to COVID-19 deaths data. Our proposed method is quitegeneral and is not limited to the analysis of the selected countries and reported deaths, thus we have made the R codesavailable for download from Github ( https://github.com/fernandalschumacher/NLMECOVID19 ),9 PREPRINT - A
UGUST
4, 2020which will encourage other researchers to use NLME models and the SMSN class of distributions in their studies ofother characteristics of COVID-19 data.Understanding the dynamics of the pandemic and being able to predict its future behavior is of extreme importance.While many countries and consortia try to create an effective vaccine for the disease, the best current alternative isto flatten the contamination curve to guarantee that the health systems are able to provide the necessary care for thepopulation without being overwhelmed. With the use of countries that have controlled the pandemic number of deathswe strongly believe that our methodology can provide more reliable and meaningful prediction that allow policiesmakers in Latin America to make effective decisions.Extension of the presented methodology to the number of cases can also be performed, however, because of the largesubnotification in the data and due to the different testing capability among countries, this approach is challenging.A possible alternative to overcome this fact is to incorporate into modeling time changing covariates that adequatelycapture the subnotification dynamics of each country. Another interesting future development and work is to model thecases and deaths by COVID-19 jointly since these series are necessarily correlated. Moreover, the WHO has warnedthat the coronavirus may never go away and concerns are growing about a second (perhaps a third) wave of infections.Thus, a natural generalization of our method is to consider a finite mixture of SMSN-NLME models (Lachos et al.,2017; Zeller et al., 2019). An in-depth investigation of such extensions are beyond the scope of the present paper, butcertainly an interesting topic for (near) future research.
Acknowledgments
This paper was written while Marcos O. Prates was a visiting professor in the Department of Statistics at the Universityof Connecticut (UConn). In addition to the support of UConn, the professor would like also to thank the CovidLP groupfor the discussion about the topic and the Conselho Nacional de Desenvolvimento Cientfico e Tecnolgico (CNPq) forpartial financial support. Fernanda L. Schumacher acknowledges the partial support of Coordenao de Aperfeioamentode Pessoal de Nvel Superior - Brasil (CAPES) - Finance Code 001, and by Conselho Nacional de DesenvolvimentoCientfico e Tecnolgico - Brasil (CNPq).
References
Amaro, J.E., Dudouet, J., Orce, J.N., 2020. Global analysis of the covid-19 pandemic using simple epidemiologicalmodels. arXiv:2005.06742 .Branco, M.D., Dey, D.K., 2001. A general class of multivariate skew-elliptical distributions. Journal of MultivariateAnalysis 79, 99–113.Cleveland, W.S., 1979. Robust locally weighted regression and smoothing scatterplots. Journal of the AmericanStatistical Association 74, 829–836.CovidLPTeam, 2020. CovidLP: short and long term prediction for COVID-19. Departamento de Estat´ıstica. UFMG,Brazil. URL: http://est.ufmg.br/covidlp/home/en/ .De Castro, C.A., 2020. Sir model for COVID-19 calibrated with existing data and projected for colombia. arXiv:2003.11230 .Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journalof the Royal Statistical Society, Series B, 39, 1–38.Efron, B., Hastie, T., 2016. Computer age statistical inference. Cambridge University Press, Cambridge.JF Salvando Todos Team, 2020. Plataforma Estatstica JF para COVID-19. Departamento de Estat´ıstica. UFJF, Brazil.URL: http://jfsalvandotodos.ufjf.br .Lachos, V.H., Ghosh, P., Arellano-Valle, R.B., 2010. Likelihood based inference for skew–normal independent linearmixed models. Statistica Sinica 20, 303–322.Lachos, V.H., Moreno, E.J.L., Chen, K., Cabral, C.R.B., 2017. Finite mixture modeling of censored data using themultivariate student-t distribution. Journal of Multivariate Analysis 159, 151–167.Lange, K.L., Sinsheimer, J.S., 1993. Normal/independent distributions and their applications in robust regression. J.Comput. Graph. Stat 2, 175–198.Lindstrom, M., Bates, D., 1990. Nonlinear mixed-effects models for repeated-measures data. Biometrics 46, 673–687.Maugeri, A., Barchitta, M., Battiato, S., Agodi, A., 2020. Estimation of unreported novel coronavirus (sars-cov-2) infections from reported deaths: A susceptible–exposed–infectious–recovered–dead model. Journal of ClinicalMedicine 9, 1350. 10
PREPRINT - A
UGUST
4, 2020Mellan, T.A., Hoeltgebaum, H.H., Mishraet, S., al., 08-05-2020. Estimating covid-19 cases and reproduction numberin brazil. Imperial College London doi: https://doi.org/10.25561/78872 .R Core Team, 2019. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Com-puting. Vienna, Austria. URL: .Ribeiro, L.C., Bernardes, A.T., 2020. Estimate of underreporting of COVID-19 in Brazil by Acute Respiratory Syn-drome hospitalization reports. Notas Tcnicas Cedeplar-UFMG. Cedeplar, Universidade Federal de Minas Gerais.URL: https://ideas.repec.org/p/cdp/tecnot/tn010.html .Rivera-Rodriguez, C., Urdinola, B.P., 2020. Predicting hospital demand during the covid-19 outbreak in bogota,colombia. medRxiv URL: , doi: .Schumacher, F.L., Dey, D.K., Lachos, V.H., 2020a. Approximate inferences for nonlinear mixed effects models withscale mixture of skew-normal distributions. arXiv:2007.15086 .Schumacher, F.L., Lachos, V.H., Matos, L.A., 2020b. Scale mixtures of skew-normal linear mixed models withwithin-subject serial dependence. arXiv:2002.01040 .Schumacher, F.L., Matos, L.A., Lachos, V.H., 2020c. skewlmm: Scale mixtures of skew-normal linear mixed models.URL: https://CRAN.R-project.org/package=skewlmm . r package version 0.2.0.Torrealba-Rodriguez, O., Conde-Guti´errez, R.A., Hern´andez-Javier, A.L., 2020. Modeling and prediction of covid-19in mexico applying mathematical and computational models. Chaos, Solitons and Fractals 138, 109946.Tsallis, C., Tirnakli, U., 2020. Predicting covid-19 peaks around the world. Frontiers in Physics 8,217. URL: , doi: .Wit, E., Heuvel, E.v.d., Romeijn, J.W., 2012. ”all models are wrong...”: an introduction to model uncertainty. StatisticaNeerlandica 66, 217–236.Worldometers.info, 2020. Covid-19 coronavirus pandemic.