An exploratory time series analysis of total deaths per month in Brazil since 2015
AA N EXPLORATORY TIME SERIES ANALYSIS OF TOTAL DEATHSPER MONTH IN B RAZIL SINCE
A P
REPRINT
Alexandre Barbosa de Lima
Biomedical EngineeringFaculty of Sciences and TechnologyPontifical Catholic University of São Paulo [email protected]
August 28, 2020 A BSTRACT
In this article, we investigate the historical series of the total number of deaths per month in Brazilsince 2015 using time series analysis techniques, in order to assess whether the COVID-19 pandemiccaused any change in the series’ generating mechanism. The results obtained so far indicate that therewas no statistical significant impact.
Keywords
COVID-19 · Time series analysis · Spectral analysis
Brazil is a country of continental dimensions, with a territorial extension of 8,510,295.914 Km , and an estimatedpopulation of 211,959,316 people, which is distributed in a federation constituted by States and the Federal District(the country has , cities) [1]. Until the time of this writing, the Brazilian Federal Ministry of Health has recorded3,622,861 Sars-CoV-2 case reports and 115,309 deaths [2] caused by the COVID-19 pandemic [3].According to the Coronavirus Resource Center of the Johns Hopkins University (JHU) [4], Brazil is the second countrymost affected by COVID-19 in the world, both in number of cases and in number of deaths.This article aims to investigate the historical series of the total number of deaths using time series analysis techniques.From now on, such series will be called historical series. More specifically, we want to assess the impact of COVID-19on the evolution of the historical series.The analysis was performed using the R software, version 4.0.2 [5]. The developed R code, as well as the database inExcel spreadsheet format, are available for public consultation and auditing on GitHub [6].The database used is that made available online by the Transparency Portal of the Civil Registry Offices of Brazil[7], which consolidates the amount of birth, marriage and death certificates available in Brazil. Online data has beenavailable since January 2015. Brazilian registries are regulated by the National Council of Justice (CNJ), which is apublic institution headquartered in Brasília, Federal District, that aims to improve the work of the Brazilian judicialsystem, especially with regard to administrative and procedural control and transparency [8]. The president of theBrazilian Supreme Court also presides the CNJ.The remainder of the paper is organized as follows. We review basic concepts of time series analysis in Section 2.Section 3 presents the technique used for modeling the historical series. This section also includes an exploratorydata analysis and a spectral analysis. Section 4 summarizes the conclusions and highlights some topics for furtherinvestigation. a r X i v : . [ s t a t . A P ] A ug n exploratory time series analysis of total deaths per month in Brazil since 2015 A P
REPRINT
In broad terms, a time series consists in a set of numbers corresponding to the observation of a certain phenomenon.Figure 1 shows the Nile River mimima for the years 622 AD to 1284 AD [9]. By nature, such numbers are realizationsof random variables. In general, a collection of random variables, { x t } , indexed by t , is referred to as a stochasticprocess [10]. In this paper, t will be discrete and vary over the integers t = 0 , ± , ± , . . . .Figure 1: Nile River mimima series.Box and Jenkis [11] introduced the class of stationary ARMA ( p, q ) (autoregressive moving average) models x t − µ = p (cid:88) j =1 φ j ( x t − j − µ ) + w t − q (cid:88) i =1 θ i w t − j , (1)where µ = E { x t } is the mean of x t , { φ , φ , . . . , φ p } and { θ , θ , . . . , θ q } are parameters of the model, and w t is awide-sense stationary white noise process with zero mean and power σ , i. e., w t ∼ (0 , σ ) . In a more compact form,we have φ ( B ) x (cid:48) t = θ ( B ) w t , (2)where x (cid:48) t = x t − µ , B is the backward shift operator ( B x t = x t − ), φ ( B ) is the autoregressive (AR) operator of order pφ ( B ) = 1 − φ B − φ B − . . . − φ p B p (3)and θ ( B ) denotes the moving average (MA) operator of order qθ ( B ) = 1 − θ B − θ B − . . . − θ q B q . (4)In the rest of this paper, we will assume µ = 0 without loss of generality.The process x t can be viewed as the output of a digital filter (ARMA filter) whose input is w t , with system function H ( z ) = θ ( B ) φ ( B ) , (5) Yearly minimal water levels of the Nile river measured at the Roda gauge near Cairo. In this work, we use the simplified notation x t to denote a discrete-time stochastic process { x t } .
2n exploratory time series analysis of total deaths per month in Brazil since 2015
A P
REPRINT where H ( z ) denotes the z -transform of the impulse response h t of the ARMA filter. An ARMA ( p, q ) process x t is saidto be wide-sense stationary (or non-explosive) if the poles of H ( z ) in (5) lie inside the complex unit circle ( | z | = 1 ),and it is invertible if the zeros of H ( z ) in (5) lie inside the unit circle. The autocorrelation function (ACF) ρ h ofan ARMA ( p, q ) process shows exponentially decay to zero, i. e., at lag h converges rapidly to zero as h → ∞ (shortmemory property) [11].A random process x t is wide sense (or weakly) stationary if its mean is constant [12, p. 298] E [ x t ] = µ x , (6)and if its ACF depends only on the lag τ = t − t : ρ x ( t , t ) = ρ x ( t , t + τ ) = ρ x ( τ ) . (7)In the literature, it is common to use the terms time series and stochastic process interchangeably [10], [13], [14], [15].From now on, we will only use the term time series . The context will indicate to the reader whether it is a process or arealization of a process. The modeling of a time series x t consists on estimating an invertible function h ( . ) , called model of x t , such that x t = h ( . . . , w t − , w t − , w t , w t +1 , w t +2 , . . . ) , (8)in which w t ∼ Independent and Identically Distributed – IID and g ( . . . , x t − , x t − , x t , x t +1 , x t +2 , . . . ) = w t , (9)in which g ( . ) = h − ( . ) . The process w t is the innovation at instant t and represents the new information about theseries that is obtained at instant t .In practice, the adjusted model is causal , i. e., x t = h ( w t , w t − , w t − , . . . ) . (10)The model construction methodology is based on the iterative cycle illustrated by Fig. 2 [11]: (a) a general class of models is considered for the analysis ( specification ); (b) there is the identification of a model, based on statistical criteria; (c) it follows the estimation phase, in which the model’s parameters are obtained. In practice, it is important that themodel is parsimonious ; and (d) at last, there is the diagnostic of the adjusted model by means of a statistical analysis of the residual series w t (is w t compatible with a white noise process?)The process x t of (10) is linear when it corresponds to the convolution of a process w t ∼ IID and a deterministicsequence h t [16][p. 377] x t = h t (cid:63) w t = ∞ (cid:88) k =0 h k w t − k = w t + h w t − + h w t − + . . . = (1 + h B + h B + . . . ) w t = H ( B ) w t (11) We assume that the autocorrelation function is given by ρ h = γ h γ , where γ h corresponds to the autocovariance of x t at lag h . We say that a model is parsimonious when it uses few parameters. The use of an excessive number of parameters is undesirablebecause the uncertainty degree of the statistical inference procedure increases with the number of parameters.
3n exploratory time series analysis of total deaths per month in Brazil since 2015
A P
REPRINT
Figure 2: Box-Jenkins’ iterative cycle.in which the symbol (cid:63) denotes the convolution operation and h = 1 .Eq. (11) is also known as the infinite order moving average (MA ( ∞ ) ) representation [17].As the models in practice are invertible, the model of x t can be rewritten in an infinite order autoregressive (AR( ∞ ))form: x t = ∞ (cid:88) k =1 g k x t − k + w t . (12)An order p AR model satisfies the equation φ ( B ) x t = w t . (13)in which φ ( B ) is an order p polynomial. In practice, the order p of an AR series is unknown and must be empirically specified. In this paper, we use aninformation criterion function [14] as will be explained below.The basic idea of an ARMA model selection criterion is to choose the orders k and l that minimize the quantity P ( k, l ) = ln ˆ σ k,l + ( k + l ) C ( N ) N , (14)in which ˆ σ k,l is a residual variance estimate obtained by adjusting an ARMA( k, l ) model to the N series observations,and C ( N ) is a function of the series size.The quantity ( k + l ) C ( N ) N is called penalty term and it increases when the number of parameters increases, while ˆ σ k,l decreases.Akaike proposed the information criterium [18], [19] AIC ( k, l ) = ln ˆ σ k,l + 2( k + l ) N , (15)known as AIC, in which ˆ σ k,l is the maximum likelihood estimator of σ w for an ARMA( k, l ) model.4n exploratory time series analysis of total deaths per month in Brazil since 2015 A P
REPRINT
Upper bounds K and L for k and l must be specified. Eq. (15) has to be evaluated for all possible ( k, l ) combinationswith ≤ k ≤ K ≤ l ≤ L . In general, K and L are functions of N , for example, K = L = ln N .For the case of AR( p ) models, (15) reduces to AIC ( k ) = ln ˆ σ k + 2 kN , k ≤ K. (16) Having identified the AR model’s order p , we can go to the parameters estimation phase. The methods of moments,Least Squares and Maximum Likelihood may be used [20], [21]. As, in general, the moments estimators are not good[21], statistical packages as R and S-PLUS use some Least Squares or Maximum Likelihood estimator.
If a process which corresponds to the difference of order d = 1 , , . . . of x t y t = (1 − B ) d x t = ∆ d x t (17)is stationary, then y t can be represented by an ARMA( p, q ) model φ ( B ) y t = θ ( B ) w t . (18)In this case, φ ( B )∆ d x t = θ ( B ) w t (19)is an ARIMA( p, d, q ) model and we say that x t is an “integral” of y t [21] because x t = S d y t . (20)The ARIMA( p, d, q ) model H ( z ) = θ ( z ) φ ( z )(1 − z − ) d (21)is marginally stable [22], as it has d roots on the unit circle. Also, x t of (19) is a homogeneous non-stationary process(meaning non-explosive ) or having unit roots [13], [14], [21].Observe that [21][p.139]: (a) d = 1 corresponds to homogeneous non-stationary series with respect to the level (they oscillate around a meanlevel during a certain time and then jump to another temporary level); (b) d = 2 corresponds to homogeneous non-stationary series with respect to the trend (they oscillate along a directionfor a certain time and then change to another temporary direction).The ARIMA model (19) may be represented in three ways: (a) ARMA( p + d, q ) (similar to Eq. (1)) x ( t ) = p + d (cid:88) k =1 ϕ k x ( t − k ) + w ( t ) − q (cid:88) k =1 θ k w ( t − k ); (22) (b) AR( ∞ ) (inverted format), given by (12) or (c) MA( ∞ ), according to (11). 5n exploratory time series analysis of total deaths per month in Brazil since 2015 A P
REPRINT
Consider the model y t ∼ I (1) y t = y t − + x t , (23)in which x t is a stationary process. If we assume the initial condition y , (23) can be rewritten as an integrated sum y t = y + t (cid:88) j =1 x j . (24)The integrated sum (cid:80) tj =1 x j is called stochastic trend and it is denoted by T S t . Observe that T S t = T S t − + x t , (25)in which T S = 0 .If x t ∼ N (0 , σ x ) in (23), then y t is known as random walk .Including a constant in the right side of (23), we have a random walk with drift , y t = θ + y t − + x t . (26)Given the initial condition y , we can write y t = y + θ t + t (cid:88) j =1 x j = T D t + T S t (27)The mean, variance, autocovariance and ACF of y t are given by [21] µ t = y + tθ (28) σ ( t ) = tσ x (29) C k ( t ) = ( t − k ) σ x (30) ρ k ( t ) = t − kt . (31)Observe that ρ k ( t ) ≈ when t >> k and the literature states that the random walk has “ strong memory ” [14].The random walk’s Sample Autocorrelation Function (SACF) decays linearly for large lags. Spectral analysis is a well-established research area [20]. However, the estimation of the power spectrum of a signal isnot a trivial matter. There are two classes of spectral analysis techniques currently in use: parametric (or model-based)and nonparametric analysis. Both methods are used in this work.The fundamental idea of parametric spectral analysis is fairly simple. The parametric approach assumes that thesignal satisfies a generating model, such as an AR ( p ) , with known functional form and then proceed by estimating theparameters in the assumed model [20], [23]. The most widely used form of parametric Power Spectral Density (PSD)estimation uses an AR ( p ) model [20].Let us now consider the nonparametric (or classical) method. Then, the estimation of PSD of a time series x t can bemade using periodogram methods based on the Discrete Fourier Transform (DFT), which can be efficiently calculatedby a Fast Fourier Transform (FFT) Algorithm. 6n exploratory time series analysis of total deaths per month in Brazil since 2015 A P
REPRINT
In the sequence we present the periodogram method.Consider a time series x t with N values or samples, i. e., x t = 0 outside the time interval ≤ t ≤ N − . In somecases of interest, we consider that x t has a size N , even if its actual size is M ≤ N (in such cases the series x t must becompleted with ( N − M ) zeros (zero padding).The equation (32) is the DFT of x t : X [ k ] = N − (cid:80) t =0 x t e − j π kN t , ≤ k ≤ N − , otherwise (32)The PSD of x t can be estimated by calculating the periodogram, given by P x ( f k ) = | X [ k ] | N (33)where X [ k ] denotes the DFT of x t , and f k = 0 , (cid:18) N (cid:19) , . . . , (cid:18) kN (cid:19) , . . . , (cid:18) N − N (cid:19) The periodogram is an asymptotically unbiased spectral estimator of the PSD of a random signal [20], [23]. Its mainproblem lies in its large variance. In other words, the periodogram is inconsistent (i. e., the dispersion of the estimatesis independent of N ). This motivates the use of a “refined periodogram method”, like the Daniell method.It is possible to show that the periodogram values P x ( f k ) are asymptotically uncorrelated random variables [20]. Thuswe may reduce its large variance by weight averaging the periodogram over small intervals centered on the currentfrequency f k . The practical form of the Daniell estimate can be performed using the FFT. This work uses a Daniellkernel for nonparametric PSD estimation. For further details, please refer to [20] and [23]. Figure 3 shows the historical series in Brazil from January 2015 to July 2020 ( samples).Figure 3: Historical series in Brazil from January 2015 to July 2020.The series in the Fig. 3 strongly suggests that we can specify a model of the type [10][p. 58] x t = µ t + y t (34)7n exploratory time series analysis of total deaths per month in Brazil since 2015 A P
REPRINT where x t are the observations, y t is a stationary process, and µ t denotes a linear trend given by the regression model µ t = β + β t (35)in which β and β are the intercept and the slope parameters.Table 1 shows the estimated coeffcientes for (35) and its p -values, which are negligible. The goodness of fit issummarized by the R of the regression [13][p. 169]. Note that the R statistics given by Table 2 indicate that theproposed model for µ t explain approximately . of the total variability of the data. The great value of the F-statisticand the neglibible model p -value show that we can not reject the null hypothesis of linear regression. Thus, we canconsider a linear model for (35) to be statistically significant given the statistical significance level of 0.01. Figure 4shows the time series with the superimposed linear regression model.Table 1: Estimated model for µ t . (cid:99) β p -value of (cid:99) β (cid:99) β p -value of (cid:99) β , . < . e − < . e − Table 2: Adequacy of the regression model for µ t .R-squared F-statistic model p -value . . < . e − Figure 4: Time series with the superimposed linear regression model.To verify that the model (35) is appropriate, it is also necessary to investigate the residuals. This is what is calledresidual analysis.The residuals of (35 ) correspond to the discrepancies between the observed values ( µ ) and the values adjusted ( ˆ µ ) bythe model. The i-th residual is given by ˆ e i = µ i − ˆ µ i . (36)Figure 5 shows the residuals vs fitted model and the Quantile-Quantile plot (QQ-plot) of the residuals, which suggeststhat they are normally distributed.Table 3 shows a residual diagnostics using the Jarque-Bera and Shapiro-Wilks tests [13][p.61] for the null hypothesis ofnormality of the residuals. The null of normality is not rejected using either tests.At this point, we can conclude that: 8n exploratory time series analysis of total deaths per month in Brazil since 2015 A P
REPRINT (a) (b)
Figure 5: (a) and (b): residuals vs fitted model and residuals QQ-plot, respectively.Table 3: residual diagnostics.Jarque-Bera test Shapiro-Wilk testStatistic p -value Statistic p -value . . . . • there is no statistical evidence that COVID-19 affected the deterministic linear trend of the historical series, i.e., on average, the monthly growth in the number of deaths, which is approximately deaths/month, did notchange since the first recorded death in Brazil on March 16, 2020; and • there is no change point in the deterministic linear trend of the historical series. The first step in exploratory analysis is to remove the deterministic trend of Eq.(35) [10][p. 58]. There are twoalternatives: remove the line estimated by the regression or take the first difference in the series. As our goal is to coercethe data to (a possible) stationarity, then differencing may be more appropriate [10][p. 61]. The first two samples of thehistorical series were discarded so that the series corresponding to the first difference has 64 points, that is, points,which facilitates the spectral analysis with FFT.The first difference can be denote as ∆ x t = x t − x t − = r t . (37)Figure 6 shows the series r t (we also demeaned the series) and its SACF.Figure 7 shows the histogram of r t with a superimposed normal distribution and its Q-Q plot, which suggests that r t follows a normal distribution.Figure 8 shows the smoothed periodogram using the Daniell method and the PSD for the estimated model of r t , whichis an AR( ).We used the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test [24] for the null hypothesis that r t is level stationary, i. e.,that the series is I( ). We obtained a p -value of . , which means that one can not reject that r t is level stationary.Finally, but not least, an informal analysis of Fig. 6 suggests that the variance of r t has a change point around t = 22 (2017/jan). However, if this really happened, which we cannot guarantee with the techniques employed in this article,ocurred long before the outbreak of COVID-19 in Brazil. This possible change point motivates a time-frequency domainanalysis using wavelets, as the localized nature of wavelet coefficients allows one to analyze the evolution of the seriesvariance over time [20]. This will be investigated in future work. In this paper, we presented an exploratory time series analysis of the historical series of the total number of deaths permonth in Brazil since 2015. Our preliminary results indicate that:9n exploratory time series analysis of total deaths per month in Brazil since 2015
A P
REPRINT (a) (b)
Figure 6: (a) and (b): first difference time series and its SACF, respectively. (a) (b)
Figure 7: (a) and (b): histogram with a superimposed normal distribution (in red) and its Q-Q plot, respectively. (a) (b)
Figure 8: (a) and (b): periodogram (Daniell Method) and estimated PSD for the AR (11) model, respectively.10n exploratory time series analysis of total deaths per month in Brazil since 2015
A P
REPRINT • there is no statistical evidence that COVID-19 affected the deterministic linear trend of the historical series, i.e., on average, the monthly growth in the number of deaths, which is approximately deaths/month, did notchange since the first recorded death in Brazil on March 16, 2020; • there is no change point in the deterministic linear trend of the historical series; • there is significant statistical evidence that the first difference time series is stationary; and • there is no statistical evidence that COVID-19 provoked a change in the stochastic process that generates thetime series under analysis .These results are thought provoking and not intuitive. COVID-19 has caused many deaths around the world. This isan indisputable fact. However, our results suggest that this disease does not have so far an additive effect on the totalnumber of deaths per month in Brazil. What would be a plausible explanation for this strange result?In any case, further research should be carried out to confirm the results obtained.In future work, we will analyze the historical series using wavelet methods. References
Mémoire sur l’histoire du Nil , ser. Mémoires de l’Institut d’Egypte: Institut d’Egypte.Le Caire : Imprimerie de l’Institut français d’archéologie orientale, 1925. [Online]. Available:https://books.google.com.br/books?id=VP66nQEACAAJ[10] R. H. Shumway and D. S. Stofer,
Time Series Analysis and Its Applications with R Examples , 2nd ed. Springer,2006.[11] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel,
Time Series Analysis: Forecasting and Control , 3rd ed. PrenticeHall, 1994.[12] A. Papoulis,
Probability, Random Variables, and Stochastic Processes , 3rd ed. McGraw-Hill, 1996.[13] E. Zivot and J. Wang,
Modeling Financial Time Series with S-PLUS . Springer, 2003.[14] R. S. Tsay,
Analysis of Financial Time Series , 2nd ed. Hoboken, New Jersey: John Wiley and Sons, 2005.[15] A. B. de Lima and J. R. de Almeida Amazonas,
Internet Teletraffic Modeling and Estimation . Gistrup: RiversPublishers, 2013.[16] G. Samorodnitsky and M. S. Taqqu,
Stable non-Gaussian random processes . London, UK: Chapman & Hall,1994.[17] P. J. Brockwell and R. A. Davis,
Introduction to Time Series and Forecasting . New York: Springer-Verlag, 1996.[18] H. Akaike, “Information theory and an extension of the maximum likelihood principle,” in , B. N. Petrov and F. Csaki, Eds., Akademia Kiado, Budapest, 1973, pp.267–281.[19] ——, “A new look at the statistical model identification,”
IEEE Transactions on Automatic Control , vol. AC-19,pp. 716–723, 1974. The random process is the series’ generating mechanism.
11n exploratory time series analysis of total deaths per month in Brazil since 2015
A P
REPRINT [20] D. B. Percival and A. T. Walden,
Spectral Analysis for Physical Applications . New York: Cambridge, 1993.[21] P. A. Morettin and C. M. C. Toloi,
Análise de Séries Temporais . São Paulo, SP: Edgard Blücher ltda., 2004.[22] J. G. Proakis and D. G. Manolakis,
Digital Signal Processing , 4th ed. Pearson Prentice Hall, 2007.[23] P. Stoica and R. Moses,
Spectral Analysis of Signals . Pearson Prentice Hall, 2005.[24] D. Kwiatkowski, P. C. B. Phillips, P. Schmidt, and Y. Shin, “Testing the null hypothesis of stationarity against thealternative of a unit root,”