
Elsevier Editorial System(tm) for

Computational Statistics and Data Analysis


Manuscript Draft

Manuscript Number:

Title: Estimating the number of deaths due to COVID-19 in Lima and Peru
during March and April 2020 using ARIMA time series and modeling

Article Type: Research Paper

Section/Category: II. Statistical Methodology for Data Analysis

Keywords: Covid-19, Mortality, Peru, Inference, ARIMA.

Corresponding Author: Professor Eduardo Gonzalo Villarreyes Peña, M.D

Corresponding Author's Institution: UNMSM

First Author: Eduardo Gonzalo Villarreyes Peña, M.D

Order of Authors: Eduardo Gonzalo Villarreyes Peña, M.D; Ana E Luna, PhD;
Andres J Soriano, M.D

Research Data Related to this Submission


--------------------------------------------------
Title: Data for: Estimating the number of deaths due to COVID-19 in Lima
and Peru during March and April 2020 using ARIMA time series and modeling
Repository: Mendeley Data
https://data.mendeley.com/datasets/hzb89bv24d/draft?a=4acb8380-e428-477e-b4cc-87e8582f451d
Cover Letter

B.U. Park
Co-Editors
Computational Statistics & Data Analysis
Seoul National University Department of Statistics, 1 Gwanak-ro, Gwanak-gu, 08826, Seoul,
Korea, Republic of
June 27, 2020

Dear B.U. Park,

I am writing to you because we wish to submit our original research article entitled "Estimating
the number of deaths due to COVID-19 in Lima and Peru during March and April 2020 using
ARIMA time series and modeling" for consideration by Computational Statistics & Data Analysis.
The death toll caused by the COVID-19 pandemic is of utmost importance given the
current situation worldwide. In this paper, we have used statistical analysis of time series
together with ARIMA predictions to estimate the number of deaths due to COVID-19 in Peru
and the city of Lima during March and April 2020.
We truly believe that the findings presented in our paper will attract the attention of statisticians
and researchers who subscribe to Computational Statistics & Data Analysis. Besides, our findings
will also contribute to developing new tools to quantify the underreporting of deaths caused by
the COVID-19 pandemic.
Each of the authors confirms that this manuscript is original and has not been previously
published, nor is it currently under consideration by any other journal. Additionally, all of the
authors have approved the contents of this paper and have agreed to the Computational Statistics
& Data Analysis submission policies.
Each named author has substantially contributed to conducting the underlying research
and drafting this manuscript. Additionally, to the best of our knowledge, the named authors have
no conflict of interest, financial or otherwise.

Thank you for your consideration of this manuscript.


Sincerely,
Eduardo G. Villarreyes Peña
Facultad de Ciencias Físicas UNMSM, Ciudad Universitaria
Lima 1, Perú
eduardo.villarreyes@unmsm.edu.pe
+51917315021

Ana E. Luna Adan


Departamento Académico de Ingeniería, Universidad del Pacífico
Lima 11, Perú.
ae.lunaa@up.edu.pe
+51949401199
*Manuscript

Estimating the number of deaths due to COVID-19 in Lima and Peru during March and April 2020 using ARIMA time series and modeling

Eduardo Villarreyes a, c, Ana Luna b, Andrés Soriano c

a Faculty of Physics, UNMSM, University City, Lima 1, Perú
b Academic Department of Engineering, Universidad del Pacífico, Lima 11, Perú
c Faculty of Mathematical Sciences, UNMSM, University City, Lima 1, Perú

SARS-CoV-2, the virus that causes COVID-19, belongs to the coronaviruses, a large family of viruses
that cause respiratory disease and complications in humans. The pandemic is currently studied through
the rates of infection and mortality in different countries, and its fast spread has caught the attention
of researchers worldwide. By using inference modeling techniques and statistical analysis of time series
together with ARIMA predictions, it is possible to estimate deaths from COVID-19. Our investigation
covers the city of Lima and Peru as a whole, with a detailed analysis of March and April 2020.
Compared with the death toll reported by the Ministry of Health (MINSA), our estimate of the number
of deaths due to COVID-19 is approximately 185.9% higher.

Keywords: Covid-19, Mortality, Peru, Inference, ARIMA.


Corresponding author.
E-mail addresses: eduardo.villarreyes@unmsm.edu.pe (E. Villarreyes),
ae.lunaa@up.edu.pe (A. Luna), jose.soriano@unmsm.edu.pe (A. Soriano).

1. Introduction

The first case of coronavirus was reported in Wuhan, China, in December 2019. By February 26th, 2020 the disease had spread to countries such as Italy, Spain, Iran, Korea, and the USA, and it was subsequently characterized as a pandemic. In Peru, the first coronavirus case was reported on March 6th, 2020 [1, 2]. After the government published Supreme Decree N° 044-2020-PCM in the El Peruano newspaper on March 15th, 2020, the president declared a nationwide state of emergency to rigorously reduce the spread of the virus. As of May 1st, 2020 at 0:00 a.m., Peru had 40,459 infected people and a total of 1,124 deaths [3].

Several studies show and analyze the differences between the official figures reported by governments and the real figures within the field of public health [4, 5, 6, 7, 8, 9]. For the current pandemic, this difference has been reported by different newspapers, not only in Peru but also in 25 other countries, where at least 87,000 more people have died during the coronavirus pandemic than the official figures indicate [10]. In Peru, the Minister of Health has acknowledged the existence of underreporting in all diseases, not only in the epidemic [11].

The difference between the excess of deaths relative to the average of previous years and the number of deaths officially attributed to COVID-19 reaches, in some cases, 80% between March and April 2020 [11, 12]. Therefore, the official data give an incomplete picture.

An alternative to reduce this difference is the use of an Autoregressive Integrated Moving Average (ARIMA) model, presented by Box and Jenkins in 1976. Because it adequately models the behavior of health events, its use in the area of public health is growing internationally [13]. In Ref. [14] several models are studied and compared. The author concludes that ARIMA
modeling is the best univariate model to predict the number of infant deaths caused by Acute Respiratory Infections (ARI). The effectiveness of ARIMA modeling was also reflected in a study whose aim was to predict cancer mortality in Spain [15]. Therefore, a correct implementation of the model allows adequate inferences to be made about unknown or unexplored phenomena in the field of biomedical science [16]. The authors of [17] highlight the predictive performance and the certainty over the prediction periods of ARIMA models with a seasonal component, which can be used as a management tool for different types of medical consultations.

In this study, we propose the use of official data obtained from the National Death Registry Information System (SINADEF) regarding deaths in Peru during the last three years [18] and the official information on deaths due to coronavirus from the Ministry of Health (MINSA) during March and April 2020 [3, 19]. We combine the techniques of time series analysis and ARIMA modeling.

Thus far, none of the research studies in Peru have used this set of models to predict the death toll due to COVID-19. This research aims to present and validate a rigorous method to estimate the number of deaths in Lima and Peru as a consequence of the pandemic.

2. Number of deaths in Lima and throughout Peru

Based on the information from the National Death Registry Information System (SINADEF), we have obtained the number of deaths in Lima and throughout Peru in the last three years [18]. From these data, we have made several time series charts of the deaths that occurred between 2017 and 2020. In Fig. 1 and Fig. 2, we observe a notable increase in the number of deaths during April 2020. The trend in the city of Lima and in Peru is the same.

Fig. 1. Deaths that occurred in Peru from January 2017 to April 2020. An atypical upward trend is observed in April 2020.

Fig. 2. Deaths in Lima from January 2017 to April 2020. The growing trend of deaths in April 2020 is very marked.

For the same atypical period, we compared the deaths reported in Lima and throughout Peru with the arithmetic mean obtained from previous years. During March 2020, the difference reached 222 deaths in Peru and 1,006 deaths for the particular case of Lima. In April, the difference increased to 3,763 deaths in Peru and 3,202 in Lima. When calculating the standard deviation and the coefficient of variation (C.V) for the city of Lima, the values obtained were 525.38 and 20.09%, respectively, for March and 1,616.56 and 67.29% for April.

Thus, data that were homogeneous have become heterogeneous, which shows that the arithmetic mean for this last month is not a reliable analysis tool (C.V > 30%).

Likewise, in the case of Peru, the standard deviation and the coefficient of variation are 599.10 and 6.57% for March, rising in April to 1,958 and 23.28%, respectively (Fig. 3 and Fig. 4).
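The comparison against the mean of previous years and the coefficient-of-variation check described in this section can be sketched in Python as follows. This is only an illustration: the monthly counts are hypothetical placeholders rather than the SINADEF values, the function names are ours, and we assume the sample standard deviation is used, which the text does not specify.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Coefficient of variation in percent (sample standard deviation / mean)."""
    return 100.0 * stdev(values) / mean(values)


def excess_over_baseline(current, previous_years):
    """Difference between the current count and the mean of earlier years."""
    return current - mean(previous_years)


# Hypothetical April death counts for 2017, 2018, 2019 and 2020
# (placeholders for illustration, NOT the SINADEF figures).
april_counts = [3000, 3100, 3200, 5600]

cv = coefficient_of_variation(april_counts)
excess = excess_over_baseline(april_counts[-1], april_counts[:-1])

# The text treats a C.V above 30% as a sign of heterogeneous data,
# for which the arithmetic mean is no longer a reliable summary.
is_heterogeneous = cv > 30.0

print(f"C.V = {cv:.2f}%")
print(f"Excess over the 2017-2019 mean = {excess:.0f}")
print(f"Heterogeneous (C.V > 30%): {is_heterogeneous}")
```

A single atypical month is enough to push the coefficient of variation above the threshold, which is why the mean of previous years alone is not used as the reference for April and a time series model is fitted instead.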
Fig. 3. Deaths that occurred in Peru in 2020 compared to the average of the years 2017-2019. There is a marked difference between the average and the deaths registered in April 2020.

Fig. 4. Deaths that occurred in Lima in 2020 compared to the average of the years 2017-2019. There is an appreciable difference from the average between January and April.

Based on the time series chart, the data of the total number of deaths that occurred in Lima and Peru from the beginning of January 2017 to April 2020 show the existence of stationarity. In Fig. 5, we observe a non-seasonal behavior of deaths in Lima and Peru, a very marked increase of deaths in April, and a similar behavior of both time series. This behavior supports a detailed analysis of deaths using the time series from January 2017 to January 2020. Moreover, we have obtained the predictions for February, March, and April 2020.

Fig. 5. Deaths that occurred in Lima and Peru from January 2017 to April 2020.

3. ARIMA models

Autoregressive Integrated Moving Average (ARIMA) models are a set of very powerful techniques for the analysis of time series, that is, the study of people, observed groups, or varied information about a phenomenon at successive moments in time [20]. This technique allows us to study the conditional relationship of causes between different variables that change over time. It is one of the most widely used tools to make inferences about the future and to predict data. It is applied in different disciplines such as health, econometrics, engineering, etc.

The methodology of ARIMA models consists of time series models that combine autoregressive, differencing, and moving average terms. These start from the assumption of linear relationships, where the current value of the variable of interest is expressed as a linear combination of lagged values of the variable and of current and past values of a Gaussian white noise process [20].

In non-stationary processes, we analyze a type of lack of stationarity in the arithmetic mean which is very frequent in practice. Seasonality makes the arithmetic mean of the observations not constant; however, it evolves predictably according to a cyclical pattern. The most
common case is to incorporate seasonality into the ARIMA modeling in a multiplicative way, which results in a seasonal multiplicative ARIMA model [21].

This model is characterized by the expression shown in equation (1),

$$\Phi_P(B^s)\,\phi_p(B)\,\nabla_s^D\,\nabla^d z_t = \theta_q(B)\,\Theta_Q(B^s)\,a_t \qquad (1)$$

where $z_t$ is the observed series, $B$ is the backshift operator, $s$ is the seasonal period, and:
$\Phi_P(B^s)$: is the seasonal autoregressive operator of order P.
$\phi_p(B)$: is the regular autoregressive operator of order p.
$\nabla_s^D$: represents seasonal differences.
$\nabla^d$: represents regular differences.
$\Theta_Q(B^s)$: is the seasonal moving average operator of order Q.
$\theta_q(B)$: is the regular moving average operator of order q.
$a_t$: is a white noise process.

Equation (1) represents the seasonal series and is written in simplified form as the model
ARIMA (p, d, q) × (P, D, Q),
where:
p: is the order of the regular (non-seasonal) autoregressive part.
d: is the number of regular differences (order of integration of the process).
q: is the order of the regular (non-seasonal) moving average part.
P: is the order of the seasonal autoregressive part.
D: is the number of seasonal differences (order of seasonal integration of the process).
Q: is the order of the seasonal moving average part.

In particular, forecasting with ARIMA models makes use of optimal prediction functions, namely those that minimize the squared prediction errors on average. The prediction of an ARIMA model has a relatively simple structure: the non-stationary operators, that is to say, the differences and the constant, if it exists, determine long-term predictions, while the stationary operators, Autoregressive (AR) and Moving Average (MA), determine the short-term predictions.

The general solution of the final prediction equation of a seasonal process has 3 components:
a. An expected trend term, which is a polynomial of degree d with coefficients that adapt over time if there is no constant in the model, and a polynomial of degree d + 1 if there is a constant, with the coefficient of the highest-order term deterministic and determined by μ, where μ is the arithmetic mean of the stationary series.
b. An expected seasonal component that changes with the initial conditions.
c. A transient short-term prediction term that is determined by the roots of the regular and seasonal AR operators [22, 23, 24].

4. Methodology

First, we use the time series of the number of deaths that occurred between January 2017 and January 2020, and we fit a seasonal ARIMA model (p, d, q)x(P, D, Q) to each series, one for Lima and one for Peru. Then, we use the ARIMA predictions for March and April 2020, which are analyzed together with the data of deaths due to COVID-19 reported by MINSA [3, 19]. We also use the information from February 2020 to determine the goodness of fit of our method along with the respective statistics (stationary r2 and Ljung-Box test). In the following paragraphs, we estimate the number of deaths due to COVID-19 by taking our ARIMA predictions as a reference and subtracting them from the SINADEF data.

The best model for deaths in Peru is an ARIMA (9,2,2)x(0,0,2), where its non-seasonal component has an autoregressive order equal to nine, a differencing order equal to two, and a moving average order equal to two. Its seasonal part has an autoregressive order of zero, its part
of moving averages is equal to two and its differencing is equal to zero. It should be noted that a natural logarithm transformation has been applied and that the twelve-month cyclicality of the time series has also been taken into account in each observation.

In order to validate the time series, Table 1 shows the relevant statistics: the significance level p = 0.135, greater than 0.05, and stationary r2 = 0.704, which indicate the goodness of fit and the degree of reliability of the model. Table 2 shows the expected predictions of deaths in Peru with their respective confidence intervals.

Table 1. Most relevant goodness-of-fit statistics for the ARIMA time series of the death toll in Peru.

Table 2. Predictions of the death toll in Peru using the ARIMA time series.

The ARIMA model closely follows the pattern of the time series of deaths from January 2017 to January 2020 in Peru. We can observe this in Fig. 6.

[Figure not available in the source: Deaths in Peru and the ARIMA time series (9,2,2)x(0,0,2)]

Fig. 6. The death toll in Peru from 2017 to 2020 is shown along with the best ARIMA model.

By the same token, the best model for the number of deaths in the city of Lima is an ARIMA (5,2,8)x(1,0,1) model, where its non-seasonal component has an autoregressive order equal to five, a differencing order equal to two, and a moving average order equal to eight. Its seasonal part has an autoregressive order of one, a differencing order of zero, and a moving average order of one. A natural logarithm transformation has also been applied.

In order to validate the time series, Table 3 shows the relevant statistics: the significance level p = 0.187, greater than 0.05, and stationary r2 = 0.503, which indicate the goodness of fit and the degree of reliability of the model. Table 4 shows the predictions of expected deaths in Lima with their respective confidence intervals.

Table 3. Most relevant goodness-of-fit statistics for the ARIMA time series of the deaths that occurred in Lima.

Table 4. Predictions of the death toll in Lima using the ARIMA time series.

The ARIMA model closely follows the pattern of the time series of deaths that occurred from January 2017 to January 2020 in Lima. We can observe this in Fig. 7.
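The Ljung-Box statistic mentioned in the methodology can be illustrated with a short, dependency-free Python sketch. It computes Q = n(n + 2) Σ ρ_k² / (n − k) over the first h residual autocorrelations. The residuals below are hypothetical placeholders, not the residuals of the fitted models, and the function names are ours; the statistical package used by the authors is not stated in the text.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    avg = sum(series) / n
    denominator = sum((x - avg) ** 2 for x in series)
    numerator = sum(
        (series[t] - avg) * (series[t - lag] - avg) for t in range(lag, n)
    )
    return numerator / denominator


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )


# Hypothetical model residuals on the log scale (placeholders only).
residuals = [0.02, -0.01, 0.03, -0.04, 0.01, 0.00, -0.02, 0.02, -0.01, 0.01]

q_stat = ljung_box_q(residuals, max_lag=3)
print(f"Ljung-Box Q over 3 lags: {q_stat:.3f}")
```

Under the null hypothesis that the residuals are white noise, Q approximately follows a chi-square distribution whose degrees of freedom are the number of lags minus the number of estimated AR and MA parameters. A significance level above 0.05, as reported for both models, means that this hypothesis is not rejected.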
[Figure: Deaths in Lima and ARIMA time series (5,2,8)x(1,0,1)]

Fig. 7. The death toll in Lima from 2017 to 2020 is shown along with the best ARIMA model.

5. Estimating the number of deaths in Lima and Peru

According to the predictions given by the ARIMA time series, a total of 3,573 deaths were expected in Lima for March. SINADEF has reported 3,621 deaths, which gives a difference of 48 deaths. In the same way, for April, 3,561 deaths were expected in Lima according to the prediction of the ARIMA time series. SINADEF has reported 5,604 deaths, which gives a difference of 2,043 deaths. The larger difference in April is directly related to the increase in infected people and, thereby, in deaths due to COVID-19 compared to other months. We can observe this in Fig. 7.

Fig. 7. Death toll registered in Lima, where we observe that there was an increase in April due to the COVID-19 pandemic.

According to the prediction of the ARIMA time series, a total of 9,209 deaths were expected in Peru for March. SINADEF has reported 9,340 deaths; subtracting these numbers gives a difference of 131 deaths. Likewise, for April, 9,096 deaths were expected in Peru according to the prediction of the ARIMA time series. SINADEF has registered 12,178 deaths (the highest number registered over all previous years); subtracting these numbers gives a difference of 3,082 deaths. As in the case of Lima, the increase in deaths for April is directly related to the COVID-19 pandemic. We observe this in Fig. 8.

Fig. 8. Death toll registered in Peru, where we observe that there was an increase in April due to the COVID-19 pandemic.

7. CONCLUSIONS

ARIMA time series modeling techniques are a powerful statistical tool for predicting data and are widely used in various areas of science, health, and economics. In the present study, we have shown the discrepancy between the data published by MINSA and the projections of our ARIMA models, with a confidence level of 95%, for March and April 2020.

We conclude that, up to April 30, 2,091 deaths occurred in Lima due to COVID-19. MINSA has reported 491 deaths, which means that our estimate is 325.9% higher than the official figure.
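The arithmetic behind these estimates can be reproduced with the short Python sketch below, which uses only figures quoted in this paper (SINADEF records, ARIMA predictions, and MINSA reports). The class and function names are ours and purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class MonthlyCount:
    observed: int   # deaths registered by SINADEF
    predicted: int  # deaths expected according to the ARIMA prediction


def excess_deaths(months):
    """Sum of (observed - predicted) over the given months."""
    return sum(m.observed - m.predicted for m in months)


def percent_above_reported(estimated, reported):
    """Relative difference between the estimate and the official count, in percent."""
    return 100.0 * (estimated - reported) / reported


# March and April 2020, as quoted in Section 5.
lima = [MonthlyCount(3621, 3573), MonthlyCount(5604, 3561)]
peru = [MonthlyCount(9340, 9209), MonthlyCount(12178, 9096)]

lima_excess = excess_deaths(lima)  # 48 + 2,043
peru_excess = excess_deaths(peru)  # 131 + 3,082

# Deaths officially reported by MINSA up to April 30.
lima_pct = percent_above_reported(lima_excess, 491)
peru_pct = percent_above_reported(peru_excess, 1124)

print(f"Lima: {lima_excess} estimated deaths, {lima_pct:.1f}% above MINSA")
print(f"Peru: {peru_excess} estimated deaths, {peru_pct:.1f}% above MINSA")
```

The percentages are relative to the MINSA counts, that is, (estimated − reported) / reported, which is why an estimate a little more than four times the official Lima figure appears as an increase of about 326%.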
We also conclude that, as a result of COVID-19, 3,213 deaths occurred in Peru up to April 30. MINSA has reported 1,124 deaths, which means that our estimate is 185.9% higher than the official figure.

The difference between the SINADEF death records and the ARIMA predictions for Lima and Peru is directly and indirectly related to the COVID-19 pandemic. Indirect deaths include, for instance, those of patients with previous chronic conditions such as oncological diseases, diabetes, kidney problems, neurological problems, and many more, whose situation has worsened due to the collapse of our health system during the current pandemic.

For this reason, it is important to carry out research on the quality of the data, as well as on projections about the collapse of the health system, estimations, inferences, and reported deaths with ARIMAX-type, hybrid, or logistic-type models, in order to have new statistical tools for upcoming events.

BIBLIOGRAPHY

[1] Ministerio de Salud (MINSA). (2020). Plan Nacional de Preparación y Respuesta frente al riesgo de introducción del Coronavirus 2019-nCoV. Resolución Ministerial N 039-2020-MINSA. [Online]. Available https://cdn.www.gob.pe/uploads/document/file/505245/resolucion-ministerial-039-2020-MINSA.PDF [Accessed 02 05 2020].
[2] Agencia de noticias EFE. (2020). El Primer caso de coronavirus en Perú está aislado en casa. [Online]. Available https://www.efe.com/efe/america/sociedad/el-primer-caso-de-coronavirus-en-peru-esta-aislado-casa-y-trabaja-latam/20000013-4189708 [Accessed 05 05 2020].
[3] Ministerio de Salud (MINSA). (2020). Situación actual “COVID-19” al 30 de abril del 2020. [Online]. Available https://www.dge.gob.pe/portal/docs/tools/coronavirus/coronavirus300420.pdf [Accessed 01 05 2020].
[4] Koch E., Bravo M., et al. (2012). Sobrestimación del aborto inducido en Colombia y otros países latinoamericanos. [Online]. Available http://handbook.usfx.bo/nueva/vicerrectorado/citas/SALUD_10/Medicina/17.pdf [Accessed 12 05 2020].
[5] Giner L., Guija J. (2014). Número de suicidios en España: diferencias entre los datos del Instituto Nacional de Estadística y los aportados por los Institutos de Medicina Legal. [Online]. Available https://www.elsevier.es/es-revista-revista-psiquiatria-salud-mental-286-pdf-S1888989114000056 [Accessed 05 05 2020].
[6] Ruiz M., Márquez L., Miller T. (2015). La mortalidad materna: ¿por qué difieren las mediciones externas de las cifras de los países? Serie Población y desarrollo. Santiago de Chile. Publicación de las Naciones Unidas.
[7] Velásquez Hurtado J. E., Kusunoki Fuero L., Paredes Quiliche T. G., Hurtado La Rosa R., Rosas Aguirre Á. M., & Vigo Valdez W. E. (2014). Mortalidad neonatal, análisis de registros de vigilancia e historias clínicas del año 2011 en Huánuco y Ucayali, Perú. Revista peruana de medicina experimental y salud pública, 31, 228-236.
[8] McIver D. J., & Brownstein J. S. (2014). Wikipedia usage estimates prevalence of influenza-like illness in the United States in near real-time. PLoS computational biology 10.4.
[9] Eze-Nliam C., et al. (2012). Discrepancies between the medical record and the reports of patients with acute coronary syndrome regarding important aspects of the medical history. BMC health services research 12.1: 78.
[10] The New York Times. (2020). 74,000 Missing Deaths: Tracking the True Toll of the Coronavirus Outbreak. [Online]. Available https://www.nytimes.com/interactive/2020/04/21/world/coronavirus-missing-deaths.html?action=click&module=RelatedLinks&pgtype=Article [Accessed 14 05 2020].
[11] Diario El Comercio. (2020). ¿Exceso de muertes tiene relación directa con el COVID-19? En abril hubo 2 mil más respecto al año pasado. [Online]. Available https://elcomercio.pe/peru/coronavirus-en-peru-la-cifra-de-muertes-durante-la-pandemia-bajo-analisis-noticia/?ref=ecr [Accessed 01 05 2020].
[12] IDL Reporteros. (2020). Los muertos que el Gobierno no cuenta. [Online]. Available https://www.idl-reporteros.pe/los-muertos-que-el-gobierno-no-cuenta/ [Accessed 05 05 2020].
[13] Coutin Marie, G. (2007). Utilización de modelos ARIMA para la vigilancia de enfermedades transmisibles. Revista Cubana de Salud Pública 33.2: 0-0.
[14] Bedoya Luza, S. L. (2018). Modelamiento univariado del número de defunciones infantiles producidas por infecciones respiratorias agudas, a través de la metodología Box-Jenkins, Puno 2008-2016.
[15] Ocaña-Riola, R. (2004). Eficacia del análisis de series temporales para la planificación sanitaria del cáncer en España. Atención Primaria, 34(1), 15-19.
[16] León-Álvarez, A. L., Betancur-Gómez, J. I., Jaimes-Barragán, F., & Grisales-Romero, H. (2016). Clinical and epidemiological rounds. Time series. Iatreia, 29(3), 373-381.
[17] Cárdenas, C., Sovier, C., Pérez, U., & González, C. S. (2014). Consultas de urgencia general y por causa respiratoria en la Red de establecimientos del Sistema Nacional de Servicios de Salud (SNSS): un modelo predictivo en el Servicio de Salud de Chiloé. Revista chilena de enfermedades respiratorias, 30(3), 133-141.
[18] Sistema Informático Nacional de Defunciones (SINADEF). (2020). Tablero de control. [Online]. Available http://www.minsa.gob.pe/reunis/data/defunciones_registradas.asp [Accessed 18 05 2020].
[19] Ministerio de Salud (MINSA). (2020). Situación actual “COVID-19” al 31 de marzo del 2020. [Online]. Available https://www.dge.gob.pe/portal/docs/tools/coronavirus/coronavirus310320.pdf [Accessed 18 05 2020].
[20] Box G. E. P., Jenkins G. M. (1970). Time series analysis: forecasting and control, 1976. New York, USA. ISBN: 0-8162-1104-3.
[21] Peña, D. (2010). Análisis de series temporales. Alianza Editorial S.A. Madrid 2005. p. 179-194.
[22] Sarbjit S., Kulwinder S., Jatinder K. (2020). Development of new hybrid model of discrete wavelet decomposition and autoregressive integrated moving average (ARIMA) models in application to one month forecast the casualties cases of COVID-19. Chaos, Solitons and Fractals 135 (2020) 109866.
[23] Mauricio J. (2007). Introducción al análisis de series temporales. Madrid: Universidad Complutense de Madrid.
[24] Guirao, A. (2020). Entender una epidemia. El coronavirus en España, situación y escenarios. [Online]. Available https://digitum.um.es/digitum/bitstream/10201/88621/1/Entender%20una%20epidemia_Guirao2020.pdf [Accessed 15 05 2020].
Highlights (for review)

Highlights

- A correct implementation of the ARIMA model allows adequate inferences to be made about
unknown or unexplored phenomena in the fields of biomedical science and statistics.

- Up to April 30, 3,213 deaths have occurred in Peru due to COVID-19. MINSA has reported 1,124
deaths, which means that our estimate is 185.9% higher than the official figure.

- So far, no studies have been found in Peru that have used the set of models mentioned
previously to predict the death toll from COVID-19.
Supplementary Material for online publication only: Dead Peru.xlsx
