Data Mining Approach to Analyze Covid19 Dataset of Brazilian Patients
Josimar Edinson Chire Saire
Institute of Mathematics and Computer Science (ICMC), University of Sao Paulo (USP), Sao Carlos, SP, Brazil
Abstract.
The pandemic caused by the coronavirus disease (covid-19), a name coined by the World Health Organization during the first months of 2020, has reached almost all countries. Governments are adopting different health policies to stop the infection, many research groups are working on patient data to understand the virus, and at the same time scientists are looking for a vaccine to strengthen the immune system against the covid19 virus. Brazil is one of the countries with the most infections: by August 11 it had a total of 3,112,393 cases. The Research Foundation of Sao Paulo State (Fapesp) released a dataset, an innovative collaboration with hospitals (Einstein, Sirio-Libanes), a laboratory (Fleury) and Sao Paulo University, to foster research on this trending topic. The present paper presents an exploratory analysis of the datasets using a Data Mining approach. Several inconsistencies are found, i.e. NaN values, null reference values for analytes, outliers in analyte results, and encoding issues. The results are cleaned datasets for future studies, but about 20% of the data was discarded because of non-numerical values, null values and numbers out of the reference range.
Keywords: data mining, data science, covid-19, coronavirus, brazil, sars-cov2, south america
The outbreak of Coronavirus (Covid19) started with the first cases in December 2019 in Wuhan (China). The first reported case [4] in South America was in Brazil on 26 February 2020, in São Paulo city. The strategy to stop infections in the country was a partial lockdown to limit the propagation of the virus.
On 28 January 2020, the Ministry of Health of Brazil reported a suspected case of Covid19 in Belo Horizonte, Minas Gerais state: a student who had recently returned from China [1], [13]. The same day, two more suspected cases were reported in Porto Alegre and Curitiba [5]. The first confirmed COVID-19 case in Brazil [11] was a 61-year-old man who had returned from Italy. The patient was tested at the Israelita Einstein Hospital in Sao Paulo state. By 14 May [12], more than 200,000 cases had been confirmed; this number doubled during the first days of May.
Until August 11, the numbers for Brazil are: a total of 3,112,393 cases, with 44,255 new cases (+1.4%), and a total of 2,243,124 recovered cases.
Nowadays, many scientists are working on the covid19 coronavirus, but a search for studies conducted in South America finds only a few. A search in IEEE Xplore using the terms coronavirus, covid19 finds one paper with a Brazilian affiliation [18], related to data augmentation for covid19 detection. In a preprint repository for medicine (medRxiv), the terms covid19, coronavirus, data mining return more than 50 papers.
Table 1 presents the top 10 results of the medRxiv query. Four of these papers study Latin American countries, and none of them analyzes the Brazilian context, although four papers have a Brazilian affiliation.

| Reference | Title | Country of Study | Keywords | Affiliation |
|---|---|---|---|---|
| [8] | Covid19 Surveillance in Peru on April using Text Mining | Peru | Natural Language Processing, Text Mining, People behaviour, Coronavirus, Covid-19 | University of Sao Paulo (Brazil), Universidad Privada del Norte (Peru) |
| [9] | Text Mining Approach to Analyze Coronavirus Impact: Mexico City as Case of Study | Mexico | Natural Language Processing, Text Mining, People behaviour, Coronavirus, Covid-19 | University of Sao Paulo (Brazil), Tecnologico Nacional de Mexico / Instituto Tecnologico de Matamoros (Mexico) |
| [6] | How was the Mental Health of Colombian people on March during Pandemics Covid19? | Colombia | Not available | University of Sao Paulo (Brazil) |
| [10] | Mining Twitter Data on COVID-19 for Sentiment analysis and frequent patterns Discovery | Algiers | tweets Analytics, COVID-19, sentiment analysis, frequent patterns, association rules mining | University of Science and Technology Houari Boumediene (Algiers) |
| [7] | Infoveillance based on Social Sensors to Analyze the impact of Covid19 in South American Population | South America (not Brazil) | Not available | University of Sao Paulo (Brazil) |
| [2] | Spread of SARS-CoV-2 Coronavirus likely constrained by climate | Not applicable | Not available | National Museum of Natural Sciences (Spain), University of Évora (Portugal), University of Helsinki (Finland) |
| [3] | The Role of Host Genetic Factors in Coronavirus Susceptibility: Review of Animal and Systematic Review of Human Literature | Not applicable | Coronavirus; COVID-19; Host genetic factors; SARS-CoV-2 | University of Florida College of Veterinary Medicine (USA), National Institutes of Health (USA), Johns Hopkins Bloomberg School of Public Health (USA) |
| [16] | Early epidemiological assessment of the transmission potential and virulence of coronavirus disease 2019 (COVID-19) in Wuhan City: China, January-February, 2020 | China | Not available | University Yoshida (Japan), Kyoto University (Japan), Georgia State University (USA) |
| [14] | Analysis of Epidemic Situation of New Coronavirus Infection at Home and Abroad Based on Rescaled Range (R/S) Method | China | Not available | Sichuan Academy of Social Sciences (China) |
| [19] | State heterogeneity of human mobility and COVID-19 epidemics in the European Union | European Union | Coronavirus 2019, epidemics, geographic, trends, public health intervention | Shanghai Jiao Tong University School of Medicine (China), University at Buffalo (USA), Yale University School of Medicine (USA) |

Table 1.
Ten results of medRxiv query about covid19 papers in South America
Considering the previous evidence, it is necessary to conduct studies with Brazilian data, and the Fapesp initiative is valuable for fostering research on the covid19 topic. The present paper uses a Data Mining approach to perform an exploratory analysis of the dataset of Brazilian patients from Sao Paulo State. The methodology used to explore the data is presented in Section 2, and the experiments and results in Section 3. Conclusions are stated in Section 4, and final recommendations and future work are presented in Sections 5 and 6.
Case counts for Brazil were extracted from the website https://virusncov.com/.
This work follows a methodology inspired by CRISP-DM [17]. Figure 1 presents the flow between the phases of the exploration.
[Figure: flow diagram of the phases Exploring Data, Pre-processing, Analysis and Visualization, with the labels Data Exploration, Cleaning, Filtered Data, Question and Graphics]
Fig. 1.
Methodology
The data exploration step involves checking the file formats and opening the files with a programming language or a tool, reviewing the number of records (rows) in each file, and checking for null values and the kind of each variable or field. For this step, the Python programming language and the pandas package are used to manipulate the data.
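As a minimal sketch of this exploration step, the snippet below counts rows, null values and the kind of each field. It uses only the Python standard library so that it runs without dependencies (the paper itself uses pandas). The sample rows, the `|` delimiter and the helper names `kind` and `profile` are assumptions made for illustration, not the actual files or the author's code.

```python
import csv
import io

# Hypothetical sample in the layout of the exam dictionary (not real data).
SAMPLE = """id_paciente|dt_coleta|de_analito|de_resultado
A1|2020/03/01|Glicose|88
A1|2020/03/01|Leucocitos|
B2|2020/04/12|Covid19|Detectado
"""

def kind(value):
    """Classify one raw cell as 'null', 'number' or 'string'."""
    if value is None or value.strip().lower() in ("", "nan"):
        return "null"
    try:
        float(value.replace(",", "."))  # accept decimal commas
        return "number"
    except ValueError:
        return "string"

def profile(text, delimiter="|"):
    """Return the row count and, per column, how many cells of each kind."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    summary = {}
    for row in rows:
        for column, value in row.items():
            counts = summary.setdefault(
                column, {"null": 0, "number": 0, "string": 0}
            )
            counts[kind(value)] += 1
    return len(rows), summary

n_rows, summary = profile(SAMPLE)
print(n_rows)
print(summary["de_resultado"])
```

A column such as `de_resultado` that mixes all three kinds is exactly the case the later cleaning steps have to handle.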
The pre-processing step concerns how to deal with the data before generating graphics for analysis.
– If a specific variable must be numerical but contains string values, those values are discarded.
– If null values are found, a discarding process must be considered.
– If the reference range for an exam or analyte is null, the analysis is not possible.
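The three rules above can be sketched as a small filter. This is an illustration under stated assumptions, not the author's implementation: the record layout (dicts keyed by `de_resultado` and `de_valor_referencia`, written with underscores), the treatment of the text `nan` as null, and the helper names `to_number` and `preprocess` are all hypothetical.

```python
def to_number(value):
    """Return value as a float, or None when it is null or not numerical."""
    if value is None:
        return None
    text = str(value).strip().replace(",", ".")
    if text == "" or text.lower() == "nan":
        return None
    try:
        return float(text)
    except ValueError:
        return None

def preprocess(records):
    """Apply the three discarding rules to a list of exam records (dicts)."""
    kept = []
    for record in records:
        result = to_number(record.get("de_resultado"))
        reference = record.get("de_valor_referencia")
        if result is None:
            continue  # rules 1 and 2: string or null result
        if reference is None or str(reference).strip().lower() in ("", "nan"):
            continue  # rule 3: no reference range, analysis not possible
        kept.append({**record, "de_resultado": result})
    return kept

# Hypothetical records, not taken from the datasets.
records = [
    {"de_analito": "Glicose", "de_resultado": "88", "de_valor_referencia": "75 to 99"},
    {"de_analito": "Glicose", "de_resultado": "Detectado", "de_valor_referencia": "75 to 99"},
    {"de_analito": "Glicose", "de_resultado": "", "de_valor_referencia": "75 to 99"},
    {"de_analito": "Plaquetas", "de_resultado": "210", "de_valor_referencia": "nan"},
]
clean = preprocess(records)
print(len(clean), clean[0]["de_resultado"])
```

With pandas, the numeric conversion would typically be done with `pd.to_numeric(..., errors="coerce")` followed by `dropna`.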
In the analysis step, clean data makes it possible to answer questions related to age distribution, sex distribution, and the distribution of results in order to detect anomalies or outliers. Each question can require a specific kind of graphic to support the analysis.
For visualization, when a distribution has only a few classes, a pie chart is useful to check proportions (subsections 3.3, 3.8). For the age distribution, a bar plot shows how the values are distributed (see subsections 3.4, 3.5, 3.6). The analysis of dozens of values can be supported by boxplots (subsections 3.9, 3.10).
The release of the datasets is the result of a collaboration between the Research Foundation (FAPESP) [15], Fleury Institute, Israelita Albert Einstein Hospital, Sirio-Libanes Hospital and the University of Sao Paulo. The goal is to contribute to and promote research related to Covid19. The datasets share the data dictionaries of Patients (see Tab. 2) and Tests (Tab. 3).
Table 2.
Data Dictionary of Patient Dataset - Einstein, Fleury, Sirio-Libanes Hospital

| Variable | Description | Format | Content |
|---|---|---|---|
| ID_PACIENTE | Unique identification of patient | Alphanumeric characters | String, patient key |
| IC_SEXO | Gender | Alphanumeric character | F - Feminino (Female); M - Masculino (Male) |
| AA_NASCIMENTO | Birth date (year) | Number | Example: 1959. (*) AAAA for people born in or before 1930 |
| CD_PAIS | Country of residence | Alphanumeric | Example: BR |
| CD_UF | Federal state identifier | Alphanumeric characters | AC - Acre, AL - Alagoas, AM - Amazonas, AP - Amapá, BA - Bahia, CE - Ceará, DF - Distrito Federal, ES - Espírito Santo, GO - Goiás, MA - Maranhão, MG - Minas Gerais, MS - Mato Grosso do Sul, MT - Mato Grosso, PA - Pará, PB - Paraíba, PE - Pernambuco, PI - Piauí, PR - Paraná, RJ - Rio de Janeiro, RN - Rio Grande do Norte, RO - Rondônia, RR - Roraima, RS - Rio Grande do Sul, SC - Santa Catarina, SE - Sergipe, SP - São Paulo, TO - Tocantins |
| CD_MUNICIPIO | City of residence | Alphanumeric | Example: SAO PAULO, CAMPINAS, SANTO ANDRE. MMMM for the lowest occurrences |
| CD_CEP | Postal code | Number | (**) First five digits of the postal code. (**) CCCC for a low number of occurrences |
Table 3.
Data Dictionary of Tests - Einstein, Fleury, Sirio-Libanes Hospital

| Variable Name | Description | Format | Content |
|---|---|---|---|
| ID_PACIENTE | Unique identification of patient | Alphanumeric character | String, patient key |
| DT_COLETA | Exam collection date | Date (yyyy/MM/dd) | Date |
| DE_ORIGEM | Origin of patient | Alphanumeric character (4) | HOSP: exam made in a hospital |
| DE_EXAME | Description of exam | Alphanumeric | Example: HEMOGRAMA (blood count) |
| DE_ANALITO | Analyte description | Alphanumeric | Example: Eritrócitos (Erythrocytes), Leucócitos (Leukocytes), Glicose (Glucose) |
| DE_RESULTADO | Result of exam, related to DE_ANALITO | Alphanumeric | If DE_ANALITO requires numerical values: integer or float. If DE_ANALITO requires qualitative values: string with a restricted domain |
| CD_UNIDADE | Unit of measurement | Alphanumeric | String. Example: g/dL (grams per deciliter) |
| DE_VALOR_REFERENCIA | Reference values for DE_RESULTADO | Alphanumeric | String: reference value for DE_ANALITO in the population. MinValue, MaxValue. Não Detectado (Not detected) / Detectado (Detected). Example for glucose: 75 to 99. Example for progesterone: until 89 |
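Because the reference value is stored as free text, a numerical range has to be parsed out of it before any filtering. The sketch below handles only the two forms shown in the dictionary examples ("75 to 99" and "until 89"). The actual files may use other wording (for instance Portuguese), so the patterns and the helper names `parse_reference` and `in_range` are assumptions for illustration.

```python
import re

NUMBER = r"(\d+(?:[.,]\d+)?)"
RANGE = re.compile(rf"^\s*{NUMBER}\s+to\s+{NUMBER}\s*$", re.IGNORECASE)
UPPER = re.compile(rf"^\s*until\s+{NUMBER}\s*$", re.IGNORECASE)

def parse_reference(text):
    """Return (low, high) for a reference string, or None if not recognized.

    An open lower bound ("until 89") is returned as None.
    """
    if text is None:
        return None
    match = RANGE.match(text)
    if match:
        low, high = (float(g.replace(",", ".")) for g in match.groups())
        return (low, high)
    match = UPPER.match(text)
    if match:
        return (None, float(match.group(1).replace(",", ".")))
    return None  # qualitative or unknown format

def in_range(value, bounds):
    """Check a numerical result against parsed (low, high) bounds."""
    low, high = bounds
    return (low is None or value >= low) and value <= high

print(parse_reference("75 to 99"))
print(parse_reference("until 89"))
print(parse_reference("Detectado"))
```

Returning `None` for unrecognized strings keeps qualitative references such as Detectado out of the numerical filter instead of guessing at bounds.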
The sizes of the datasets are presented in Table 4 for the three data sources. Only SL Hospital provided a dataset about patient outcomes.
Table 4.
Features of Dataset

| | Einstein Hospital | Fleury | SL Hospital |
|---|---|---|---|
| Patient (size) | | | |
| Test (size) | | | |
| Test (Dates) | | | |
| Outcome (size) | - | - | 9,634 |
| Outcome (Dates) | - | - | 2020-02-26 to 2020-06-29 |
This subsection presents some graphics that describe the data and support later analysis, including graphics related to distributions, i.e. bar plots and boxplots.
Description of datasets
Figure 2 presents the count, the number of unique values and the most frequent value (top) for each field. Column names were converted to lowercase so that field names are uniform.
Fig. 2. (a) Einstein, (b) Fleury and (c) SL Datasets Description
– Figure 2.b shows a different number of id_paciente values in the patient dataset and the exam dataset: 129,596 (patient) versus 129,595 (exam).
– The Einstein and SL Hospital datasets (cd_pais) include people living in countries other than Brazil.
– The most frequent age of patients is 38 (Einstein, Fleury) and 34 (SL).
– Female patients are more numerous in Einstein and Fleury.
– The most frequent cd_uf and cd_municipio are Sao Paulo state and city, and CCCC is the most common postal code, so most of these places do not have a meaningful number of occurrences.
– Einstein and Fleury each have a single de_origem value: Hosp and Lab, respectively. SL Hospital has 56 different values.
– The exam hemograma (blood count) is the most frequent in the datasets, and the most frequent de_analito values in Einstein and Fleury are related to Covid19.
– Einstein has the lowest number of distinct de_exame (61) and de_analito (127) values. Fleury has the highest: de_exame (722), de_analito (978). SL has de_exame (478), de_analito (652). The number of de_valor_referencia values is related to these counts.
– SL Hospital presents NaN (Not a Number) values, so NaN values can be found in the datasets.
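The first observation in the list, a patient identifier that appears in one file but not in the other, can be checked with a set comparison. The following is a hedged sketch: the rows are invented, the key is written `id_paciente` with an underscore, and `compare_ids` is a hypothetical helper rather than part of the original analysis.

```python
def compare_ids(patient_rows, exam_rows, key="id_paciente"):
    """Compare the identifiers present in the patient and exam datasets."""
    patient_ids = {row[key] for row in patient_rows}
    exam_ids = {row[key] for row in exam_rows}
    return {
        "patients": len(patient_ids),
        "with_exams": len(exam_ids),
        "patients_without_exams": sorted(patient_ids - exam_ids),
        "exams_without_patient": sorted(exam_ids - patient_ids),
    }

# Hypothetical rows, not taken from the datasets.
patients = [{"id_paciente": "A1"}, {"id_paciente": "B2"}, {"id_paciente": "C3"}]
exams = [{"id_paciente": "A1"}, {"id_paciente": "A1"}, {"id_paciente": "B2"}]

report = compare_ids(patients, exams)
print(report)
```

Listing the unmatched identifiers, rather than only comparing the two counts, shows which records would be lost when the files are joined.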
The female population is slightly larger than the male population in Einstein and Fleury, but in SL the male population is larger by 0.05% (29 people), see Fig. 3.
Fig. 3.
Sex Distribution (Einstein, Fleury and SL)
The Einstein and Fleury datasets include younger patients, with ages ranging from the 0 to 14 group up to 89, whereas SL Hospital only has patients from 14 up to 86. These graphics are presented in Fig. 4.
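The patient dictionary stores a birth year rather than an age, so an age distribution has to be derived. The sketch below is one possible way to do it, under assumptions not stated in the paper: age is taken as 2020 minus the birth year, masked years (AAAA) are skipped, and groups are 15 years wide. The function name `age_groups` is hypothetical.

```python
from collections import Counter

def age_groups(birth_years, reference_year=2020, width=15):
    """Count patients per age group, skipping masked or missing years."""
    counts = Counter()
    for raw in birth_years:
        if not str(raw).isdigit():
            continue  # masked (AAAA) or missing year
        age = reference_year - int(raw)
        low = (age // width) * width
        counts[f"{low}-{low + width - 1}"] += 1
    return dict(counts)

# Hypothetical birth years, not taken from the datasets.
groups = age_groups(["1982", "1959", "AAAA", "2010", "1986"])
print(groups)
```

The resulting counts are what a bar plot of the age distribution would display.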
Fig. 4.
Age Distribution (Einstein, Fleury, SL)
Fig. 5 presents the number of exams collected per day and month. Einstein shows an increasing number from January to June, while Fleury shows a decrease from January to April but a peak in May and June. SL Hospital shows an increase from February to June.
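Counting exams per month from the collection date can be sketched as follows. The date format follows the yyyy/MM/dd pattern given in the data dictionary. The sample dates and the helper name `exams_per_month` are assumptions, and unparseable dates are simply skipped here, which may differ from how the original analysis treated them.

```python
from collections import Counter
from datetime import datetime

def exams_per_month(dates, fmt="%Y/%m/%d"):
    """Count collection dates per calendar month, skipping invalid entries."""
    counts = Counter()
    for text in dates:
        try:
            day = datetime.strptime(text, fmt)
        except (TypeError, ValueError):
            continue  # null or badly formatted date
        counts[day.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))

# Hypothetical collection dates, not taken from the datasets.
monthly = exams_per_month(["2020/02/26", "2020/02/27", "2020/03/01", "bad", None])
print(monthly)
```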
Fig. 5.
Date Distribution (Einstein, Fleury, SL)
To answer which exams were the most frequent during each month of each dataset, Fig. 6 presents the 20 most frequent exams.
– All three datasets have the blood count exam at the top of each month.
– Only Fleury has exams related to covid19 detection in the top 5 in April, May and June.
– There are many kinds of exams related to covid19 in each hospital, i.e. PCR, Sorologia SARS-Cov-2/Covid19 (Einstein). Fleury has NOVO Coronavirus 2019, Covid19 Anticorpos IgG, IgM, IgA and more. SL Hospital has Covid-19 PCR para Sars-Cov2, and an encoding problem is detected in this dataset.
– For the previous reason, each dataset is studied separately.
Fig. 6.
Exam Distribution (Einstein, Fleury, SL)
Einstein and Fleury present analytes related to covid19, i.e. resultado covid19, Covid19 deteccao por PCR, Covid19 material and more. Again, Fleury presents a variety of names for analytes related to covid19, and SL Hospital does not have any in the top 20 (see Fig. 7).
Fig. 7.
Analyte per month(Einstein, Fleury, SL)
Considering analytes related to covid19, Fig. 8 presents the number of detected/not detected results per month for Einstein Hospital. Fleury and SL do not have standardized outputs for covid19 exams, so it is not yet possible to generate the corresponding graphics.
Fig. 8.
Analyte per month(Einstein)
Considering the top 14 de_analito values and their de_resultado, Fig. 9 presents boxplots of the values for Einstein Hospital. Qualitative values cannot be plotted, so only numerical values were used to build the plot. The graphic shows many outliers in many of the analytes, so a cleaning process is necessary.
Fig. 9.
Boxplot of top 14 analytes (Einstein)
Splitting the data into covid19 detected and not detected gives Fig. 10. Again, outliers are present, here in the Fleury dataset. Red marks detected and blue marks not detected.
Fig. 10.
Boxplot of top 14 analytes(Fleury)
A cleaning process based on the standard deviation (std) is proposed, because the outliers lie far from the median. Normally a value two or three standard deviations away is considered abnormal, but here, to obtain a better visualization of the boxplot, thresholds of 0.5*std (see Fig. 11) and 0.2*std (see Fig. 11) were used on the Einstein dataset, considering analytes with abnormal values.
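A filter of this kind can be sketched as below. The text does not state which center the threshold is measured from, so this sketch assumes the median (an assumption, as is the helper name `filter_by_std`): a value is kept when it lies within k standard deviations of the median of its analyte.

```python
import statistics

def filter_by_std(values, k=0.5):
    """Keep values within k sample standard deviations of the median."""
    if len(values) < 2:
        return list(values)  # spread is undefined for fewer than two values
    center = statistics.median(values)
    spread = statistics.stdev(values)
    return [v for v in values if abs(v - center) <= k * spread]

# Hypothetical results for one analyte, with one extreme value.
values = [10, 11, 12, 13, 14, 100]
print(filter_by_std(values, k=0.5))
print(filter_by_std(values, k=0.2))
```

Measuring from the median rather than the mean keeps the center close to the bulk of the data when a single extreme value inflates both the mean and the standard deviation.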
Fig. 11.
Boxplot of Cleaned dataset of Analytes with Abnormal Values, 0.5*std
The next graphics are created by splitting the Einstein dataset by sex. Some analytes have NaN values in the reference value, so they are discarded from the graphics. Table 5 presents the de_analito values without a valid reference range, a total of 8.
Table 5.
No valid de_analito for no valid reference range

| De_analito | Unit | Reference Range |
|---|---|---|
| Neutrófilos | % | nan |
| Dosagem de Glicose | nan | nan |
| Basófilos | % | nan |
| Eosinófilos | % | nan |
| Monócitos | % | nan |
| Linfócitos | % | nan |
| Leucócitos | x10^3/uL | nan |
| Plaquetas | x10^3/uL | nan |
Fig. 12 plots the distribution of the 30 most frequent analytes for men.
Fig. 12.
Men Analytes
Fig. 13 presents the distribution for positive cases of covid19. In the two images, Figs. 12 and 13, it is possible to observe a concentration of outliers at the sides of the normal distribution, i.e. TGO, TGP, Creatinina, Neutrófilos.
Fig. 13.
Men Analytes - Positives covid19 cases
Fig. 14 shows the result after cleaning the values and considering only patients with positive cases, from the date when covid19 was detected until the case finishes or remains open (no date for a discard test), because the aim of the analysis is to understand the behaviour of patients with a positive covid19 diagnosis during the active phase of the virus, from start to end. In Fig. 14 the outliers have disappeared, with the exception of Basófilos.
Fig. 14.
Filtered Men Analytes - Positives covid19 cases
Finally, Table 6 presents the steps used to clean the data and generate Fig. 14. First, only numerical values are considered; then null values are discarded; and finally values out of the reference range are not considered. Checking whether values are inside the reference range was done manually, because there were many reference values per analyte; only the lowest and highest values were used to filter the data. The reduction per analyte ranges from 0.83% to 75.30%. The initial number of exams was 108,152 and the final number after filtering was 86,814, a reduction of almost 20% of the available data. The dataset is now ready to answer more questions, and the research can continue.
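Table 6.
Reduction of Dataset

| de_analito | Initial | Only Numericals | Not null | Range | Reduction (%) |
|---|---|---|---|---|---|
| Magnésio | 2733 | 2733 | 2725 | 675 | 75.30 |
| TGO | 1884 | 1884 | 1865 | 1799 | 4.51 |
| TGP | 1887 | 1887 | 1873 | 748 | 60.36 |
| Cálcio Iônico mmol/L | 3585 | 3585 | 3553 | 3494 | 2.54 |
| Neutrófilos | | | | | |
| Total | 108152 | | | 86814 | 19.73 |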
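The Reduction column of Table 6 is consistent with the percentage of records removed between the initial count and the count after the range filter. The short check below recomputes it from the values printed in the table; the formula and the function name `reduction` are our reading of the table, not something the paper states explicitly.

```python
def reduction(initial, final):
    """Percentage of records removed, rounded to two decimals."""
    return round((initial - final) / initial * 100, 2)

# (initial, after range filter) pairs taken from Table 6.
rows = {
    "Magnesio": (2733, 675),
    "TGO": (1884, 1799),
    "TGP": (1887, 748),
    "Calcio Ionico": (3585, 3494),
    "Total": (108152, 86814),
}
for name, (initial, final) in rows.items():
    print(name, reduction(initial, final))
```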
The coronavirus pandemic is active around the world, scientists are working to understand how to stop the virus, and many areas are studying the impact of covid19 on health and the economy; therefore datasets related to patients are useful and important. The Fapesp initiative to bring together universities and hospitals is remarkable because it can foster research on the topic.
Real-world datasets are not clean or ready for Data Mining or Data Science tasks, so an exploratory phase is mandatory to see whether the data can be representative or useful for answering questions. Many cleaning steps were necessary to generate the final dataset and graphic. This cleaning reduced the available dataset of men by about 20%, with a maximum of 75.30% for the Magnesium analyte, so a meaningful reduction of the data is possible when a cleaning task is performed.
Finally, sharing the analysis process is useful for researchers interested in analyzing this dataset, because it can save time and effort in future research.
For researchers interested in working with these datasets, consider:
– Check the range of dates of each dataset to know whether the data is useful for your study.
– The Sirio-Libanes Hospital dataset has some issues related to encoding and is the smallest dataset, so you must analyze whether it is useful for your analysis and look for the problems in order to fix them.
– Only the Einstein dataset has a standardized output for covid19 exams: detected or not detected. If you are from Computer Science or a related field, this is better for your study, because Fleury has a variety of outputs and therefore requires the presence or advice of someone from Medicine to explain the different values.
– If you want to automate filtering by the reference range of values, remember that there are many reference values for many analytes, so the suggestion is to check them manually first to see whether it is possible to code the process.
For further work, crossing the data with other variables is proposed to improve the analysis, i.e. socio-economic data, pre-existing health issues of the patients, and data from other hospitals to enhance the study. On the other hand, a deeper analysis will be performed with this new cleaned dataset.
Acknowledgement
The author thanks Fabio Faria, professor at UNIFESP (Federal University of São Paulo), for the invitation to analyze this dataset, and the DS-Covid team for the discussion about the graphics generated during the data analysis task. More news about future work will be available at https://dscovid.github.io/.
References
1. Abril, E.: Ministério da Saúde confirma 3 casos suspeitos de coronavírus no Brasil (Jan 2020), https://web.archive.org/web/20200129042253/https://exame.abril.com.br/brasil/ministerio-da-saude-confirma-3-casos-suspeitos-de-coronavirus-no-brasil/
2. Araujo, M.B., Naimi, B.: Spread of sars-cov-2 coronavirus likely to be constrained by climate. medRxiv (2020). https://doi.org/10.1101/2020.03.12.20034728
3. Araujo, M.B., Naimi, B.: Spread of sars-cov-2 coronavirus likely to be constrained by climate. medRxiv (2020). https://doi.org/10.1101/2020.03.12.20034728
4. AS/COA: The Coronavirus in Latin America (Aug 2020)
5. Braziliense, C.: Casos suspeitos de coronavírus são registrados em Porto Alegre e Curitiba (Jan 2020)
6. Chire Saire, J.E.: How was the mental health of colombian people on march during pandemics covid19? medRxiv (2020). https://doi.org/10.1101/2020.07.02.20145425
7. Chire Saire, J.E.: Infoveillance based on social sensors to analyze the impact of covid19 in south american population. medRxiv (2020). https://doi.org/10.1101/2020.04.06.20055749
8. Chire Saire, J.E., Oblitas, J.: Covid19 surveillance in peru on april using text mining. medRxiv (2020). https://doi.org/10.1101/2020.05.24.20112193
9. Chire Saire, J.E., Pineda-Briseno, A.: Text mining approach to analyze coronavirus impact: Mexico city as case of study. medRxiv (2020). https://doi.org/10.1101/2020.05.07.20094466
10. Drias, H.H., Drias, Y.: Mining twitter data on covid-19 for sentiment analysis and frequent patterns discovery. medRxiv (2020). https://doi.org/10.1101/2020.05.08.20090464
11. Folha: Brasil confirma primeiro caso do novo coronavírus (Jan 2020)
12. Globo: Brasil tem 13.993 mortes e 202.918 casos confirmados de novo coronavírus, diz ministério (May 2020), https://g1.globo.com/bemestar/coronavirus/noticia/2020/05/14/brasil-tem-13993-mortes-causadas-pelo-novo-coronavirus-diz-ministerio.ghtml
13. Globo: Ministério investiga caso suspeito de coronavírus em MG e pede que viagens à China sejam evitadas (Jan 2020), https://g1.globo.com/ciencia-e-saude/noticia/2020/01/28/ministerio-da-saude-confirma-caso-suspeito-de-coronavirus-em-mg.ghtml
14. Ji, X., Tang, Z., Wang, K., Li, X., Li, H.: Analysis of epidemic situation of new coronavirus infection at home and abroad based on rescaled range (r/s) method. medRxiv (2020). https://doi.org/10.1101/2020.03.15.20036756
15. Mello, L.E., Suman, A., Medeiros, C.B., Prado, C.A., Rizzatti, E.G., Nunes, F.L.S., Barnabé, G.F., Ferreira, J.E., S, J., Reis, L.F.L., Rizzo, L.V., Sarno, L., de Lamonica, R., Maciel, R.M.d.B., Cesar-Jr, R.M., Carvalho, R.: Opening Brazilian COVID-19 patient data to support world research on pandemics (Jul 2020). https://doi.org/10.5281/zenodo.3966427
16. Mizumoto, K., Kagaya, K., Chowell, G.: Early epidemiological assessment of the transmission potential and virulence of coronavirus disease 2019 (covid-19) in wuhan city: China, january-february, 2020. medRxiv (2020). https://doi.org/10.1101/2020.02.12.20022434
17. Shearer, C.: The crisp-dm model: The new blueprint for data mining. Journal of Data Warehousing (4) (2000)
18. Waheed, A., Goyal, M., Gupta, D., Khanna, A., Al-Turjman, F., Pinheiro, P.R.: Covidgan: Data augmentation using auxiliary classifier gan for improved covid-19 detection. IEEE Access, 91916–91923 (2020)
19. Yuan, X., Hu, K., Xu, J., Zhang, X., Bao, W., Lynch, C.F., Zhang, L.: State heterogeneity of human mobility and covid-19 epidemics in the european union. medRxiv (2020). https://doi.org/10.1101/2020.06.10.20127530