COVID-CT-MD: COVID-19 Computed Tomography (CT) Scan Dataset Applicable in Machine Learning and Deep Learning
Parnian Afshar, Shahin Heidarian, Nastaran Enshaei, Farnoosh Naderkhani, Moezedin Javad Rafiee, Anastasia Oikonomou, Faranak Babaki Fard, Kaveh Samimi, Konstantinos N. Plataniotis, Arash Mohammadi
Parnian Afshar, Shahin Heidarian, Nastaran Enshaei, Farnoosh Naderkhani, Moezedin Javad Rafiee, MD, Anastasia Oikonomou, MD, Faranak Babaki Fard, MD, Kaveh Samimi, MD, Konstantinos N. Plataniotis, and Arash Mohammadi
Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, Canada
Department of Electrical and Computer Engineering, Concordia University, Montreal, QC, Canada
Department of Medicine and Diagnostic Radiology, McGill University Health Center-Research Institute, Montreal, QC, Canada
Department of Medical Imaging, Sunnybrook Health Sciences Centre, University of Toronto, Toronto, Canada
Faculty of Medicine, University of Montreal, Montreal, QC, Canada
Department of Radiology, Iran University of Medical Science, Tehran, Iran
Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
* corresponding author: Arash Mohammadi ([email protected])
ABSTRACT
Novel Coronavirus (COVID-19) has drastically overwhelmed more than 200 countries, affecting millions and claiming almost 1 million lives since its emergence in late 2019. This highly contagious disease can easily spread and, if not controlled in a timely fashion, can rapidly incapacitate healthcare systems. The current standard diagnosis method, the Reverse Transcription Polymerase Chain Reaction (RT-PCR), is time consuming and subject to low sensitivity. Chest Radiograph (CXR), the first imaging modality to be used, is readily available and gives immediate results. However, it has notoriously lower sensitivity than Computed Tomography (CT), which can be used efficiently to complement other diagnostic methods. This paper introduces a new COVID-19 CT scan dataset, referred to as COVID-CT-MD, consisting not only of COVID-19 cases but also of healthy subjects and subjects infected by Community Acquired Pneumonia (CAP). The COVID-CT-MD dataset, which is accompanied by lobe-level, slice-level, and patient-level labels, has the potential to facilitate COVID-19 research; in particular, it can assist in the development of advanced Machine Learning (ML) and Deep Neural Network (DNN) based solutions.
Background & Summary
Since its first emergence in late 2019, the novel Coronavirus (COVID-19) has drastically changed the world, impacting several aspects of modern life. According to the World Health Organization (WHO), as of September 2020, more than 200 countries have confirmed positive COVID-19 cases, leading to more than 30 million cases and almost 1 million reported fatalities. Considering these statistics and impacts, together with the fact that COVID-19 can easily spread if infected cases are not isolated and treated in a timely fashion, sensitive and accessible diagnosis systems are of significant importance. Reverse Transcription Polymerase Chain Reaction (RT-PCR), which is currently considered the gold standard diagnosis technique, suffers from relatively low sensitivity. Furthermore, it was not easily accessible at the beginning of the epidemic in most countries. More importantly, the test is time consuming, which is undesirable because time is a critical factor in isolating and treating patients and in preventing the transmission of COVID-19. Because they can identify COVID-19-related respiratory complications, Chest Radiographs (CXR) can play an important complementary role to the RT-PCR test. Although CXR can act as a quantitative method to assess the extent of COVID-19 involvement and estimate the risk of Intensive Care Unit (ICU) admission, it still has lower sensitivity than Computed Tomography (CT). Due to its high sensitivity and rapid access, chest CT plays a significant role in the diagnosis and management of COVID-19 and has been recognized as the most sensitive imaging modality for detecting complications.
Despite the high potential of CT to contribute to COVID-19 research and clinical usage, publicly available datasets are mostly limited to a small number of cases, are not accompanied by other types of respiratory diseases to facilitate comparisons, and are not associated with suitable labels.
Furthermore, cases may be collected from different sources with different imaging protocols, limiting a unified study. In a few identified datasets, available CT scans are limited to only infected slices, rather than the complete volume. Another important aspect of the available datasets is whether labels are provided at the patient level, slice level, and lobe level. The latter can further help identify the location of the COVID-19 infection. Finally, different types of labels and information, suitable for different tasks, are provided in the identified datasets. Table 1 provides an overview of the available datasets along with the provided COVID-19 related information.
The introduced COVID-19 CT scan dataset, referred to as COVID-CT-MD, is applicable in Machine Learning (ML) and deep learning studies of COVID-19 classification. In particular, the COVID-CT-MD dataset consists of 171 confirmed positive COVID-19 cases (gathered from 2020/02/23 to 2020/04/21), 76 normal cases (gathered from 2019/01/21 to 2020/05/29), and 60 Community Acquired Pneumonia (CAP) cases (gathered from 2018/04/03 to 2019/11/24). All cases were collected at Babak Imaging Center in Tehran, Iran, and labeled by three experienced radiologists at the patient level, slice level, and lobe level. The patient-level label refers to a single diagnosis assigned to the subject, whereas the slice-level and lobe-level labels identify the slices and lobes demonstrating infection, respectively. More importantly, the whole CT volume is available for all subjects. COVID-CT-MD is presented in Table 1, along with the previous datasets, to highlight its differences. Regarding Reference , we would like to mention that while this Reference provides only COVID-19 and normal cases, COVID-CT-MD additionally provides CAP cases.
Furthermore, COVID-CT-MD is the only classification-related dataset that contains lobe-level information, which can significantly improve the localization and analysis of the COVID-19 infection.
Methods
This section provides a description of the data collection procedure, inclusion criteria, and de-identification. Furthermore, detailed statistics of the data are presented to facilitate its usage. More importantly, the applicability of the COVID-CT-MD dataset to the development of ML/DNN solutions is explained. The section concludes by describing the possible limitations of the provided dataset. This research work was performed based on the policy certification number 30013394 of Ethical acceptability for secondary use of medical data approved by Concordia University, Montreal, Canada. Furthermore, informed consent was obtained from all the patients.
Data Collection
The COVID-CT-MD dataset contains volumetric chest CT scans of 171 patients positive for COVID-19 infection, 60 patients with CAP, and 76 normal patients. COVID-19 cases were collected from February 2020 to April 2020, whereas CAP cases and normal cases were collected from April 2018 to December 2019 and January 2019 to May 2020, respectively, at Babak Imaging Center, Tehran, Iran. Diagnosis of COVID-19 infection is based on positive real-time Reverse Transcription Polymerase Chain Reaction (rRT-PCR) test results, clinical parameters, and CT scan manifestations identified by three thoracic radiologists with more than 20, 18, and 25 years of experience in thoracic imaging, respectively. Labels provided by the three radiologists showed a high degree of agreement (more than 90%). Diagnosis for CAP and normal cases was confirmed using clinical laboratory tests and CT scans. A subset of 55 COVID-19 and 25 CAP cases was analyzed by the radiologists to identify and label slices with evidence of infection. The labeled subset of the data contains 4,993 slices demonstrating infection and 18,416 slices without infection.
All images are axial, with a reconstruction matrix (output size of the images) of 512 ×
CT Acquisition Care in The Medical Imaging Department
As COVID-19 is highly contagious, all the staff of the medical imaging department involved in the CT acquisition are provided with personal protective equipment (PPE). More importantly, there is a minimum 5-minute gap between two consecutive CT scans, allowing enough time to sanitize the CT scanner.
Data Inclusion and Exclusion Criteria
All cases with a confirmed clinical diagnosis are included in the dataset. Nevertheless, during the data collection procedure, there were some cases dating from late 2019 with manifestations similar to those of COVID-19. However, as the first COVID-19 case in Iran was reported in early February 2020, these cases were excluded from the dataset. Furthermore, according to the radiologists’ assessment, images with poor quality and visible artifacts were excluded.
De-identification
To respect the patients’ privacy, we de-identified all the CT studies by removing the patients’ name, birthday, and the name and address of the imaging center, as well as the operators’ name from their headers. Some helpful information, including patients’ gender and age, the scanner type, and the image acquisition settings, has been kept to preserve the statistical characteristics of the dataset.
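The following is a minimal sketch of this kind of header de-identification, not the procedure the authors actually ran. It models a CT header as a plain Python dictionary rather than a real DICOM file, and the field names (standard DICOM attribute keywords such as `PatientName`) as well as the helper `deidentify` are our own assumptions chosen to mirror the fields listed above.

```python
# Hypothetical sketch: a header is modelled as a dict, not a real DICOM object.
# Field names are assumed to follow DICOM attribute keywords.
REMOVED_FIELDS = {
    "PatientName",         # patient's name
    "PatientBirthDate",    # patient's birthday
    "InstitutionName",     # name of the imaging center
    "InstitutionAddress",  # address of the imaging center
    "OperatorsName",       # operator's name
}


def deidentify(header):
    """Return a copy of `header` without the identifying fields."""
    return {key: value for key, value in header.items()
            if key not in REMOVED_FIELDS}


# Toy header with made-up values, for illustration only.
example_header = {
    "PatientName": "DOE^JOHN",
    "PatientBirthDate": "19700101",
    "InstitutionName": "Example Imaging Center",
    "InstitutionAddress": "1 Example Street",
    "OperatorsName": "OPERATOR^A",
    "PatientSex": "M",
    "PatientAge": "050Y",
    "KVP": "130",
}

clean_header = deidentify(example_header)
print(sorted(clean_header))  # remaining, non-identifying fields
```

Fields such as sex, age, and acquisition settings pass through unchanged, which matches the stated goal of preserving the statistical characteristics of the dataset.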
Data Statistics
The demographic distribution of the dataset, describing the gender and age distributions, is illustrated in Table 4 and Figure 2. As shown in Figure 2(a), males outnumbered females in this dataset. The boxplot in Figure 2(b) represents the important statistical parameters of the patients’ age distribution. As shown in this boxplot, normal cases are mainly distributed in lower ages, while CAP cases are distributed over a wide range of ages with a higher average age.
As previously stated, part of the dataset is analyzed and the slice-level labels are extracted. The number of labeled cases and slices demonstrating infection are presented in Table 5. Infection Ratio in this table represents the ratio of the slices demonstrating infection to the total number of slices in a CT scan, which varies for different cases based on the severity and stage of the disease. The minimum and maximum values for the infection ratio in the labeled dataset are presented in Table 5. The distribution of the Infection Ratio is also illustrated by the boxplots in Figure 3(a), which demonstrate a higher infection ratio in COVID-19 cases compared to CAP cases. The histogram of the Infection Ratio values is illustrated in Figure 3(b).
In addition to the described slice-level labels, the detailed distribution of infection in each lobe of the lung is provided by the radiologists. Table 6 indicates the number of cases and slices with infection demonstrated in specific lung regions. Similar to Figure 3, where the infection ratio was presented for the total slices with infection in the lung, the average lobe infection ratios are presented in Figure 4, illustrating the average ratio of slices demonstrating infection in a particular lobe to the total number of slices in a CT scan. As evident in Table 6 and Figure 4, the average infection ratio in the lower lobes is higher in both COVID-19 and CAP cases compared to other lung regions in our labeled dataset.
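The two ratios defined above can be illustrated with a short sketch. It assumes that the labels for one scan are available as plain Python lists of 0/1 values covering exactly the slices of that scan (for example after converting a row of the released NumPy arrays with `.tolist()` and trimming any padding, which is an assumption on our part). The function names and the toy scan are hypothetical.

```python
def infection_ratio(slice_labels):
    """Ratio of slices labelled 1 (infection) to all slices of one CT scan."""
    if not slice_labels:
        raise ValueError("a scan needs at least one slice")
    return sum(slice_labels) / len(slice_labels)


def lobe_infection_ratios(lobe_labels):
    """Per-lobe ratio of infected slices to all slices of one CT scan.

    `lobe_labels` holds one list per slice, with one 0/1 entry per lobe.
    """
    if not lobe_labels:
        raise ValueError("a scan needs at least one slice")
    n_slices = len(lobe_labels)
    n_lobes = len(lobe_labels[0])
    return [sum(slice_row[lobe] for slice_row in lobe_labels) / n_slices
            for lobe in range(n_lobes)]


# Toy scan with 8 slices and 5 lobes (made-up labels, for illustration only).
toy_slice_labels = [0, 0, 1, 1, 1, 0, 0, 0]
toy_lobe_labels = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

print(infection_ratio(toy_slice_labels))       # 0.375
print(lobe_infection_ratios(toy_lobe_labels))  # [0.25, 0.0, 0.25, 0.0, 0.0]
```

Averaging the per-lobe values over all labeled scans of one class would give quantities of the kind plotted in Figure 4.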
Limitations
Although all cases and labels are confirmed by three experienced radiologists, we would like to describe a few limitations that the data users may encounter. These limitations are as follows:
• The slice and lobe labeling processes focus more on regions with distinctive manifestations rather than minimal findings.
• Not all the COVID-19 patients have a confirmed positive RT-PCR result, as this test was not publicly accessible in Iran at the time of the first emergence of COVID-19. Furthermore, the high load of patients in need of COVID-19 examination did not allow for an inclusive RT-PCR test. The diagnosis of some patients in the COVID-CT-MD dataset is confirmed based on the CT findings, as well as the clinical results.
• Although most of the cases with low quality CT scans are excluded, there may still be some cases with mild motion artifact, which is inevitable since COVID-19 patients suffer from dyspnea.
• During the slice and lobe labeling process, some suspicious areas adjacent to the chest wall and diaphragm are not labeled as “infected”, due to their poor distinction.
Data Records
The diagram in Figure 5 shows the structure of the COVID-CT-MD dataset shared through Figshare. COVID-19, CAP, and Normal subjects are placed in separate folders, within which each patient has a folder containing the CT scan slices in DICOM format. “Index.csv” lists the patients that have slice-level and lobe-level labels. The indices given to patients in the “Index.csv” file are then used in “Slice-level-labels.npy” and “Lobe-level-labels.npy” to indicate the slice and lobe labels. “Slice-level-labels.npy” is a 2D binary NumPy array in which the existence of infection in a specific slice is indicated by 1 and the lack of infection by 0. In “Slice-level-labels.npy”, the first dimension represents the case index and the second one represents the slice number. “Lobe-level-labels.npy” is a 3D binary NumPy array in which the existence of infection in a specific lobe and slice is indicated by 1 in the corresponding element of the array. As in the slice-level array, the first two dimensions of “Lobe-level-labels.npy” represent the case index and slice number, respectively. The third dimension represents the lobe index.
The COVID-CT-MD dataset is accessible through Figshare (https://figshare.com/s/c20215f3d42c98f09ad0).
Technical Validation
Two noteworthy parameters in studies using CT scans are the quality control and calibration of the scanning device. The longest time period between the scanner auto-calibration and the study in the COVID-CT-MD dataset is 1 day, which ensures calibrated and accurate performance of the scanning device. Furthermore, there is an annual SIEMENS quality control that ensures the absence of ring artifacts in the acquired CT scans.
Usage Notes
With the increasing number of COVID-19 patients, healthcare workers are overwhelmed with a heavy workload, lowering their concentration for a proper diagnosis. Accurate and timely COVID-19 diagnosis, on the other hand, is a critical factor in preventing disease transmission, in treatment, and in resource allocation. Machine Learning (ML), in particular Deep Learning (DL) based on Deep Neural Networks (DNN), has been shown to be practical and effective in COVID-19 diagnosis and severity assessment. The COVID-CT-MD dataset is specifically designed to facilitate the application of ML/DL in COVID-19-related tasks. In particular, this dataset can be used towards:
• A patient-level binary classification to distinguish COVID-19 from all other cases.
• A patient-level multi-class classification to identify COVID-19, CAP, and normal subjects.
• A slice-level and lobe-level classification to separate infected slices and lobes from non-infected ones for further analysis.
• Slice-level and lobe-level labels can be used as additional inputs to segmentation models, to focus on only infected slices.
• Slice-level and lobe-level labels can be used in generative models to generate artificial COVID-19 images, towards increasing the security of healthcare systems and developing attack resilient solutions.
We have utilized the COVID-CT-MD dataset in a recent study to classify subjects as COVID-19 or non-COVID (Normal and CAP). The model proposed in this study, referred to as COVID-FACT, consists of two stages. In the first stage, infected slices (COVID-19 and CAP) are separated from healthy ones through a Capsule Network. In the second stage, the infected slices are used to classify patients as COVID-19 or non-COVID through another Capsule Network and an average voting approach. While the first stage exploits the provided slice-level labels, the patient-level ones are used in the second stage.
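The average voting step of such a two-stage pipeline can be sketched as follows. This is an illustration under our own assumptions, not the COVID-FACT implementation: the Capsule Networks are replaced by precomputed per-slice outputs, and the function names, the 0.5 threshold, and the fallback for scans without infected slices are all hypothetical.

```python
def patient_score(slice_probs, infected_mask):
    """Average the COVID-19 probabilities of the slices flagged as infected.

    `slice_probs` stands in for second-stage per-slice outputs and
    `infected_mask` for first-stage decisions (1 = infected slice).
    Returns None when no slice is flagged as infected.
    """
    selected = [prob for prob, keep in zip(slice_probs, infected_mask) if keep]
    if not selected:
        return None
    return sum(selected) / len(selected)


def classify_patient(slice_probs, infected_mask, threshold=0.5):
    """Patient-level label obtained by average voting over infected slices."""
    score = patient_score(slice_probs, infected_mask)
    if score is None:
        # Assumption: a scan with no infected slice is treated as non-COVID.
        return "non-COVID"
    return "COVID-19" if score >= threshold else "non-COVID"


# Made-up per-slice outputs for two toy patients.
print(classify_patient([0.9, 0.2, 0.8, 0.7], [1, 0, 1, 1]))  # COVID-19
print(classify_patient([0.1, 0.3, 0.2, 0.9], [1, 1, 1, 0]))  # non-COVID
```

Averaging only over slices flagged in the first stage keeps healthy slices, which carry no evidence about the infection type, from diluting the patient-level score.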
Data users are encouraged to train and test their methods on the COVID-CT-MD dataset and compare their results accordingly.
Code availability
The Python code used to generate the statistical analysis and plots is shared within the same Figshare link (https://figshare.com/s/c20215f3d42c98f09ad0).
References
Xu, X. et al. A deep learning system to screen novel coronavirus disease 2019 pneumonia. Engineering https://doi.org/10.1016/j.eng.2020.04.010 (2020).
Wang, S. et al. A deep learning algorithm using ct images to screen for corona virus disease (covid-19). Preprint at medRxiv (2020).
Ai, T. et al. Correlation of chest ct and rt-pcr testing for coronavirus disease 2019 (covid-19) in china: A report of 1014 cases. Radiology https://doi.org/10.1148/radiol.2020200642 (2020).
Cozzi, D. et al. Chest x-ray in new coronavirus disease 2019 (covid-19) infection: findings and correlation with clinical outcome. La radiologia medica https://doi.org/10.1007/s11547-020-01232-9 (2020).
Borakati, A., Perera, A., Johnson, J. & Sood, T. Chest x-ray has poor diagnostic accuracy and prognostic significance in covid-19: A propensity matched database study (2020).
Rahimzadeh, M., Attar, A. & Sakhaei, S. A fully automated deep learning-based network for detecting covid-19 from a new and large lung ct scan dataset. Preprint at medRxiv (2020).
Ozturk, T. et al. Automated detection of covid-19 cases using deep neural networks with x-ray images. Comput Biol Med https://doi.org/10.1016/j.compbiomed.2020.103792 (2020).
Afshar, P., Heidarian, Sh., Naderkhani, F., Oikonomou, A., Plataniotis, K. & Mohammadi, A. Covid-caps: A capsule network-based framework for identification of covid-19 cases from x-ray images. Preprint at arXiv:2004.02696 (2020).
Yan, T. et al. Automatic distinction between covid-19 and common pneumonia using multi-scale convolutional neural network on chest ct scans. Chaos Solitons Fractals https://doi.org/10.1016/j.chaos.2020.110153 (2020).
Fan, D. et al. Inf-net: Automatic covid-19 lung infection segmentation from ct images. Preprint at arXiv:2004.14133v4 (2020).
Mirsky, Y., Mahler, T., Shelef, I. & Elovici, Y. Ct-gan: Malicious tampering of 3d medical imagery using deep learning. Preprint at arXiv:1901.03597v3 (2020).
Heidarian, S. et al. Covid-fact: A fully-automated capsule network-based framework for identification of covid-19 cases from chest ct scans. Submitted to Frontiers in Artificial Intelligence (2020).
Bjorke, H. Covid-19 segmentation dataset. MedSeg http://medicalsegmentation.com/covid19/ (2020).
Jun, M. et al. Covid-19 ct lung and infection segmentation dataset (version 1.0). Zenodo http://doi.org/10.5281/zenodo.3757476 (2020).
Cohen, J., Morrison, P. & Dao, L. Covid-19 image data collection. Preprint at arXiv:2003.11597v1 (2020).
Morozov, S. et al. Mosmeddata: Chest ct scans with covid-19 related findings dataset. Preprint at arXiv:2005.06465 (2020).
Zhao, J., Zhang, Y., He, X. & Xie, P. Covid-ct-dataset: a ct scan dataset about covid-19. Preprint at arXiv:2003.13865 (2020).
Soares, E., Angelov, P., Biaso, S., Higa Froes, M. & Kanda Abe, D. Sars-cov-2 ct-scan dataset: A large dataset of real patients ct scans for sars-cov-2 identification. Preprint at medRxiv (2020).
Acknowledgements
This work was partially supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada through the NSERC Discovery Grant RGPIN-2016-04988.
Author contributions statement
P.A. drafted the manuscript together with A.M., and analyzed the results. Sh.H. performed the statistical analysis and organized the dataset. M.J.R., F.B.B., and K.S. collected the dataset and revised the medical information in the paper. N.E., F.N., A.O., and K.N.P. edited and revised the manuscript. A.M. supervised the study.
Competing interests
The authors declare no competing interests.
Figures & Tables
Table 1. Available COVID-19 CT scan datasets. NA stands for not available.
Column groups: Number of cases (COVID, CAP, Normal); Label type (Classification, Segmentation); Data Source (Multiple, Single); CT volume (Available, Not available); Label Level (Patient-level, Slice-level, Lobe-level).
Dataset        COVID   CAP   Normal   Check marks
Reference      49      NA    NA       ✓ ✓ ✓ ✓
Reference      20      NA    NA       ✓ ✓ ✓ ✓
Reference      20      NA    NA       ✓ ✓ ✓ ✓
Reference      856     NA    254      ✓ ✓ ✓ ✓
Reference      216     NA    55       ✓ ✓ ✓ ✓
Reference      60      NA    60       ✓ ✓ ✓ ✓
Reference      95      NA    282      ✓ ✓ ✓ ✓ ✓
COVID-CT-MD    171     60    76       ✓ ✓ ✓ ✓ ✓ ✓
Table 2. CT scan settings used to acquire the COVID-CT-MD dataset.
Columns: Diagnosis; Slice Thickness (mm); Peak Kilovoltage (kVp); Exposure Time (ms); X-ray Tube Current (mA); SID (mm); SOD (mm); Exposure values (mAs)
COVID-19   − 130 600 153 − 343 940 535 91 . − .
CAP        − 120 420 − 600 94 − 500 940 − − 570 56 . − .
Normal     − 343 940 535 79 . − .
Figure 1. The distribution of the Exposure values for COVID-19, CAP and Normal cases.
Table 3. The statistical parameters (mean and standard deviation) of the Exposure values.
Diagnosis   Exposure mean   Exposure standard deviation
COVID-19    . 15            32 .
CAP         . 96            43 .
Normal      . 12            35 .
Table 4. Gender and age distribution in COVID-CT-MD
Diagnosis   Cases   Sex            Age (year)
COVID-19    171     108 M/63 F     51 . ± .
CAP         60      35 M/25 F      57 . ± .
Normal      76      40 M/36 F      43 . ± .
Figure 2. (a) The number of cases separated by the patient’s gender. (b) The distribution of age for COVID-19, CAP and Normal cases.
Table 5. The number of cases, slices, and Infection Ratio in the labeled dataset.
Diagnosis   Cases   Slices demonstrating infection   Slices without infection   Infection Ratio
COVID-19    55      3815                             4377                       7 . − .
CAP         25      1178                             2718                       7 . − .
Table 6. Number of cases and slices, respectively, demonstrating infection in each lobe. LLL: Left Lower Lobe; LUL: Left Upper Lobe; RLL: Right Lower Lobe and Lingula; RML: Right Middle Lobe; RUL: Right Upper Lobe
Diagnosis   LLL   LUL   RLL   RML   RUL
COVID-19
CAP
Total
Figure 3. (a) The distribution of the Infection Ratio in the labeled dataset for COVID-19 and CAP cases. (b) The histogram of the Infection Ratio in the labeled dataset for COVID-19 and CAP cases.
Figure 4. Average Infection Ratio in each lobe of the lung for COVID-19 and CAP cases in the labeled dataset.
Main Folder
    COVID-19 subjects
        Subject-ID
            Slice-ID.dcm
    Cap subjects
        subject-ID
            Slice-ID.dcm
    Normal subjects
        subject-ID
            Slice-ID.dcm
    Index.csv
    Slice-level-labels.npy
    Lobe-level-labels.npy
Figure 5.