Deep learning to estimate the physical proportion of infected region of lung for COVID-19 pneumonia with CT image set
Wei Wu, Yu Shi, Xukun Li, Yukun Zhou, Peng Du, Shuangzhi Lv, Tingbo Liang, Jifang Sheng
1. State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, the First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou 310003, China.
2. Artificial Intelligence Lab, Hangzhou AiSmartVision Co., Ltd., Hangzhou, Zhejiang 310012, People's Republic of China.
3. Department of Radiology, the First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang 310003, People's Republic of China.
4. Zhejiang Provincial Key Laboratory of Pancreatic Disease, Department of Hepatobiliary and Pancreatic Surgery, Innovation Center for the Study of Pancreatic Diseases, the First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou 310003, China.
Correspondence to Wei Wu, PhD, MD. E-mail: [email protected]. ORCID: 0000-0003-4657-4088.

Abstract
Utilizing computed tomography (CT) images to quickly estimate the severity of cases with COVID-19 is one of the most straightforward and efficacious methods. However, due to the three-dimensional structure and ambiguous edges of infected regions, quantitative diagnosis may be difficult for clinical physicians and radiologists.

Two tasks were studied in the present paper. One was to segment the mask of the intact lung in cases of pneumonia. The other was to generate the masks of regions infected by COVID-19. The masks of these two parts were then converted to corresponding volumes to calculate the physical proportion of the infected region of the lung.

A total of 129 CT image sets were collected and studied. The intrinsic Hounsfield value of the CT images was first utilized to generate the initial "dirty" version of labeled masks for both the intact lung and infected regions. The samples were then carefully adjusted and improved by two professional radiologists to generate the final training set and test benchmark. Two deep learning models were evaluated: UNet and 2.5D UNet. For the segmentation of infected regions, a deep learning based classifier was appended to remove unrelated blur-edged regions that were wrongly segmented out, such as the air tube and blood vessel tissue.

For the segmented masks of the intact lung and infected regions, the best methods achieved 0.972 and 0.757 in mean Dice similarity coefficient on our test benchmark. For the overall proportion of the infected region of the lung, the final result showed 0.961 (Pearson's correlation coefficient) and 11.7% (mean absolute percent error).
The instant proportion of infected regions of the lung could be used as visual evidence to assist clinical physicians in determining the severity of a case. Furthermore, a quantified report of infected regions can help predict the prognosis for COVID-19 cases that were scanned periodically within the treatment cycle.

Keywords: COVID-19, deep learning, coronavirus pneumonia, CT image set

Introduction
Coronavirus disease 2019, or COVID-19, has become a worldwide pandemic and caused great public health problems [1-3]. COVID-19 cases can be divided into light, moderate, severe, and extremely severe types from the physicians' perspective. Patients of the latter two types exhibited higher intensive care unit (ICU) admission rates as well as death rates [4,5] compared with the other two types. It is therefore essential to identify severe and extremely severe patients as early as possible.

In the Diagnosis and Treatment Protocol for COVID-19 (version 7) [6] released by the National Health Commission of China, the clinical characteristics of severe cases include decreased lymphocyte count, increased levels of inflammatory factors, and rapid growth of the volume of infected regions on CT images. A patient should be classified and treated as a severe case if the volume of the infected region increases by more than 50% within 48 hours. Therefore, continuously monitoring the volume of infected regions may provide valid evidence to predict the prognosis for COVID-19 patients.

However, with infected regions in the lung, the Hounsfield Unit (HU) values of the lesion regions are difficult to distinguish from those of healthy tissues. The infected regions may appear as a mist of blur-edged cloud or adhere to normal tissues on CT images. It is costly in effort for a professional radiologist or physician to separate these lesion regions from healthy lung parenchyma. Furthermore, a set of CT images usually consists of dozens or hundreds of lung images, which makes it almost impossible to analyze the lesion regions quantitatively over the images manually. Therefore, it is urgent to find an automatic method to estimate the proportion of the infected region of the lung for COVID-19 from chest CT scans.

To date, several studies have concentrated on deep learning based models for diagnosing COVID-19.
Some studies [7-10] demonstrated that COVID-19 can be distinguished from other types of pneumonia with good accuracy. Compared with classification models, the annotation of CT image samples is highly significant and much more time-consuming in the training of segmentation models for the intact lung as well as infected regions. Shan et al. [11] adopted a human-in-the-loop strategy to iteratively update the annotation of their training samples. Liu et al. [12] synthesized part of their training and test dataset with a Generative Adversarial Network (GAN). They achieved 0.706 measured in mean Dice similarity coefficient (m-Dice) for the segmentation of infected regions and 0.961 (Pearson correlation coefficient) for the total percent volume of lung parenchyma that was affected by disease. Ma et al. [13] annotated 20 sets of COVID-19 CT images and utilized previously available lung datasets, such as lung cancer, to assist the segmentation. Yan et al. [14] also investigated the segmentation of infected regions due to COVID-19. They employed a team of six annotators with deep radiology background and proficient annotating skills to label the areas and boundaries of the intact lung and infection regions due to COVID-19. A feature variation block was introduced in the segmentation of infected regions, which could better differentiate the diseased area from the lung. Furthermore, they used the more effective progressive atrous spatial pyramid pooling in the feature extraction stage as well. The optimum m-Dice achieved in their studies for the entire lung and infected regions was 0.987 and 0.726, respectively. These studies suffered from the tremendous effort to label the training samples as well as the relatively low accuracy measured in m-Dice.

In this study, we tried to establish a fully automatic deep learning system to estimate the physical proportion of the infected region of the lung for COVID-19 pneumonia from CT image sets. The main contributions of this paper can be summarized as follows.

First, the HU value of CT images was utilized as a threshold to generate the initial "dirty" version of labeled masks for both the intact lung and infected regions. These first-round labeled samples significantly alleviated the labor expense of annotation compared with starting from scratch. These preliminary samples were then revised and improved by two professional radiologists to generate the final training set and test benchmark.

Second, it was observed from our experiments that a certain number of blur-edged healthy structures, which had a similar appearance to infected regions, were likely to be identified incorrectly.
Such healthy tissues included the air tube, blood vessels, and blurred regions of the lung at the border. Therefore, a deep learning based classifier was employed to further clarify candidate regions, which could effectively increase the accuracy of the final segmentation results. In addition, the proposed classifier was much easier to train compared with pixel-level segmentation models.

Materials and methods
Study Dataset
A total of 129 transverse-section CT samples were collected, including 105 from 105 patients (mean age 51 years; 58 [55.2%] male) with COVID-19 from the First Affiliated Hospital of Zhejiang University, from January 19 to March 31, 2020. Every COVID-19 patient was confirmed with a reverse transcription polymerase chain reaction (RT-PCR) kit, and cases with no image manifestations on the chest CT images were excluded. There were 80 (62.0%) COVID-19 cases of light to moderate types, and the remaining 49 (38.0%) cases were of severe to extremely severe types. All CT imaging was in the digital imaging and communications in medicine (DICOM) format with 5 mm thickness between slices.

The study was approved by the ethics committee of the First Affiliated Hospital, School of Medicine, Zhejiang University, and all research was performed in accordance with relevant guidelines and regulations. All participants and/or their legal guardians signed the informed consent form prior to commencing the study.

A total of 108 CT samples (83.7%) were used for the training and validation datasets, and the remaining 21 CT sets (16.3%) were used as a test benchmark.
Figure 1. Typical labeled CT images: (a) CT image with pneumonia; (b) CT image without pneumonia. The fields within the blue line denote the masks for the intact lung, and those within the red line represent the masks for infected regions.
Process
Figure 2 shows the whole diagnostic process of COVID-19 report generation in this study. As a digital gray-scale image has pixel values in the range [0, 255], the raw CT data were converted from HU to this interval accordingly. The HU data matrix was clipped to [-1200, 600] (any value beyond this range was set to -1200 or 600 accordingly) and then linearly normalized to [0, 255] to fit the digital image format for further processing. Next, the infected regions and the intact lung were segmented separately to obtain corresponding masks. For the differentiation of infected regions, a deep learning based classifier was utilized to remove unrelated masks that were wrongly distinguished. Finally, the volumes of the two parts were calculated according to the masks achieved in the above-mentioned steps, yielding the proportion of infected regions in the lung.
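The clipping and normalization step described above can be sketched as follows. This is our own minimal illustration in NumPy, not the authors' code; the function name is an assumption.

```python
import numpy as np

def hu_to_uint8(hu: np.ndarray, hu_min: float = -1200.0, hu_max: float = 600.0) -> np.ndarray:
    """Clip a raw HU matrix to [hu_min, hu_max] and linearly rescale it to [0, 255],
    mirroring the preprocessing described in the text."""
    clipped = np.clip(hu.astype(np.float32), hu_min, hu_max)
    scaled = (clipped - hu_min) / (hu_max - hu_min) * 255.0
    return scaled.astype(np.uint8)
```

For example, a voxel at -1200 HU (or below) maps to 0 and a voxel at 600 HU (or above) maps to 255, so the whole diagnostic range fits the 8-bit image format expected by the downstream models.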
Figure 2 . Study flow chart.
Evaluation criteria

The performance of the proposed method was evaluated using the Dice similarity coefficient (Dice), measuring the similarity between the ground truth and the predicted score maps. It is calculated as follows:

Dice = 2|A ∩ B| / (|A| + |B|)    (1)

where A is the volume of the segmented lesion region and B denotes the ground truth. The mean Dice (m-Dice) over the whole test benchmark was used to evaluate the final outcomes. Two ground truth masks were used in this study: the ground truth for the intact lung and the ground truth for infected regions. The proportion of infected regions of the lung (PoIR) is given by:

PoIR = (volume of infected regions / volume of intact lung) × 100%    (2)

Pearson's correlation coefficient was used to evaluate the correlation of two variables:

r = (N Σᵢ xᵢyᵢ − Σᵢ xᵢ Σᵢ yᵢ) / sqrt([N Σᵢ xᵢ² − (Σᵢ xᵢ)²][N Σᵢ yᵢ² − (Σᵢ yᵢ)²])    (3)

where N is the total number of observations and xᵢ and yᵢ, i = 1, ..., N, are the observed variables. We used Pearson's correlation coefficient to calculate the correlation between predicted PoIRs and the corresponding values derived from the ground truth. Furthermore, the mean absolute percent error (m-APE), which is a measure of the prediction accuracy of a forecasting method, was used to quantify the relative errors between the predicted PoIRs and the ground truth values on the test benchmark:

m-APE = (1/n) Σᵢ₌₁ⁿ |PoIRᵢ,predicted − PoIRᵢ,ground truth| / PoIRᵢ,ground truth × 100%    (4)

Use the value of HU
The most straightforward way to segment the desired lung regions is with the aid of the HU value as a threshold, which reflects the degree of X-ray absorption of different tissues. For instance, the HU value for lung parenchyma usually ranges from -800 to -500, and the window for other soft tissues is from +100 to +300. This margin is usually solid enough to separate the lung from other tissues. However, when there are infected regions in the lung, the HU value of the lung parenchyma could extend from -750 to 150 (based on our statistics on the test benchmark in Fig 3). Therefore, the segmentation result with a HU threshold is typically not accurate enough for clinical application. Alternatively, these labeled images could be used as the initial annotated samples in our study.
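For illustration, a window-based mask for lung parenchyma might look like the following. This is a sketch under our own naming assumptions, not the study's code; the default window is the typical lung range quoted above.

```python
import numpy as np

def lung_window_mask(hu: np.ndarray, lo: float = -800.0, hi: float = -500.0) -> np.ndarray:
    """Rough lung-parenchyma mask: True for voxels whose HU value falls inside
    the typical lung window. Infected tissue with HU up to ~150 escapes this
    window, which is why thresholding alone is not clinically sufficient."""
    return (hu >= lo) & (hu <= hi)
```

A voxel at -700 HU (healthy parenchyma) is kept, while a consolidated region at, say, 100 HU is missed, matching the limitation described in the text.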
Figure 3. The distribution of HU values based on the ground truth of the proposed test benchmark (including some air tube and blood vessel tissue, etc., as they were hard to separate from lung structures). Most of the pixels were located within [-750, 150].

First, an arithmetic progression of HU values (from -800 to 0 with an increment of 50) was used as the threshold to segment the intact lung and obtain the corresponding masks. The segmented masks with the maximum m-Dice (compared with the ground truth of the intact lung) were used as the masks for the intact lung. In the next step, the masks obtained with the different HU values were subtracted from this intact-lung mask to obtain difference masks. The difference masks with the maximum m-Dice (compared with the ground truth of infected regions) were used as the masks of infected regions.
Figure 4. (a) Ground truth of the intact lung (green line) and infected regions (yellow line); (b) mask of the lung obtained with HU = -200 (red line); (c) mask of the lung obtained with HU = -750 (red line); (d) mask of infected regions obtained by subtracting (c) from (b) (red line).
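The threshold sweep and mask-subtraction procedure above can be sketched as follows. This is a simplified single-image illustration under our own naming assumptions; the study scores whole benchmark sets with m-Dice, whereas here a per-image Dice (Eq. 1) stands in.

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) over boolean masks (Eq. 1)."""
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * inter / denom if denom else 1.0

def best_threshold_mask(hu: np.ndarray, gt_lung: np.ndarray):
    """Sweep HU thresholds from -800 to 0 in steps of 50 and keep the mask
    whose Dice against the ground-truth lung mask is highest."""
    best_t, best_mask, best_d = None, None, -1.0
    for t in range(-800, 1, 50):
        mask = hu <= t  # everything at or below the threshold counts as lung
        d = dice(mask, gt_lung)
        if d > best_d:
            best_t, best_mask, best_d = t, mask, d
    return best_t, best_mask, best_d

def infected_candidate(lung_mask: np.ndarray, hu: np.ndarray, t_healthy: float = -750.0) -> np.ndarray:
    """Candidate infected regions: the lung mask minus voxels darker than
    t_healthy, mirroring the mask-subtraction step in Fig. 4."""
    return np.logical_and(lung_mask, hu > t_healthy)
```

The same sweep, applied once for the intact lung and once (via subtraction) for infected regions, yields the "dirty" first-round labels handed to the radiologists.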
Deep learning to segment the intact lung and the infected regions
As mentioned earlier, use of a HU threshold cannot properly segment the intact lung and infected regions. Infected regions whose HU values are close to those of other soft tissues cannot be correctly differentiated. Therefore, deep learning techniques were utilized and evaluated in the current study.

Training samples with a detailed sketch of each infected region and the intact lung are highly essential for developing deep learning models. However, due to the ambiguous edge between infected regions and normal tissue, it was extremely time-consuming to annotate thousands of lung CT images. The annotation result achieved with the HU threshold was utilized for the preliminary samples. Then, two professional radiologists further manually contoured the intact lung and infected regions based on these "dirty" samples to generate the final sample dataset for training and test.
Network structure
Two deep learning models were utilized: 2D UNet [15] (Fig. 5) and 2.5D UNet (Fig. 6). Two-dimensional (2D) deep learning models can well reflect the intra-slice information. However, they may neglect the inter-slice information and cannot fully leverage the spatial architecture of the three-dimensional (3D) stack of CT slices. On the other hand, 3D models [16, 17] suffer from tremendously increased parameters and, subsequently, difficulty in converging and overfitting, especially for a limited number of training samples. Furthermore, due to the limitation of GPU memory, the original CT images have to be cropped or resized into small cubes as the input for deep learning models. This crop or resize operation would either restrict the maximum receptive field or attenuate the resolution of the original CT images. Therefore, a pseudo-3D segmentation, or so-called 2.5D UNet [18, 19], was used for evaluation purposes, in which the same UNet backbone (with expanded network parameters) was used. In addition, three neighboring 2D slices were stacked as the inputs during training, so that the 2D network was able to detect a small range of 3D context each time. Three masks would be generated each time, and the average value of the segmentation maps would be used as the final masks for overlaps. The proposed two networks were used both for the segmentation of the intact lung and of infected regions.
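The 2.5D input construction and the averaging of overlapping predictions can be sketched as follows. This is our own NumPy illustration of the scheme described above (the network itself is omitted); function names are assumptions.

```python
import numpy as np

def stack_25d(volume: np.ndarray) -> np.ndarray:
    """Turn a (Z, H, W) CT volume into (Z-2, 3, H, W) inputs of three
    neighbouring slices: the 2.5D inputs described in the text."""
    return np.stack([volume[i:i + 3] for i in range(volume.shape[0] - 2)])

def merge_overlaps(pred_stacks: np.ndarray, depth: int) -> np.ndarray:
    """Average the up-to-three predictions each slice receives from the
    overlapping 3-slice windows, producing one map per original slice."""
    acc = np.zeros((depth,) + pred_stacks.shape[2:], dtype=np.float32)
    cnt = np.zeros(depth, dtype=np.float32)
    for i, stack in enumerate(pred_stacks):
        acc[i:i + 3] += stack
        cnt[i:i + 3] += 1.0
    return acc / cnt[:, None, None]
```

Interior slices are covered by three windows and border slices by fewer, which is why the merge divides by a per-slice count rather than a constant.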
Figure 5. UNet model.

Figure 6. 2.5D UNet model.
Using a Classifier to further clarify infected regions
It was observed that a certain number of blur-edged healthy structures, which had a similar appearance to infected regions, were likely to be identified incorrectly. Such healthy tissues included the air tube, blood vessels, and blurred regions of the lung at the border, as shown in Fig 7.
Figure 7.
Regions within the red line are healthy regions: (a) air tube and blood vessel; (b)-(d) healthy blur-edged regions at the border.

Therefore, a ResNet-18 [20] based binary classifier (Fig. 8) was utilized after the segmentation models to further clarify whether an image patch belonged to infected regions or not. The masks corresponding to healthy regions were filtered out. Compared with time-consuming pixel-level annotation of the blur-edged infected regions, the training samples of this binary classification model could be labeled and trained relatively easily.

The candidate images from the output of the segmentation model were first enclosed in a minimum circumscribed square bounding box. These image patches were then used as the input data for the binary classifier to determine the valid existence of infected regions. The classical ResNet-18 network backbone was employed for the image feature extraction part of the classifier. At the same time, generic data augmentation mechanisms, such as random clipping and left-right flipping, were performed on specimens to increase the number of training samples, prevent overfitting, and improve generalization. The output of the convolution layer was flattened to a 256-dimensional feature vector, followed by three fully-connected layers to export the final binary classification result.
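The minimum-circumscribed-square cropping step feeding the classifier can be sketched as follows. This is our own illustration, not the authors' code; the function name and padding parameter are assumptions, and the ResNet-18 classifier itself is left abstract.

```python
import numpy as np

def square_crop(image: np.ndarray, mask: np.ndarray, pad: int = 0) -> np.ndarray:
    """Crop the minimum circumscribed square around a candidate region mask.
    The resulting patch would then be resized and fed to the ResNet-18
    binary classifier described in the text."""
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    side = max(y1 - y0, x1 - x0) + 2 * pad  # square side covering the region
    cy, cx = (y0 + y1) // 2, (x0 + x1) // 2  # region centre
    half = side // 2
    y0 = max(0, cy - half)
    x0 = max(0, cx - half)
    y1 = min(image.shape[0], y0 + side)
    x1 = min(image.shape[1], x0 + side)
    return image[y0:y1, x0:x1]
```

Cropping to a square keeps the region's aspect ratio intact when the patch is later resized to the classifier's fixed input size.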
Figure 8.
The network structure of the ResNet-18 based binary classification model.
Experiment results
Segment the intact lung and infected regions with the HU threshold as the initial annotated samples

An arithmetic progression of HU values (from -800 to 0 with an increment of 50) was used as the threshold to segment the intact lung directly and generate corresponding masks. Each resulting mask was evaluated on the test benchmark cases, as shown in Fig 9. The maximum m-Dice for the intact lung and infected regions was 0.921 (HU = -200) and 0.530 (HU = -750), respectively. The m-Dice accuracy decreased significantly when the HU threshold was set below -150, as many unrelated regions, e.g. the stomach, were wrongly segmented as lung.
Figure 9.
The maximum m-Dice for the intact lung (HU = -200) and infected regions (HU = -750) could be achieved.
Generate mask for intact lung using deep learning models
UNet and 2.5D UNet were utilized in the present study. For regions with light opacity, the HU threshold alone could achieve satisfactory segmentation results. However, for regions with high opacity, the deep learning models achieved obviously superior results. The m-Dice of UNet and 2.5D UNet for the intact lung was 0.972 and 0.967, respectively. There was no significant difference between the two deep learning models.
Figure 10.
The first row of CT images belongs to a case with light opacity, and the second row belongs to a case with high opacity. (a) Ground truth mask (green line); (b) segmentation with HU (-200); (c) segmentation with UNet; (d) segmentation with 2.5D UNet; (e) ground truth mask (green line); (f) segmentation with HU (-200); (g) segmentation with UNet; (h) segmentation with 2.5D UNet.
Generate mask for infected regions
The same UNet and 2.5D UNet deep learning models were evaluated. The m-Dice of UNet and 2.5D UNet for infected regions was 0.684 and 0.693, respectively. The latter model further concentrated on the inter-slice characteristics and demonstrated a 1.3% improvement in the results. It was observed that most of the infected regions could be included in the output of the segmented masks. However, many blur-edged normal tissues were wrongly detected as infected regions, including the air tube, blood vessels, stomach, and part of the border of the lung. Therefore, a classifier was appended to further remove those unrelated regions.
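Filtering segmented candidate regions with the classifier can be sketched as follows. This is a hypothetical glue function of our own; `classify` stands in for the trained ResNet-18 model's probability output, and the 0.45 default mirrors the operating point reported for the classifier.

```python
import numpy as np
from typing import Callable, List

def filter_components(components: List[np.ndarray],
                      classify: Callable[[np.ndarray], float],
                      threshold: float = 0.45) -> List[np.ndarray]:
    """Keep only the candidate region masks that the binary classifier
    scores at or above the decision threshold; healthy-looking candidates
    (air tube, vessels, border blur) are discarded."""
    return [mask for mask in components if classify(mask) >= threshold]
```

In practice each component would be cropped to a square patch and scored by the network; here any callable returning a probability works, which also makes the filtering stage easy to unit-test in isolation.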
Performance of the binary classifier

The receiver operating characteristic (ROC) curve for the ResNet-18 based classifier is depicted in Figure 11. The area under the curve (AUC) was 0.913, and when the decision threshold was set to 0.45, the classification exhibited its best performance with an accuracy of 93.8%. The m-Dice of UNet and 2.5D UNet for infected regions was 0.743 and 0.758, respectively, after filtering with this classification model. Compared with the results directly from segmentation, the m-Dice improved by 8.6% and 9.2%, respectively. The segmentation results for infected regions and the summarized m-Dice for the different methods are shown in Fig. 12 and Table 1.
Figure 11.
The ROC curve for the binary classifier.
Figure 12.
The first row of CT images belongs to a case with light opacity, and the second row belongs to a case with high opacity. The ground truth mask is marked with a green line and the predicted infected region with a red line. (a) The prediction of UNet; (b) further clarified by the classifier; (c) the prediction of 2.5D UNet; (d) further clarified by the classifier; (e) the prediction of UNet; (f) further clarified by the classifier; (g) the prediction of 2.5D UNet; (h) further clarified by the classifier.

Methods and mean Dice:
Intact lung:      HU only (= -200)  0.941
                  UNet              0.972
                  2.5D UNet         0.967
Infected regions: HU only (= -750)  0.530
                  UNet              0.684
                  UNet + classifier 0.743
                  2.5D UNet         0.693
                  2.5D UNet + classifier 0.758

Table 1.
Summarized m-Dice between the predicted masks and the measurements derived from the ground truth of our test benchmark.

With the aid of the masks of the intact lung and infected regions, the Pearson's correlation coefficient of the PoIRs could be obtained (0.961), which showed a very strong correlation between the predicted masks and those derived from the ground truth. Furthermore, the m-APE of the PoIRs on the test benchmark could also be obtained (11.7%), which indicated that the average relative error between the predicted PoIRs and the ground truth values was slightly more than 10%.

Discussion & conclusions
With the rapid development of artificial intelligence technology, the experience of professional radiologists, such as in the segmentation of medical images, can be solidified in deep learning models to produce a quantitative analysis report. Several methods were developed to investigate the segmentation of the intact lung and infected regions, including the HU threshold, UNet, and 2.5D UNet. In addition, a fine-tuned classifier was appended to further remove wrongly segmented healthy regions and improve the accuracy of the outputs. We referred to the design methodology of nnUNet [21], which has achieved good results in many different medical segmentation tasks. Its authors suggested that if the target data are very anisotropic, then a 2D UNet may actually be a better choice. For example, in the segmentation of the pancreas, which is a blur-edged object on the images as well, the 2D network actually outperformed its 3D counterparts.

As a matter of fact, the most challenging work in the calculation of the proportion of infected regions was the annotation of images, especially for the regions affected by pneumonia. We utilized the intrinsic HU value of CT images to create the initial version of labeled images. Even though they were dirty samples, it took much less effort for professional radiologists to further modify and improve this first-round version.
Furthermore, the annotation of samples and the training of a fine-tuned binary classifier were much easier than pixel-level segmentation. Compared with the direct result of the state-of-the-art segmentation algorithm, the classifier could improve the m-Dice of infected regions by around 9%.

For the calculation of the proportion of infected regions, the Pearson's correlation coefficient between the predictions and the ground truth showed a strong correlation, which could serve as an objective indicator for monitoring the progress of a patient at fixed intervals. Furthermore, the m-APE showed promising outcomes as a reference for the decisions of clinical physicians.

In the future, doctors can carry out quantitative analysis of the severity of COVID-19 patients with this model, alone or combined with other clinical data such as the blood oxygenation index. At the same time, they can compare sequential CT scans of the same case to predict the prognosis and provide a reliable basis for treatment.

However, this study had several limitations. In some cases, the segmentation models would possibly identify healthy tissues together with valid infected regions, and the following classifier could not remove these "valid" infected regions. Therefore, the corresponding masks in such scenarios would be larger than the ground truth. Moreover, additional COVID-19 CT cases from different subtypes should be included to promote the accuracy of segmentation and classification. Some atypical infection signs, such as pleural effusions, cannot be distinguished with this model.
Acknowledgements
This study was supported by the Zhejiang province natural science fund for emergency research (LED20H190003) and by the China national science and technology major project fund (20182X10101-001).
Compliance with ethics guidelines
All authors declare that they have no conflict of interest or financial conflicts to disclose.

References

[1] Zhu N, Zhang D, Wang W, et al. A Novel Coronavirus from Patients with Pneumonia in China, 2019. N Engl J Med. 2020;382(8):727-733.
[2] Li Q, Guan X, Wu P, et al. Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia. N Engl J Med. 2020;382(13):1199-1207.
[3] Cohen J, Normile D. New SARS-like virus in China triggers alarm. Science. 2020;367(6475):234-235.
[4]
Ruan Q, Yang K, Wang W, et al. Clinical predictors of mortality due to COVID-19 based on an analysis of data of 150 patients from Wuhan, China. Intensive Care Med. 2020;46(5):846-