CovTANet: A Hybrid Tri-level Attention Based Network for Lesion Segmentation, Diagnosis, and Severity Prediction of COVID-19 Chest CT Scans
Tanvir Mahmud, Md. Jahin Alam, Sakib Chowdhury, Shams Nafisa Ali, Md Maisoon Rahman, Shaikh Anowarul Fattah, Mohammad Saquib
IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS
Abstract—Rapid and precise diagnosis of COVID-19 is one of the major challenges faced by the global community to control the spread of this overgrowing pandemic. In this paper, a hybrid neural network is proposed, named CovTANet, to provide an end-to-end clinical diagnostic tool for early diagnosis, lesion segmentation, and severity prediction of COVID-19 utilizing chest computer tomography (CT) scans. A multi-phase optimization strategy is introduced for solving the challenges of complicated diagnosis at a very early stage of infection, where an efficient lesion segmentation network is optimized initially, which is later integrated into a joint optimization framework for the diagnosis and severity prediction tasks, providing feature enhancement of the infected regions. Moreover, for overcoming the challenges with diffused, blurred, and varying shaped edges of COVID lesions with novel and diverse characteristics, a novel segmentation network is introduced, namely the Tri-level Attention-based Segmentation Network (TA-SegNet). This network has significantly reduced semantic gaps in subsequent encoding-decoding stages, with immense parallelization of multi-scale features for faster convergence, providing considerable performance improvement over traditional networks. Furthermore, a novel tri-level attention mechanism has been introduced, which is repeatedly utilized over the network, combining channel, spatial, and pixel attention schemes for faster and more efficient generalization of contextual information embedded in the feature map through feature re-calibration and enhancement operations. Outstanding performances have been achieved in all three tasks through extensive experimentation on a large publicly available dataset containing 1110 chest CT-volumes, which signifies the effectiveness of the proposed scheme at the current stage of the pandemic.
Index Terms—COVID-19, computer tomography (CT) scan, computer-aided diagnosis, lesion segmentation, neural network.
I. INTRODUCTION

Since the onset of the coronavirus disease (COVID-19) in December 2019, its extremely infectious nature and high mortality rate have severely jeopardized global healthcare systems, and it has been declared one of the most devastating global pandemics in history [1]. Although the reverse transcription-polymerase chain reaction (RT-PCR) assay is considered the gold standard
for COVID-19 diagnosis, the shortage of this expensive test kit, coupled with the elongated testing protocol and relatively low sensitivity (60-70%), calls for an alternative diagnostic tool that is sufficiently efficient for prompt mass screening [2]. To provide immediate and proper clinical support to critical patients, severity quantification of the infection is also a dire need. With numerous success stories in the fields of clinical diagnostics and biomedical engineering, artificial intelligence (AI)-assisted diagnostic paradigms can be embraced as a medium of paramount importance for conducting automated diagnosis and severity quantification of COVID-19 with substantial accuracy and efficiency [3], [4]. Several deep learning-based frameworks have been explored in recent times deploying automated screening of chest radiography and computer tomography as vital sources of information for COVID diagnosis [5]–[9]. However, owing to the relatively higher sensitivity and the provision of enhanced infection visualization in the three-dimensional representation, CT-based screening is a more viable alternative than its X-ray counterparts. With a large number of asymptomatic patients, early detection of COVID-19 through imaging modalities is still a stupendously challenging task due to the significantly smaller, scattered, and obscure regions of infection that are difficult to distinguish [10].

T. Mahmud, M. J. Alam, S. Chowdhury, M. M. Rahman, and S. A. Fattah are with the Department of Electrical and Electronic Engineering, and S. N. Ali is with the Department of Biomedical Engineering, Bangladesh University of Engineering and Technology, Bangladesh, e-mail: {tanvirmahmud, fattah}@eee.buet.ac.bd, {jahinalam.eee.buet, sakibchowdhury131, snafisa.bme.buet, 2maisoon1998}@gmail.com. M. Saquib is with the Department of Electrical Engineering, The University of Texas at Dallas, USA, email: [email protected].
These diverse heterogeneous characteristics of infections among different subjects also make severity prediction an extremely difficult objective to achieve [11]. The scarcity of considerably large, reliable datasets further increases the complexity of the endeavor. Recent studies mostly opt for solving this daunting task partially, where infection segmentation, diagnosis, or severity analysis have been attempted separately [12]–[14]. Such methods lack the complete integration of the objectives needed to provide a robust clinical tool.

Deep learning-based approaches have been widely incorporated in diverse medical imaging applications for their unprecedented performance even in challenging conditions. Several state-of-the-art networks have been investigated for the segmentation of COVID lesions from chest CTs, including FCN [15], U-Net [16], UNet++ [17], and ResUNet [6]. However, precise segmentation of COVID lesions remains a major challenge due to the patchy, diffused, and scattered distributions of the infections, involving ground-glass opacities, pleural effusions, and consolidations [18]. Traditional U-Net and its variants with similar encoder-decoder architectures suffer from increased semantic gaps between the corresponding

Fig. 1. Graphical overview of the optimization scheme of CovTANet: the Tri-level Attention-based Segmentation Network (TA-SegNet) extracts the slice-wise lesion segmentation mask and representational features of the corresponding CT-volume, which are employed later for the joint optimization of severity prediction and diagnosis. Separate Tri-level Attention Units (TAUs) are employed to enhance the diagnostic features and severity-based features in the joint optimization process.

scale of feature maps of the encoder-decoder modules, while experiencing vanishing gradient problems and several optimization issues due to the sequential optimization strategy of multi-scale features. Moreover, the contextual information generated at different scales of representation does not properly converge into the final reconstruction of the segmentation mask, which results in sub-optimal performance. Lately, numerous attention-gated mechanisms have been paired with traditional segmentation frameworks and have demonstrated highly promising performance in terms of gathering more contextual information through redistribution of the feature space [12], [19].

In this paper, CovTANet, an end-to-end hybrid neural network, is proposed that is capable of performing precise segmentation of COVID lesions along with accurate diagnosis and severity prediction. The intricate network of the proposed scheme emerges as an effective solution by overcoming the limitations of the traditional approaches.
The major contributions of this work can be summarized as follows:
1) A novel tri-level attention guiding mechanism is proposed, combining channel, spatial, and pixel domains for feature recalibration and better generalization.
2) A tri-level attention-based segmentation network (TA-SegNet) is proposed for precise segmentation of COVID lesions, integrating the triple attention mechanism with parallel multi-scale feature optimization and fusion.
3) A multi-phase optimization scheme is introduced by effectively integrating the initially optimized TA-SegNet with the joint diagnosis and severity prediction framework.
4) A system of networks is proposed for efficient processing of CT-volumes to integrate all three objectives for improving performance in challenging conditions.
5) Extensive experimentation has been carried out over a large number of subjects with diverse levels and characteristics of infections.

II. METHODOLOGY
The proposed CovTANet network is developed in a modular way, focusing on diverse clinical perspectives including precise COVID diagnosis, automated lesion segmentation, and effective severity prediction. The whole scheme is represented in Fig. 1. Here, a hybrid neural network (CovTANet) is introduced for segmenting COVID lesions from CT-slices as well as for providing effective features of the region-of-lesions, which are later integrated for the precise diagnosis and severity prediction tasks. The complete optimization process is divided into two sequential stages for efficient processing. Firstly, a neural network, named the Tri-level Attention-based Segmentation Network (TA-SegNet), is designed and optimized for slice-wise lesion segmentation from a particular CT-volume. A tri-level attention gating mechanism is introduced in this network, with multifarious architectural renovations to overcome the limitations of the traditional Unet network (Section II-B), which gradually accumulates effective features for precise segmentation of COVID lesions. Because of the complications associated with blurred, diffused, and scattered patterns of COVID lesions, direct utilization of the final segmented portions for diagnosis may result in loss of information due to some false positive estimations. The proposed CovTANet aims to resolve this issue by extracting effective features regarding the regions-of-infection utilizing the initially optimized TA-SegNet, as it is optimized for precisely segmenting COVID lesions with diverse levels, types, and characteristics. Therefore, slice-wise effective features are extracted utilizing the optimized TA-SegNet network and deployed into the second phase of training
for the joint optimization of the diagnosis and severity prediction tasks. Additionally, separate regional feature extractors are employed for generating more generalized forms of the slice-wise feature vectors from different lung regions. Subsequently, these generalized feature representations of the CT-slices are guided into separate volumetric feature aggregation and fusion schemes through the proposed tri-level attention mechanism for extracting the significant diagnostic features as well as severity-based features. The diagnostic path is supposed to extract the more generalized representation of infections, while the severity path is more concerned with the levels of infection. Both the diagnostic and severity predictions are optimized through a joint optimization strategy with an amalgamated loss function. The whole training and optimization process is summarized in Algorithm 1. In addition, the optimization flow of the complete CovTANet network is shown in Fig. 2. Several architectural submodules of CovTANet are discussed in detail in the following sections.

Algorithm 1: Training and Optimization of CovTANet
Data: CT-volume data V; segmentation masks M; diagnostic labels Y_D; severity labels Y_S
Result: weight matrices W_TA-SegNet, W_RFex, W_TAU, W_Fd, W_Fs, W_fc1, W_fc2
/* Optimize the TA-SegNet */
// Extract 2D CT-slices and masks
S_V = Slice-Extractor(V); S_M = Slice-Extractor(M)
Initialize weights W_TA-SegNet
while training loss L_Seg > threshold ε do
    S_M^p = W_TA-SegNet(S_V)
    L_Seg = L_FTL(S_M^p, S_M)
    Run optimizer and update W_TA-SegNet
end
/* Optimize the Classifier Unit */
Initialize W_RFex, W_TAU, W_Fd, W_Fs, W_fc1, W_fc2
while training loss L > threshold ε do
    for i ← 1 to N do
        // Extract per-slice features
        for j ← 1 to s do
            f_{i,j} = W_TA-SegNet^intmd(S_{V_i,j})
            F_{i,j} = W_RFex(f_{i,j})
        end
        // Aggregate volumetric features
        A_diagnostic = W_Fd(W_TAU(F_{i,k})) |_{k=1}^{s}
        A_severity = W_Fs(W_TAU(F_{i,k})) |_{k=1}^{s}
        // Generate predictions
        Y_D^p = σ(W_fc1(A_diagnostic))
        Y_S^p = σ(W_fc2(A_severity))
    end
    Calculate L = L_d(Y_D, Y_D^p) + L_s(Y_D, Y_S, Y_S^p)
    Run optimizer and update W_RFex, W_TAU, W_Fd, W_Fs, W_fc1, W_fc2
end
N and s denote the number of CT-volumes and the slices per volume, respectively. W_TA-SegNet^intmd represents the intermediate part of the TA-SegNet weight matrix that provides the per-slice feature vector.
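As a rough illustration, the two-phase procedure of Algorithm 1 can be sketched with stand-in models. The function names, the scalar "weight", and the squared-error phase-1 loss below are illustrative stubs, not the paper's implementation; only the control flow (threshold-driven phase 1, then a gated joint loss in phase 2) mirrors the algorithm:

```python
from math import log

def phase1_train_segnet(slices, masks, lr=0.2, eps=1e-6, max_steps=1000):
    """Phase 1: fit a toy scalar 'TA-SegNet' until the training loss
    drops below the threshold eps (the first while-loop of Algorithm 1).
    Squared error stands in for the focal Tversky loss L_FTL."""
    w, loss = 0.0, float("inf")
    for _ in range(max_steps):
        errors = [w * s - m for s, m in zip(slices, masks)]
        loss = sum(e * e for e in errors) / len(errors)
        if loss <= eps:
            break
        grad = sum(2 * e * s for e, s in zip(errors, slices)) / len(errors)
        w -= lr * grad  # toy gradient step
    return w, loss

def bce(y, p, tiny=1e-7):
    """Binary cross-entropy for a single label/probability pair."""
    p = min(max(p, tiny), 1.0 - tiny)
    return -(y * log(p) + (1 - y) * log(1.0 - p))

def joint_loss(y_d, p_d, y_s, p_s):
    """Phase 2 objective: diagnosis loss over all M volumes plus the
    severity loss, gated by y_d and averaged over the M_I infected ones."""
    m, m_i = len(y_d), max(sum(y_d), 1)
    l_d = sum(bce(y, p) for y, p in zip(y_d, p_d)) / m
    l_s = sum(yd * bce(ys, ps) for yd, ys, ps in zip(y_d, y_s, p_s)) / m_i
    return l_d + l_s
```

For example, `phase1_train_segnet([1.0, 2.0], [0.5, 1.0])` converges to the weight 0.5 that reproduces the toy masks, and `joint_loss` only accumulates severity terms for volumes with `y_d = 1`, matching the gating described above.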
Fig. 2. Optimization flowchart of the proposed CovTANet network.
A. Proposed Tri-level Attention Scheme
The attention mechanism, first proposed in [20] for enhanced contextual information extraction in natural language processing, has been adopted in numerous fields, including medical image processing [21], [22]. This mechanism assists faster convergence with considerable performance improvement by eliminating the redundant parts while putting more attention on the regions-of-interest through the generalization of the predominant contextual information. In this work, we have proposed a novel self-supervised attention mechanism combining three levels of abstraction for improved generalization of the relevant contextual features, i.e. channel-level, spatial-level, and pixel-level. The channel attention (CA) mechanism operates on a broader perspective to emphasize the corresponding channels containing more information, while the spatial attention (SA) mechanism concentrates more on the local spatial regions containing regions of interest, and finally, the pixel attention (PA) mechanism operates on the lowest level to analyze the feature relevance of each pixel. However, relying only on the higher level of attention causes loss of information, while relying on lower/local levels may weaken the effect of generalization. Hence, to reach the optimum point of generalization and re-calibration of the feature space, we have introduced a tri-level attention unit (TAU) that integrates the advantages of all three levels of attention. This TAU module is repeatedly used all over the CovTANet network (Fig. 1) to improve the feature relevance through feature recalibration.

In general, the proposed attention mechanisms operating at different levels of abstraction (shown in Fig. 3) can be divided into two phases: a feature re-calibration phase followed by a feature generalization phase. In each phase, a statistical description of the intended level of generalization is extracted, which is processed later for generating the corresponding attention map.
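The two-phase computation just described, and the tri-level combination it builds toward, can be sketched in NumPy. This is a minimal illustration only: the fully connected weights are random stand-ins for learned parameters, the spatial and pixel maps are stubbed with random sigmoid outputs instead of convolutional filtering, and the blending weight α is fixed rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def channel_attention(f_in, ratio=4):
    """Channel attention sketch: expand-excite re-calibration followed by
    squeeze-excite generalization, with random (untrained) FC weights."""
    h, w, c = f_in.shape
    dense = lambda x, n: x @ (rng.standard_normal((x.shape[-1], n)) * 0.1)
    # re-calibration: channel descriptor D_c (global average over H, W),
    # expanded to C*ratio and restored to C
    d_c = f_in.mean(axis=(0, 1))
    a_r = sigmoid(dense(dense(d_c, c * ratio), c))
    f_r = f_in * a_r                                # broadcast over H, W
    # generalization: squeeze the descriptor to C // ratio, restore to C
    d2 = f_r.mean(axis=(0, 1))
    return sigmoid(dense(dense(d2, c // ratio), c)).reshape(1, 1, c)

h, w, c = 6, 6, 8
f_in = rng.standard_normal((h, w, c))
a_c = channel_attention(f_in)                       # (1, 1, C)
# spatial and pixel maps stubbed with random sigmoid outputs here
a_s = sigmoid(rng.standard_normal((h, w, 1)))       # (H, W, 1)
a_p = sigmoid(rng.standard_normal((h, w, c)))       # (H, W, C)

a_t = a_p * (a_s * a_c)      # tri-level mask, broadcast to (H, W, C)
alpha = 0.5                  # learnable in the paper; fixed here
f_out = alpha * f_in + (1 - alpha) * (f_in * a_t)
```

Note that NumPy broadcasting performs the "channel broadcasting" expansion automatically: the (1, 1, C) channel map and the (H, W, 1) spatial map combine into a full (H, W, C) volumetric mask.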
Fig. 3. Schematic of the proposed channel, spatial, and pixel attention mechanisms. Each attention mechanism integrates a feature re-calibration operation through an expand-excitation scheme, followed by a feature generalization operation through a squeeze-excitation scheme.

Fig. 4. Schematic of the proposed Tri-level Attention Unit (TAU) integrating the channel attention (CA), spatial attention (SA), and pixel attention (PA) mechanisms. Here, a channel broadcasting operation is carried out before element-wise multiplication/addition of the feature maps.

Let F_in ∈ R^{H×W×C} be the input feature map, where (H, W, C) represent the height, width, and channels of the feature map, respectively. Here, the channel description D_c ∈ R^{1×1×C} is generated by taking the global averages of the pixels of particular channels, the spatial description D_s ∈ R^{H×W×1} is created by convolutional filtering, and the input feature map F_in itself serves as the pixel description D_p ∈ R^{H×W×C}.

Afterwards, the feature re-calibration phase is carried out by projecting the descriptor vector D to a higher-dimensional space, followed by restoration of the original dimension, to generate the re-calibration attention map A_r, which is utilized to obtain the re-calibrated feature map F_r. This process assists in the redistribution of the feature space in the subsequent feature generalization phase for better generalization of features through sharpening the effective representative features. It can be represented as:

F_r = F_in ⊗ A_r = F_in ⊗ σ(W_R(W_E(D)))    (1)
    = F_in ⊗ σ(W_R(W_E(W_D(F_in))))         (2)

where ⊗ represents element-wise multiplication with the required dimensional broadcasting operation, W_D denotes the statistical descriptor extractor, W_E represents the dimension-expansion filtering, W_R represents the dimension-restoration filtering, and σ(·) represents the sigmoid activation. For the channel attention mechanism, W_E and W_R are realized by fully connected layers, while for spatial and pixel attention, convolutional filters are employed.

Subsequently, the feature generalization operation is carried out through squeeze-and-excitation operations on the re-calibrated feature space F_r to generate the effective attention map A. In this phase, the extracted feature descriptor D′ is projected into a lower-dimensional space to extract the most effective representational features and is thereafter reconstructed back to the original dimension. Such sequential dimension reduction and reconstruction operations provide an opportunity to emphasize the generalized features while reducing the redundant ones. Hence, the generated attention map A reduces the effect of redundant features by providing more attention to the effective features, and it can be represented as:

A = σ(W_R′(W_S(D′))) = σ(W_R′(W_S(W_D′(F_r))))    (3)

where W_S and W_R′ represent the corresponding squeeze and restoration filtering, respectively, while W_D′ represents the statistical descriptor extractor. Therefore, three levels of attention maps are generated, i.e.
a channel attention map A_C ∈ R^{1×1×C}, a spatial attention map A_S ∈ R^{H×W×1}, and a pixel attention map A_P ∈ R^{H×W×C}. The tri-level attention unit (TAU), represented in Fig. 4, generates the effective volumetric triple attention mask A_T by integrating all three maps, which is given by:

A_T = A_P ⊗ (A_S ⊗ A_C)    (4)

Later, this accumulated attention mask A_T is used to transform the input feature map F_in to F′ for enhancing the region-of-interest, and finally the output feature map F_out is generated through the weighted addition of the input and transformed feature maps. These can be summarized as:

F′ = F_in ⊗ A_T    (5)
F_out = T(F_in) = αF_in + (1 − α)F′    (6)

where T(·) represents the proposed tri-level attention mechanism, and α is a learnable parameter that is optimized through the back-propagation algorithm along with the other parameters.

B. Proposed Tri-level Attention-based Segmentation Network (TA-SegNet)
The proposed TA-SegNet network is deployed for segmenting the infected lesions as well as for extracting features for the subsequent joint diagnosis and severity prediction tasks (as shown in Fig. 1). For better segmentation, this network introduces several modifications over traditional networks, which are mostly based on fully convolutional networks (FCN) and Unet. FCN and Unet are the most widely explored networks for medical image segmentation. In FCN, a single stage of
Fig. 5. Schematic representation of the proposed Tri-level Attention-based Segmentation Network (TA-SegNet), integrating numerous tri-level attention unit (TAU) modules for semantic-gap reduction between the encoder and decoder modules as well as for efficient reconstruction of the lesion mask.

Fig. 6. Representation of the proposed regional feature extractor module.

encoder module is employed to generate different scales of encoded feature maps from the input image, and afterwards the segmentation mask is reconstructed through joint processing of the multi-scale encoded features. The Unet network, in contrast, considerably improved the performance by introducing a decoder module following the encoder module to sequentially gather the contextual information of the segmentation mask. Moreover, to recover the loss of information caused by downscaling, each level of the encoder and decoder modules is directly connected through skip connections in Unet. Despite that, a semantic gap is generated across such direct skip connections between the corresponding scales of the encoder and decoder feature maps, which hinders proper optimization. For deeper implementations of the encoder/decoder modules, this network further suffers from the vanishing gradient problem, since different scales of feature maps are optimized sequentially.

The proposed TA-SegNet network (shown in Fig. 5) integrates the advantages of both Unet and FCN by introducing an encoder-decoder based network with reduced semantic gaps along with the opportunity of parallel optimization of multi-scale features. Firstly, the input images pass through sequential encoding stages with convolutional filtering, followed by sequential decoding operations similar to the Unet. Moreover, the output feature map generated from each layer of the encoder unit is connected to the corresponding decoder layer through a Tri-level Attention Unit (TAU) mechanism for better reconstruction in the decoder unit. For further generalization and refinement of contextual features, all scales of decoded feature representations also pass through another stage of the attention mechanism.
Afterwards, to introduce joint optimization of multi-scale features, the attention-gated, refined feature maps generated at different stages of the encoder and decoder modules are accumulated through a series of operational stages. Initially, sequential concatenation of the corresponding encoder-decoder layer outputs (after attention gating) is carried out. Following that, channel downscaling operations through convolutional filtering and bilinear spatial upsampling operations are employed to produce feature vectors with uniform dimensions. Afterwards, these uniform feature vectors are accumulated through channel-wise concatenation to generate the fusion vector F_fus, and it can be represented as:

F_fus = F_{i=1}^{N} (T(E_i) ⊕ T(D_i))    (7)

where ⊕ represents feature concatenation, E_i and D_i stand for the i-th level of feature representations from the total N levels of the encoder and decoder modules, respectively, T(·) represents the tri-level attention unit operation, and F(·) represents the multi-scale feature fusion operation.

Afterwards, the final convolutional filtering is operated on the fusion feature map F_fus to produce the output segmentation mask. Moreover, to introduce transfer learning in TA-SegNet, as in other networks, the encoder module can be replaced by different pre-trained backbone networks for better optimization. Hence, the proposed TA-SegNet facilitates faster convergence through parallel optimization of the multi-scale features while effectively extracting the region-of-interest from each scale of representation with the novel tri-level attention gating mechanism, providing optimum performance even in the most challenging conditions.
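The multi-scale fusion of Eq. (7) amounts to resizing each gated encoder/decoder output to a common spatial size and concatenating everything along the channel axis. A minimal NumPy sketch follows; it substitutes nearest-neighbour upsampling for the bilinear upsampling used in the paper and the identity for the TAU gating T(·), so only the shape bookkeeping is faithful:

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour spatial upsampling of an (H, W, C) map.
    The paper uses bilinear; this keeps the sketch dependency-free."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse_multiscale(enc_feats, dec_feats):
    """Eq. (7) sketch: bring each encoder/decoder pair to a uniform
    spatial size, then concatenate all of them channel-wise."""
    target = max(f.shape[0] for f in enc_feats)
    parts = []
    for e, d in zip(enc_feats, dec_feats):
        parts.append(upsample(e, target // e.shape[0]))
        parts.append(upsample(d, target // d.shape[0]))
    return np.concatenate(parts, axis=-1)

# two toy scales, as in a small encoder-decoder: (16,16,8) and (8,8,16)
enc = [np.zeros((16, 16, 8)), np.zeros((8, 8, 16))]
dec = [np.ones((16, 16, 8)), np.ones((8, 8, 16))]
f_fus = fuse_multiscale(enc, dec)
```

For this two-scale toy example the fused map has shape (16, 16, 48), since 8 + 8 + 16 + 16 = 48 channels are concatenated, which is exactly why a final convolution is needed to reduce F_fus to the segmentation mask.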
C. Proposed Regional Feature Extractor Module
Though the proposed TA-SegNet is optimized for providing precise segmentation performance on challenging COVID lesion extraction, some loss of information is expected to occur, especially at the early stages of infection, when it is difficult to extract relatively small and scattered infection patches. To overcome this limitation, the final fusion vector F_fus generated by TA-SegNet is incorporated into further processing, instead of the segmented lesion, as it contains the effective feature representations of the region-of-infections. For further emphasizing the COVID lesion features, a regional feature extractor module (RF_ex) is also proposed that separately operates on each slice-wise fusion vector F_fus and thus generates the effective regional feature representation F_reg. From Fig. 1, it is to be noted that this regional feature extractor module operates separately on the extracted feature vectors of each CT-slice and hence enhances the effective regional features regarding the infection. The architectural details of this module are presented in Fig. 6. It consists of several stages of convolutional filtering while incorporating the Tri-level Attention Unit at each stage. The attention units operated at different stages are supposed to execute different roles. As we go deeper into the RF_ex module, more generalized feature representations are created through subsequent pooling operations, where the information becomes more sparsely distributed among the increased channels. Hence, the attention units at the earlier stages enhance the more detailed, localized feature representations, while at the deeper stages the attention units learn to expedite the generalization process. Therefore, the regional feature extractor module effectively incorporates the proposed tri-level attention mechanism to extract the most generalized representative features of infections from different regions of the respective CT-volume.

D. Volumetric Feature Aggregation and Fusion Module
The regional features extracted from each slice of the CT-volume are supposed to be optimized through a joint processing module for the final diagnosis and severity prediction. This module accumulates the volumetric features from the generalized feature representation of each slice, and introduces an effective fusion of features to generate the corresponding representative feature vector of the CT-volume. Moreover, this module plays an influential role in the proper selection of features, especially in the early stage of infection, when few of the slices contain infected lesions. To facilitate the feature selection process, the processing of severity-based features and diagnostic features is isolated. In Fig. 1, separate volumetric feature aggregation and fusion modules are integrated to separately optimize the diagnostic and severity features. Though similar operational modules are employed in both cases, another stage of attention-gating operations is employed to guide the effective slice-wise features in these operational modules with different objectives (shown in Fig. 1). This module is schematically presented in Fig. 7. Firstly, the volumetric feature accumulation is carried out to produce the aggregated feature vector F_agg from the regional features (F_reg) of all slices.

Fig. 7. Proposed volumetric feature accumulation and fusion scheme used for severity and diagnostic feature extraction.

Thereupon, the fusion scheme is employed utilizing dilated convolutions [23], which provide the opportunity to explore features from diverse receptive areas. Firstly, a pointwise (1×1) convolution is carried out for depth reduction of the aggregated vector F_agg. Subsequently, several dilated convolutions with varying dilation rates are operated for the effective fusion of features, and the outputs of these convolutions are processed through another stage of aggregation, convolutional filtering, and global pooling operations to generate a 1D representational feature vector. Finally, several fully connected layers are employed for generating the final prediction for a specific CT-volume.

E. Loss Functions
The optimization of the whole process is divided into two phases, where the TA-SegNet is optimized in the first phase, and joint optimization of the diagnostic and severity prediction tasks is carried out in the second phase utilizing the optimized TA-SegNet from phase 1. The focal Tversky loss function (L_FTL), proposed in [24] utilizing the Tversky index and performing well over a large range of applications, is used as the objective function to optimize TA-SegNet.

In general, both the COVID diagnosis and severity predictions are defined as binary classification tasks, where normal/disease classes are considered for diagnosis, while mild/severe classes are considered for severity prediction. For the joint optimization of diagnosis and severity prediction, an objective loss function (L_obj) is defined by combining the objective loss functions for diagnosis (L_d) and severity prediction (L_s). The severity prediction task is only initiated for the infected volumes, where y_d = 1, while for the normal cases (y_d = 0), this task is ignored. However, the
TABLE I
PERFORMANCE (MEAN ± STANDARD DEVIATION) OF THE ABLATION STUDY OF THE PROPOSED TA-SEGNET ON MOSMED DATA

Version | EfficientNet Backbone | Encoder TAU Unit | Decoder TAU Unit | Encoder in Fusion | Decoder in Fusion | Dice (%)
V1 | ✗ | ✗ | ✗ | ✗ | ✗ | —
V2 | ✗ | ✗ | ✗ | ✗ | ✓ | —
V3 | ✗ | ✓ | ✗ | ✗ | ✗ | —
V4 | ✗ | ✓ | ✗ | ✗ | ✗ | —
V5 | ✗ | ✗ | ✓ | ✗ | ✗ | —
V6 | ✗ | ✓ | ✓ | ✗ | ✗ | —
V7 | ✗ | ✓ | ✓ | ✓ | ✓ | —
V8 | ✓ | ✓ | ✓ | ✓ | ✓ | —
TABLE II
COMPARISON OF PERFORMANCES WITH OTHER STATE-OF-THE-ART NETWORKS ON COVID LESION SEGMENTATION ON MOSMED DATA

Networks | Sensitivity (%) | Precision (%) | Dice (%) | IoU (%)
FCN [27] | 78.8 | — | — | —
TA-SegNet (Prop.) | 99.6 | — | — | —

diagnosis task is carried out for all normal/infectious volumes. Hence, the objective loss function (L_obj) can be expressed as:

L_obj = L_d(Y_d, Y_d^p) + L_s(Y_d, Y_s, Y_s^p)
      = (1/M) Σ_{i=1}^{M} L_B(y_{d,i}, y_{d,i}^p) + (1/M_I) Σ_{i=1}^{M} y_{d,i} L_B(y_{s,i}, y_{s,i}^p)    (8)

where Y_d and Y_s represent the sets of diagnosis and severity ground truths, while Y_d^p and Y_s^p represent the corresponding sets of predictions, L_B denotes the binary cross-entropy loss, M denotes the total number of CT-volumes, and M_I represents the total number of infected volumes. Hence, the proposed CovTANet network can be effectively optimized for joint segmentation, diagnosis, and severity prediction of COVID-19 utilizing this two-phase optimization scheme.

III. RESULTS AND DISCUSSIONS
In this section, results obtained from extensive experimentation on a publicly available dataset are presented and discussed from diverse perspectives to validate the effectiveness of the proposed scheme.
A. Dataset Description
This study is conducted using "MosMedData: Chest CT Scans with COVID-19 Related Findings" [25], one of the largest publicly available datasets in this domain. The dataset, collected from hospitals in Moscow, Russia, contains 1110 anonymized CT-volumes with severity-annotated COVID-19 related findings, as well as without such findings. Each of the 1110 CT-volumes is acquired from a different person, and 30-46 slices per patient are available. Pixel annotations of the COVID lesions are provided for 50 CT-volumes, which are used for training and evaluation of the proposed TA-SegNet. For carrying out the diagnosis and severity prediction tasks, all the CT-volumes are divided into normal, mild (< 25% lung parenchyma involvement), and severe (≥ 25% involvement) classes.

B. Experimental Setup
With a five-fold cross-validation scheme over the MosMed dataset, all the experiments have been implemented on the Google Cloud platform with an NVIDIA P100 GPU as the hardware accelerator. For evaluation of the segmentation performance, some of the traditional metrics are used, such as accuracy, precision, Dice score, and intersection-over-union (IoU) score, while for assessing the severity classification performance, accuracy, sensitivity, specificity, and F1-score are used. The Adam optimizer is employed with an initial learning rate that is decayed at a rate of 0.99 after every 10 epochs.

C. Analysis of the Segmentation Performance
Firstly, ablation studies are carried out to validate the effectiveness of the different modules in TA-SegNet. Afterwards, the performance of the best-performing variant is compared with other networks from qualitative and quantitative perspectives.
1) Ablation study:
The traditional Unet network has been used as the baseline model (V1), and five other schemes/modules have been incorporated into the baseline to analyze the contribution of each module to the performance improvement of the proposed TA-SegNet (V8). For ease of comparison, only the Dice score is reported, as it is the most widely used metric for segmentation tasks. From Table I, it can be noted that the encoder TAUs (V4) provide a 4.1% improvement in Dice score over the baseline, the decoder TAUs (V5) provide a 2.9% improvement, and when both are combined (V6), a 6.6% improvement is achieved. The encoder TAUs contribute significantly to the reduction of semantic gaps with the corresponding decoder feature maps, while the decoder TAU units guide the decoded
TABLE III
COMPARISON OF PERFORMANCES WITH OTHER STATE-OF-THE-ART NETWORKS ON COVID LESION SEGMENTATION ON DATASET-2

[Columns: Network, Sensitivity (%), Specificity (%), Dice (%), IoU (%). Rows: Unet [16], MultiResUnet [30], Attention-Unet [31], CPF-Net [29], Gated-Unet [32], Inf-Net [12], TA-SegNet (ours). Numeric entries (mean ± std) are illegible here.]
Fig. 8. Visualization of the lesion segmentation performance of some state-of-the-art networks on MosMedData [25] and Dataset-2 [26]. Here, 'green' denotes the true positive (TP) regions, 'blue' the false positive (FP) regions, and 'red' the false negative (FN) regions.
TABLE IV
COMPARISON OF PERFORMANCES IN THE JOINT DIAGNOSIS AND SEVERITY PREDICTION OF COVID-19 WITH DIFFERENT NETWORKS ON MOSMEDDATA

[Columns: Network; Diagnostic Prediction (Normal vs. Mild, Normal vs. Severe) and Severity Prediction (Mild vs. Severe), each with Sen. (%), Spec. (%), Acc. (%), F1 (%). Recoverable entries: 54.4 for VGG-19 and 83.8 for CovTANet (ours); the remaining values are illegible here.]

feature maps with finer details for better generalization of multi-scale features; considerable performance improvement is achieved when they are employed in combination. Moreover, all the multi-scale feature maps generated at the various encoder levels are guided into the reconstruction process through a deep fusion scheme, along with the multi-scale decoded feature maps. The integration of these multi-scale features from the encoder-decoder modules in the fusion process (V3) contributes to efficient reconstruction, and a 4.4% improvement in Dice score is achieved over the baseline. Moreover, a 9.7% improvement in Dice score is achieved when the fusion scheme is combined with the two-stage TAU units (V7). Additionally, to introduce transfer learning, models pre-trained on the ImageNet database can be used as the backbone of the encoder module of TA-SegNet, as in most other segmentation networks. It should be noted that with the pre-trained EfficientNet network as the encoder backbone (V8), the performance improves by 2.1% compared to the TA-SegNet framework without such a backbone (V7).
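The improvement figures quoted for the ablation variants read most naturally as relative gains over the compared configuration. A small helper makes the arithmetic explicit; note that treating the figures as relative (rather than absolute point differences) is an interpretation, not something the text states outright.

```python
def relative_improvement(score, reference):
    """Percent improvement of `score` over `reference` as a relative gain.

    Assumption: the paper's improvement percentages are relative gains;
    they could alternatively be absolute differences in points.
    """
    return 100.0 * (score - reference) / reference
```

For example, a variant scoring 71.0 Dice against a 66.0 baseline corresponds to a relative gain of about 7.6%.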
2) Quantitative analysis:
In Table II, the performances of some state-of-the-art networks are summarized. It should be noted that the proposed TA-SegNet outperforms all compared methods by a considerable margin in every metric. With the proposed framework, an 11.8% improvement in Dice score over Unet and a 26.7% improvement over the FCN have been achieved. Furthermore, our network improves the Dice score of the second-best method (Inf-Net) by about 10.5%, which indicates its capability relative to the rest of the models. The robustness of the proposed scheme and its enhanced ability to identify infected regions are further demonstrated by the high sensitivity score (99.6%). This reflects the fact that the model integrates the symmetric encoding-decoding strategy of Unet while also exploiting the parallel optimization advantages of FCN, which together provide this large improvement. Most other state-of-the-art Unet variants deliver sub-optimal performance as their complexity increases considerably, which makes optimization difficult in the most challenging cases. Moreover, due to the small amount of infection in the annotated CT-volumes used for training and optimizing the segmentation networks, most networks generate a large number of false positives, which reduces precision. The proposed TA-SegNet considerably reduces both false positives and false negatives, and thus improves both precision and sensitivity.
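The trade-off described here follows directly from the confusion-matrix counts of a binary lesion mask: false positives depress precision, false negatives depress sensitivity, and Dice/IoU penalize both. A minimal NumPy sketch of these metrics (not the authors' evaluation code):

```python
import numpy as np

def segmentation_metrics(gt, pred):
    """Confusion-matrix metrics for binary lesion masks.
    A minimal sketch, not the authors' evaluation code."""
    gt = np.asarray(gt, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    tp = np.sum(gt & pred)     # lesion pixels correctly predicted
    fp = np.sum(~gt & pred)    # healthy pixels flagged as lesion
    fn = np.sum(gt & ~pred)    # lesion pixels missed
    tn = np.sum(~gt & ~pred)   # healthy pixels correctly rejected
    return {
        "sensitivity": tp / (tp + fn),     # falls as FN grows
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),     # falls as FP grows
        "dice":        2 * tp / (2 * tp + fp + fn),
        "iou":         tp / (tp + fp + fn),
    }
```

Reducing FP and FN simultaneously, as claimed for TA-SegNet, therefore raises every one of these scores at once.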
3) Qualitative analysis:
In Fig. 8, qualitative representations of the segmentation performance of different networks are shown under some challenging conditions. The comparable dimensions of the small infected regions and of the arteries and veins embedded in the thorax cavity, with their varying anatomical appearance, might account for the large number of false positives. It is evident that most other networks struggle to extract the complicated, scattered, and diffused COVID-19 lesions, while the proposed TA-SegNet considerably improves the segmentation performance under such challenging conditions. This depiction confirms that our network can correctly segment both large and small infected regions. Furthermore, our framework consistently demonstrates almost non-existent false negatives compared to the other models while considerably reducing false positive predictions, as it can distinguish sharper details of the lesions and perform effectively for early diagnosis of the infection.
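The color convention of Fig. 8 (green for TP, blue for FP, red for FN) can be reproduced from a pair of binary masks. A sketch of such an overlay, assuming 2-D masks of equal shape (this is illustrative, not the figure-generation code used by the authors):

```python
import numpy as np

def tpfpfn_overlay(gt, pred):
    """RGB overlay in the Fig. 8 color scheme:
    green = true positive, blue = false positive, red = false negative.
    Unmatched background pixels remain black."""
    gt = np.asarray(gt, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    rgb = np.zeros(gt.shape + (3,), dtype=np.uint8)
    rgb[gt & pred] = (0, 255, 0)     # TP: lesion found
    rgb[~gt & pred] = (0, 0, 255)    # FP: spurious detection
    rgb[gt & ~pred] = (255, 0, 0)    # FN: missed lesion
    return rgb
```

A model with "almost non-existent false negatives" shows almost no red in such an overlay, which is exactly the visual pattern the text describes for TA-SegNet.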
D. Performance on Secondary Dataset
In Table III, the quantitative performances of different networks on Dataset-2 [26] are summarized. It should be noted that the proposed TA-SegNet again provides the best performance, with the highest mean Dice score and a clear improvement over Unet. In Fig. 8, qualitative performance analyses are provided on several challenging examples, further demonstrating the effectiveness of the proposed TA-SegNet over other traditional networks, with large reductions in both false positives and false negatives. However, it should be mentioned that this dataset contains, on average, mostly higher levels of infection in the CT volumes, which makes learning and optimization more favorable compared to MosMedData; thus, comparatively higher segmentation performances have been achieved.

E. Analysis of the Joint Classification Performance
In Table IV, the performances obtained on the joint diagnosis and severity prediction tasks are summarized. To analyze the effectiveness of the proposed multi-phase optimization scheme, some state-of-the-art networks are also evaluated for slice-wise processing of the CT-volumes in the joint classification scheme, discarding the TA-SegNet.
1) Diagnostic prediction performance analysis:
The diagnosis performances for mild and severe cases of COVID-19 are reported separately to distinguish the early-diagnosis performance. The proposed CovTANet provides 85.2% accuracy in isolating COVID patients even with mild symptoms, while the accuracy is as high as 95.8% when the CT volumes contain severe infections. However, the other networks, operating without the TA-SegNet, noticeably suffer, especially in the early diagnosis phase, as it is difficult to isolate the small infection patches from the CT-volume. Hence, the high early diagnostic accuracy of CovTANet can be attributed largely to the multi-phase optimization scheme, which incorporates the highly optimized TA-SegNet to extract the most effective lesion features and mitigate the effect of redundant healthy parts.
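The mild/severe labels used throughout this analysis follow the split given in the dataset description (mild below 25% of the lung parenchyma affected). A sketch of that labeling rule, assuming severe is simply the complement of mild:

```python
def severity_label(infected_fraction):
    """Label a CT volume from its infected lung-parenchyma fraction.

    Assumption: mild is <25% involvement (per the dataset description);
    severe is taken here as everything above that threshold.
    """
    if infected_fraction <= 0.0:
        return "normal"
    return "mild" if infected_fraction < 0.25 else "severe"
```

Under this rule, the classifier's job reduces to estimating the infected fraction accurately, which is exactly what the integrated TA-SegNet features simplify.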
2) Severity prediction performance analysis:
In the joint optimization process, mild and severe patients are also categorized based on the amount of infected lung parenchyma. Despite the additional challenges in isolating and quantifying the abnormal tissues, the proposed scheme generalizes the problem quite well, providing 91.7% accuracy in categorizing mild and severe patients. It should be noted that the highest severity prediction accuracy achievable with a traditional network is 64.8% (using ResNet50), with considerably lower results in most other metrics. A traditional network operates directly on the whole CT-volume to extract effective features for severity prediction, which makes the task more complicated. In contrast, the proposed hybrid CovTANet with multi-phase optimization effectively integrates infection-related features from the TA-SegNet, considerably simplifying feature extraction in the joint classification process and resulting in higher accuracy.

IV. CONCLUSION AND FUTURE WORKS
In this study, a multi-phase optimization scheme is proposed with a hybrid neural network (CovTANet), where an efficient lesion segmentation network is integrated into a complete optimization framework for joint diagnosis and severity prediction of COVID-19 from CT-volumes. The tri-level attention mechanism and the parallel optimization of multi-scale encoded-decoded feature maps introduced in the segmentation network (TA-SegNet) have improved the lesion segmentation performance substantially. Moreover, the effective integration of features from the optimized TA-SegNet is found to be extremely beneficial for diagnosis and severity prediction by de-emphasizing the effects of redundant features from the whole CT-volumes. It is also shown that the proposed joint classification scheme not only provides better diagnosis at severe infection stages but is also capable of early diagnosis of patients with mild infections at outstanding precision. Furthermore, considerable performances have been achieved in severity screening, which would facilitate a faster clinical response and substantially reduce the probable damage. Nonetheless, a further study should be carried out considering patients from diverse geographic locations to understand the mutation and evolution of this deadly virus, where the proposed hybrid network is expected to be very effective. The proposed scheme can be a valuable tool for clinicians to combat this pernicious disease through faster automated mass screening.

REFERENCES

[1] G. Meyerowitz-Katz and L. Merone, "A systematic review and meta-analysis of published research data on COVID-19 infection fatality rates," International Journal of Infectious Diseases, vol. 101, pp. 138-148, 2020.
[2] Y. Fang, H. Zhang, J. Xie, M. Lin, L. Ying, P. Pang, and W. Ji, "Sensitivity of chest CT for COVID-19: comparison to RT-PCR," Radiology, vol. 296, no. 2, pp. E115-E117, 2020.
[3] F. Shi, J. Wang, J. Shi, Z. Wu, Q. Wang, Z. Tang, K. He, Y. Shi, and D. Shen, "Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19," IEEE Reviews in Biomedical Engineering, 2020.
[4] M. Abdel-Basset, V. Chang, and N. A. Nabeeh, "An intelligent framework using disruptive technologies for COVID-19 analysis," Technological Forecasting and Social Change, p. 120431, 2020.
[5] L. Li, L. Qin, Z. Xu, Y. Yin, X. Wang, B. Kong, J. Bai, Y. Lu, Z. Fang, Q. Song et al., "Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT," Radiology, vol. 296, no. 2, pp. E65-E71, 2020.
[6] L. Huang, R. Han, T. Ai, P. Yu, H. Kang, Q. Tao, and L. Xia, "Serial quantitative chest CT assessment of COVID-19: Deep-learning approach," Radiology: Cardiothoracic Imaging, vol. 2, no. 2, p. e200075, 2020.
[7] M. Abdel-Basset, V. Chang, H. Hawash, R. K. Chakrabortty, and M. Ryan, "FSS-2019-nCov: A deep learning architecture for semi-supervised few-shot segmentation of COVID-19 infection," Knowledge-Based Systems, p. 106647, 2020.
[8] T. Mahmud, M. A. Rahman, and S. A. Fattah, "CovXNet: A multi-dilation convolutional neural network for automatic COVID-19 and other pneumonia detection from chest X-ray images with transferable multi-receptive feature optimization," Computers in Biology and Medicine, p. 103869, 2020.
[9] M. Abdel-Basset, V. Chang, and R. Mohamed, "HSMA WOA: A hybrid novel slime mould algorithm with whale optimization algorithm for tackling the image segmentation problem of chest X-ray images," Applied Soft Computing, vol. 95, p. 106642, 2020.
[10] J. P. Kanne, B. P. Little, J. H. Chung, B. M. Elicker, and L. H. Ketai, "Essentials for radiologists on COVID-19: an update - radiology scientific expert panel," Radiology, vol. 296, no. 2, pp. E113-E114, 2020.
[11] J. T. Wu, K. Leung, M. Bushman, N. Kishore, R. Niehus, P. M. de Salazar, B. J. Cowling, M. Lipsitch, and G. M. Leung, "Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China," Nature Medicine, vol. 26, no. 4, pp. 506-510, 2020.
[12] D. P. Fan, T. Zhou, G. P. Ji, Y. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, "Inf-Net: Automatic COVID-19 lung infection segmentation from CT images," IEEE Transactions on Medical Imaging, vol. 39, no. 8, pp. 2626-2637, 2020.
[13] Y. Qiu, Y. Liu, and J. Xu, "MiniSeg: An extremely minimum network for efficient COVID-19 segmentation," arXiv preprint arXiv:2004.09750, 2020.
[14] G. Wang, X. Liu, C. Li, Z. Xu, J. Ruan, H. Zhu, T. Meng, K. Li, N. Huang, and S. Zhang, "A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images," IEEE Transactions on Medical Imaging, vol. 39, no. 8, pp. 2653-2663, 2020.
[15] L. Huang, R. Han, T. Ai, P. Yu, H. Kang, Q. Tao, and L. Xia, "Serial quantitative chest CT assessment of COVID-19: Deep-learning approach," Radiology: Cardiothoracic Imaging, vol. 2, no. 2, p. e200075, 2020.
[16] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234-241.
[17] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "UNet++: Redesigning skip connections to exploit multiscale features in image segmentation," IEEE Transactions on Medical Imaging, vol. 39, no. 6, pp. 1856-1867, 2019.
[18] A. Bernheim, X. Mei, M. Huang, Y. Yang, Z. A. Fayad, N. Zhang, K. Diao, B. Lin, X. Zhu, K. Li et al., "Chest CT findings in coronavirus disease-19 (COVID-19): relationship to duration of infection," Radiology, vol. 295, no. 3, p. 200463, 2020.
[19] A. Sinha and J. Dolz, "Multi-scale self-guided attention for medical image segmentation," IEEE Journal of Biomedical and Health Informatics, 2020.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[21] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132-7141.
[22] Z. Tan, Y. Yang, J. Wan, H. Hang, G. Guo, and S. Z. Li, "Attention-based pedestrian attribute analysis," IEEE Transactions on Image Processing, vol. 28, no. 12, pp. 6126-6140, 2019.
[23] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," arXiv preprint arXiv:1511.07122, 2015.
[24] N. Abraham and N. M. Khan, "A novel focal Tversky loss function with improved attention U-Net for lesion segmentation," in IEEE International Symposium on Biomedical Imaging (ISBI), 2019, pp. 683-687.
[25] "MosMedData: Chest CT scans with COVID-19 related findings," 2020, accessed: 28 April, 2020. [Online]. Available: https://mosmed.ai/datasets/covid19_1110.
[26] "COVID-19 CT lung and infection segmentation dataset," 2020, accessed: 16 October, 2020. [Online]. Available: https://zenodo.org/record/3757476.
[27] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431-3440.
[28] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation," in International Conference on 3D Vision (3DV), 2016, pp. 565-571.
[29] S. Feng, H. Zhao, F. Shi, X. Cheng, M. Wang, Y. Ma, D. Xiang, W. Zhu, and X. Chen, "CPFNet: Context pyramid fusion network for medical image segmentation," IEEE Transactions on Medical Imaging, vol. 39, no. 10, pp. 3008-3018, 2020.
[30] N. Ibtehaz and M. S. Rahman, "MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation," Neural Networks, vol. 121, pp. 74-87, 2020.
[31] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., "Attention U-Net: Learning where to look for the pancreas," arXiv preprint arXiv:1804.03999, 2018.
[32] J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert, "Attention gated networks: Learning to leverage salient regions in medical images," Medical Image Analysis, vol. 53, pp. 197-207, 2019.