A Multi-Stage Attentive Transfer Learning Framework for Improving COVID-19 Diagnosis
Yi Liu and Shuiwang Ji, Senior Member, IEEE
Abstract—Computed tomography (CT) imaging is a promising approach to diagnosing COVID-19. Machine learning methods can be employed to train models from labeled CT images and predict whether a case is positive or negative. However, there exists no publicly available, large-scale CT dataset for training accurate models. In this work, we propose a multi-stage attentive transfer learning framework for improving COVID-19 diagnosis. Our proposed framework consists of three stages to train accurate diagnosis models through learning knowledge from multiple source tasks and data of different domains. Importantly, we propose a novel self-supervised learning method to learn multi-scale representations for lung CT images. Our method captures semantic information from the whole lung and highlights the functionality of each lung region for better representation learning. The method is then integrated into the last stage of the proposed transfer learning framework to reuse the complex patterns learned from the same CT images. We use a base model integrating self-attention layers (ATTNs) and convolutional operations. Experimental results show that networks with ATTNs induce greater performance improvement through transfer learning than networks without ATTNs, indicating that attention exhibits higher transferability than convolution. Our results also show that the proposed self-supervised learning method outperforms several baseline methods.
Index Terms—COVID-19 diagnosis, transfer learning, self-supervised learning, attention, transferability, medical image computing.
1 INTRODUCTION

The COVID-19 pandemic has spread rapidly and infected millions of people worldwide. A critical step in fighting the spread of COVID-19 is effective diagnosis of infected cases [1], [2]. The commonly used approach for COVID-19 diagnosis is reverse transcription polymerase chain reaction (RT-PCR). However, it usually takes hours to produce results, and test kits are in great shortage in some countries and areas [3], [4]. Computed tomography (CT)-aided diagnosis has become a weighty alternative due to its wide availability and easy accessibility [3], [5], [6]. To accelerate the reading of CT images, machine learning approaches have been employed to learn patterns from labeled images and then automatically make predictions for newly obtained CT images [7], [8], [9], [10].

A major challenge of CT-aided automatic diagnosis is the lack of labeled data [7], [11], [12]. To date, the largest publicly available labeled CT dataset [12] contains only several hundred images. Models trained on such small-scale datasets may generate unsatisfactory prediction results for newly obtained CT images. It is natural to leverage transfer learning to train more powerful models for accurate COVID-19 diagnosis. Transfer learning is a popular machine learning technique that learns a model from a source task where labeled data is sufficient and transfers the learned knowledge to a target task [13], [14], [15]. Existing work [7], [12] mainly focuses on using pretrained deep neural networks (DNNs) to improve the prediction performance for the target task of COVID-19 prediction. However, there is no work examining which network component, such as a convolutional layer or an attention layer, induces a larger performance improvement through transfer learning. In addition, there is little work presenting a unified transfer learning framework for medical image analysis, especially for COVID-19 CT image prediction.

Yi Liu and Shuiwang Ji are with the Department of Computer Science & Engineering, Texas A&M University, College Station, TX 77843, USA (email: [email protected], [email protected]).
In this work, we propose a multi-stage attentive transfer learning framework for improving CT-based COVID-19 diagnosis. Our proposed framework is composed of three stages based on the machine learning approaches and data used in the source tasks. First, we perform supervised transfer learning from natural images (STL-N) and supervised transfer learning from medical images (STL-M) to learn knowledge from large-scale labeled natural images and medical images, respectively. After that, we design a novel self-supervised task and perform self-supervised transfer learning from medical images (SSTL-M) to extract complex patterns from the used medical CT images. For the networks used in the transfer learning framework, we integrate self-attention layers (ATTNs) into convolutional neural networks (CNNs) such as ResNets to compare the transferability of convolutional layers and ATTNs. Specifically, we categorize networks into two groups, one of which contains ResNets and the other the same ResNets with ATTNs inserted. Both groups are then pretrained on the same source tasks and data to compare the transferability of the two groups.

We perform self-supervised transfer learning in the last stage of the proposed transfer learning framework. Existing self-supervised methods [16], [17], [18], [19], [20] achieve superior results on natural images but usually generate poor predictions on medical images. By referring to biological domain knowledge of the substructures of the human lung, we design a novel self-supervised method to learn multi-scale representations for lung CT images. Our method is capable of learning representations at both the image level and the region level. By doing this, sufficient semantic information from the whole lung is captured and the functionality of each lung region is highlighted for better representation learning. Self-supervised transfer learning is then performed to reuse the complex inherent patterns learned from the same CT images to improve the performance of the target task.

We conduct extensive experiments to evaluate our proposed approach. Experimental results show that after being pretrained with our proposed multi-stage transfer learning framework, networks with ATTNs achieve much better performance for CT-aided COVID-19 prediction than the baseline ResNets. This indicates the effectiveness of integrating ATTNs into our transfer learning framework. More importantly, networks with ATTNs obtain a much larger performance improvement through transfer learning than purely convolutional networks. This points out that, compared with convolution, attention can transfer knowledge from source tasks to target tasks more easily, which essentially reveals that attention exhibits higher transferability than convolution. In addition, we show that our proposed self-supervised learning method achieves the best performance compared with several state-of-the-art baselines. This demonstrates the effectiveness of learning multi-scale representations of lung CT images and highlighting the functionality of each lung region.
Our qualitative results demonstrate that using attention for transfer learning can successfully detect important regions for prediction. Overall, our major contributions are summarized as follows:

• We propose a multi-stage attentive transfer learning framework for improving CT-aided COVID-19 diagnosis. The proposed framework successfully learns and transfers knowledge from multiple source tasks and data of different domains for accurate COVID-19 diagnosis.

• We propose a novel self-supervised learning method for medical images. Our method enables multi-scale representation learning for lung CT images and outperforms existing self-supervised learning methods.

• We not only show that networks with attention layers are more powerful through transfer learning, but also demonstrate that attention has higher transferability than convolution. To the best of our knowledge, this is the first work to compare the transferability of attention and convolution.
2 RELATED WORK
In this section, we introduce related work on transfer learning and the attention mechanism.
2.1 Transfer Learning

Transfer learning aims at transferring knowledge across different tasks [13], [15]. Generally, it learns knowledge from a source task and transfers that knowledge to a target task. In practice, it is usually difficult to collect sufficient training data for a target task, and training a model on insufficient data may result in unsatisfactory prediction results. Transfer learning is used to first train a model on a source task where the training data is sufficient. The pretrained model then serves as the starting point and is finetuned on the target task. For instance, in visual recognition, a model is usually trained on ImageNet, which contains millions of labeled training samples. After that, the trained weights are used as initial weights for downstream tasks such as semantic segmentation and object detection. Transfer learning has achieved success across various artificial intelligence domains, including natural language processing [21], computer vision [22], and biomedical image analysis [23], [24].

There exist several categorization criteria for transfer learning [13], [14]. Based on the machine learning approaches used in the source task, transfer learning can be categorized into supervised transfer learning, unsupervised transfer learning, semi-supervised transfer learning, etc. Generally, supervised and semi-supervised learning have been studied intensively, while unsupervised learning is a promising research area since labeling is usually expensive. Self-supervised learning is a type of unsupervised learning strategy that has recently gained increasing popularity [16], [17], [25]. It aims at supervised feature learning where the supervision is provided by the data itself, so the design of the supervised task is key for self-supervised learning. Earlier work on supervised tasks for images basically predicts positions or context for a local patch [16], [17]. Recent work [26], [27], [28] mainly performs two random sets of data augmentations on a pair of images and predicts whether the two images are the same or not. The contrastive loss is commonly used for these methods. The input for the contrastive loss contains a query vector $x_q$ from an image $X$, a key vector $x_{k+}$ from the same image $X$, and another $n$ key vectors from $n$ images that are different from $X$. The contrastive loss is then essentially the log-loss of an $(n+1)$-way softmax classifier that tries to classify $x_q$ as $x_{k+}$ rather than as any of the other $n$ key vectors.

2.2 Attention Mechanism

In this section, we describe the attention mechanism, which captures long-range dependencies from the input [29], [30]. Given an input tensor
$X \in \mathbb{R}^{h \times w \times c}$, the attention mechanism first performs a $1 \times 1$ convolution three times to obtain three tensors: the query $Q \in \mathbb{R}^{h \times w \times d_k}$, the key $K \in \mathbb{R}^{h \times w \times d_k}$, and the value $V \in \mathbb{R}^{h \times w \times d_v}$. These tensors are unfolded into three matrices along mode-3 [31], resulting in $Q \in \mathbb{R}^{d_k \times hw}$, $K \in \mathbb{R}^{d_k \times hw}$, and $V \in \mathbb{R}^{d_v \times hw}$, respectively. The intermediate output is then computed as
$$O = V \times \mathrm{Softmax}(K^T Q) \in \mathbb{R}^{d_v \times hw}, \qquad (1)$$
where $\mathrm{Softmax}(\cdot)$ is performed on columns such that every column sums to 1. Finally, the obtained matrix $O$ is converted back to a tensor $O \in \mathbb{R}^{h \times w \times d_v}$ as the final output of the attention mechanism.

Essentially, $K^T Q$ generates a matrix of size $hw \times hw$, which can be treated as $hw$ attention heatmaps. Each heatmap contains $hw$ attention weights, and $O$ is computed as a weighted sum of all vectors in $V$. The response at each position of the output $O$ thus depends on all positions of $V$, which is achieved by performing a linear transformation on the input $X$. As a result, long-range dependencies from the input are captured by attention. Notably, the output of attention is input dependent. Different from convolution, where weights are learnable parameters, attention weights are computed from the input.
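To make this computation concrete, the following is a minimal PyTorch sketch of such an attention layer. The paper does not provide code, so the framework choice and all names here are ours; the output projection and residual connection used in full non-local blocks [29] are omitted.

```python
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Minimal 2D self-attention layer (ATTN) implementing Eq. (1)."""

    def __init__(self, c, d_k, d_v):
        super().__init__()
        # the three 1x1 convolutions that produce query, key, and value
        self.query = nn.Conv2d(c, d_k, kernel_size=1)
        self.key = nn.Conv2d(c, d_k, kernel_size=1)
        self.value = nn.Conv2d(c, d_v, kernel_size=1)

    def forward(self, x):                               # x: (b, c, h, w)
        b, _, h, w = x.shape
        q = self.query(x).flatten(2)                    # (b, d_k, h*w)
        k = self.key(x).flatten(2)                      # (b, d_k, h*w)
        v = self.value(x).flatten(2)                    # (b, d_v, h*w)
        attn = F.softmax(k.transpose(1, 2) @ q, dim=1)  # (b, h*w, h*w); columns sum to 1
        o = v @ attn                                    # (b, d_v, h*w)
        return o.view(b, -1, h, w)                      # fold back to (b, d_v, h, w)
```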
Fig. 1. The proposed multi-stage transfer learning framework and network settings. a. Our proposed multi-stage transfer learning framework contains three source tasks: supervised transfer learning from natural images (STL-N), supervised transfer learning from medical images (STL-M), and self-supervised transfer learning from medical images (SSTL-M). The learned knowledge is then transferred to the target task of COVID-19 diagnosis based on CT images. There exists a large amount of labeled data for STL-N and STL-M. For SSTL-M, we carefully design a task based on the data to extract complex inherent patterns from medical images. b. Two groups of network settings. The top one represents a standard ResNet architecture. The bottom one illustrates a ResNet with self-attention layers (ATTNs) inserted in the res3, res4, and res5 blocks.

3 MULTI-STAGE ATTENTIVE TRANSFER LEARNING FRAMEWORK
In this section, we introduce our proposed multi-stage attentive transfer learning framework and network settings.

3.1 The Multi-Stage Transfer Learning Framework
Given a new CT image, our objective is to predict whether it is COVID-19 positive or negative based on the trained model. However, existing COVID-19 CT data is small-scale and insufficient to train a powerful model, which usually leads to poor prediction performance. It is natural to leverage transfer learning to obtain more powerful models and boost the performance of COVID-19 prediction. An illustration of our proposed multi-stage transfer learning framework is provided in Figure 1a. Specifically, we first conduct two supervised source tasks, namely supervised transfer learning from natural images (STL-N) and supervised transfer learning from medical images (STL-M), to learn models from large-scale labeled data for the target task. After that, we perform self-supervised transfer learning from medical images (SSTL-M) to learn complex inherent patterns from the used CT images.
3.2 Network Settings

For the networks used in our transfer learning framework, we categorize them into two groups to compare the transferability of self-attention layers (ATTNs) and convolutional layers. The first group contains standard CNNs such as ResNet-50 and ResNet-101. For the other group, we follow the settings in [29] and insert ATTNs into these backbone ResNets. An illustration of the network settings is provided in Figure 1b. Generally, there exist four residual blocks in the family of ResNets, namely res2, res3, res4, and res5 [32]. It is shown in [29] that networks with ATTNs inserted in res3 and res4 obtain the best performance. We use similar strategies and insert most ATTNs in res3 and res4. In addition, we propose to insert another ATTN in res5, before the global average pooling, to qualitatively demonstrate the effectiveness of ATTNs in transfer learning; a sketch of this insertion is given below. Both groups of networks are then applied to our multi-stage transfer learning framework to compare their transferability. Essentially, networks with and without ATTNs are pretrained on the same source tasks to compare which of ATTNs and convolutional layers can transfer knowledge from these source tasks more easily.
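As an illustration, here is one way the 5-ATTN variant could be assembled on top of a torchvision ResNet-50, reusing the SelfAttention2d module sketched in Section 2.2. The wrapper follows the non-local design [29] (projection back to the input width plus a residual connection), while the exact insertion points inside each stage are our assumption, since the paper only states that ATTNs are added at the end of the corresponding residual blocks.

```python
import torch.nn as nn
from torchvision.models import resnet50

class ATTNBlock(nn.Module):
    """Shape-preserving attention block: the attention output is projected back
    to the input channel width and added residually, as in non-local blocks [29]."""

    def __init__(self, c):
        super().__init__()
        self.attn = SelfAttention2d(c, d_k=c // 2, d_v=c // 2)  # from the Sec. 2.2 sketch
        self.proj = nn.Conv2d(c // 2, c, kernel_size=1)

    def forward(self, x):
        return x + self.proj(self.attn(x))

def resnet50_with_5_attns():
    """5-ATTN variant: two ATTNs in res3 (layer2), two in res4 (layer3), and
    one in res5 (layer4) before the global average pooling."""
    net = resnet50()
    net.layer2 = nn.Sequential(net.layer2, ATTNBlock(512), ATTNBlock(512))    # res3
    net.layer3 = nn.Sequential(net.layer3, ATTNBlock(1024), ATTNBlock(1024))  # res4
    net.layer4 = nn.Sequential(net.layer4, ATTNBlock(2048))                   # res5
    return net
```

Since every ATTNBlock preserves its input shape, the resulting network is a drop-in replacement for the baseline ResNet-50.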
4 SUPERVISED TRANSFER LEARNING

The labeled CT data for COVID-19 diagnosis is limited, and training networks directly on these CT images may result in poor performance for COVID-19 detection. However, there exist large-scale labeled datasets from other domains and diseases. We use supervised transfer learning to learn and transfer knowledge from these labeled data to facilitate CT-aided COVID-19 diagnosis.

We first perform STL-N on ImageNet, a large-scale collection of natural images and the most popular labeled dataset for model pretraining. Both groups of networks introduced in Section 3.2 are pretrained on ImageNet to learn knowledge from natural images. When the networks are applied to new tasks such as COVID-19 CT image prediction, the transferability of the two categories can be estimated by the performance improvement induced by transfer learning. Notably, networks with and without ATTNs are pretrained on ImageNet to compare the transferability of ATTNs and convolutional layers from natural images.

Obviously, natural images and medical images (such as CT images) follow different distributions. Pretraining on natural images enables models to learn common patterns shared by natural and medical images, but fails to capture the distinguishing patterns of medical images. Hence, we conduct STL-M to pretrain models on existing large-scale labeled medical images. Even though labeled CT images for COVID-19 diagnosis are scarce, there exist abundant sources of annotated medical images from other domains, such as chest X-ray (CXR) images for COVID-19 or CT images for regular pneumonia. Pretraining on these medical images enables models to learn inherent patterns in medical images and extract strong features for accurate COVID-19 diagnosis. Similar to STL-N, the two groups of networks with and without ATTNs are both pretrained on the labeled medical images to compare their transferability. Notably, by performing the two stages of supervised transfer learning, STL-N and STL-M, we compare the transferability of ATTNs and convolutional layers in two scenarios, where the source data follows a distribution that is different from or similar to that of the target data.
5 SELF-SUPERVISED TRANSFER LEARNING
We have proposed to transfer knowledge from labeled natural and medical images by performing the supervised transfer learning stages STL-N and STL-M. However, there still exists a divergence in distributions between the labeled source data and the target data. Notably, medical images from other modalities or diseases still exhibit a domain shift from the lung CT images for COVID-19. Hence, we perform self-supervised transfer learning as the last stage in our framework to obtain knowledge from the same CT images. Existing self-supervised learning methods achieve good performance on natural images but usually result in poor performance on medical images such as CT images. In this section, we introduce a novel self-supervised learning method for medical images and transfer the learned knowledge to COVID-19 detection. We denote this stage as self-supervised transfer learning from medical images (SSTL-M), as illustrated in Figure 1a. The objective of SSTL-M is to learn complex and inherent patterns from CT images by performing a carefully designed self-supervised task.
5.1 Overview

It is vital to design an appropriate source task to extract rich information from the CT images. Current self-supervised learning methods, including MoCo (v1 [26] and v2 [27]) and SimCLR [28], have achieved state-of-the-art performance on tasks for natural images, but usually result in poor performance on tasks related to medical images [33]. Essentially, these methods apply two sets of random data augmentations to the same image and then force the network to make a positive prediction; a positive pair thus contains two views of one image. Negative pairs are added, where a negative pair contains two different images. By doing this, inherent patterns of the input images are learned and can be transferred to other target tasks. However, the data augmentations used in these tasks are common techniques (such as rotation and flipping) and may not be strong enough to extract semantic patterns from medical images. In addition, medical images such as lung CT images are usually symmetric in structure, and applying data augmentation to the whole image may fail to extract distinguishing features from a specific region.

In this work, we propose a novel self-supervised learning method to learn multi-scale representations for lung CT images, as illustrated in Figure 2. Our proposed method is composed of two branches: image-scale representation learning for the whole lung structure, and region-scale learning for the different substructures of a lung. We use a contrastive self-supervised pipeline [26] for the former. For the latter, we design a task based on prior domain knowledge of the biological structure of the human lung. Specifically, humans have two lungs, a right lung and a left lung. The right lung has three lobes: the upper lobe, the middle lobe, and the lower lobe. The left lung is a little smaller and is composed of two lobes: the upper lobe and the lower lobe [34]. These lobes play different roles in biological processes, and for some diseases, infection of a specific lobe can serve as an important indicator for medical diagnosis [5], [34]. Hence, it is important to learn inherent patterns for each specific lobe and highlight the structural divergence among all lobes.
5.2 Image-Scale Learning

We first perform image-scale learning based on the whole CT image. A contrastive self-supervised learning framework is used, in which an input sample contains a CT image $X$ to form a positive pair and a set of different CT images $\{X_i \mid i = 1, \ldots, n\}$ to form negative pairs. Specifically, two sets of random augmentations are applied to the same image $X$, which results in two images $X_q$ and $X_k$. $X_q$ is passed to the query encoder to generate a query vector $x_q$, and $X_k$ is passed to the key encoder to generate a key vector $x_{k+}$. In addition, the images $\{X_i \mid i = 1, \ldots, n\}$ are fed to the same key encoder to obtain a set of key vectors $\{x^i_{k-} \mid i = 1, \ldots, n\}$. As a result, $(x_q, x_{k+})$ forms a positive pair and $(x_q, x^i_{k-})$ form $n$ negative pairs for the contrastive loss, which is expressed as
$$\mathcal{L}_{cons} = -\log \frac{\exp(x_q \cdot x_{k+}/\tau)}{\exp(x_q \cdot x_{k+}/\tau) + \sum_{i=1}^{n} \exp(x_q \cdot x^i_{k-}/\tau)}, \qquad (2)$$
where $\tau$ is a temperature parameter. Intuitively, $\mathcal{L}_{cons}$ is the log-loss of an $(n+1)$-way softmax classifier whose input is $x_q$ and whose correct label corresponds to $x_{k+}$.
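A minimal PyTorch rendering of Eq. (2) is given below, assuming, as in MoCo [26], that the embeddings are L2-normalized and that the same $n$ negative keys are shared across the batch; all function and variable names are ours.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x_q, x_k_pos, x_k_negs, tau=0.07):
    """Eq. (2) as an (n+1)-way softmax log-loss.
    x_q: (b, d) queries; x_k_pos: (b, d) positive keys;
    x_k_negs: (n, d) negative keys shared across the batch."""
    x_q = F.normalize(x_q, dim=1)          # assumption: embeddings are L2-normalized
    x_k_pos = F.normalize(x_k_pos, dim=1)
    x_k_negs = F.normalize(x_k_negs, dim=1)
    l_pos = (x_q * x_k_pos).sum(dim=1, keepdim=True)  # (b, 1) positive logits
    l_neg = x_q @ x_k_negs.t()                        # (b, n) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau   # (b, n+1)
    labels = torch.zeros(x_q.size(0), dtype=torch.long, device=x_q.device)
    return F.cross_entropy(logits, labels)            # positive key is class 0
```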
Fig. 2. The proposed self-supervised transfer learning method.
The method is composed of an image-scale learning branch and a region-scale learning branch. For simplicity, we only illustrate the scenario for a positive pair based on an input lung CT image X and omit the scenario for negative pairs. For image-scale learning, two sets of random data augmentations, denoted data aug. 1 and data aug. 2, are applied to X, resulting in two augmented images. For region-scale learning, five regions, X_ru, X_rm, X_rl, X_lu, and X_ll, are generated from the image X by the Generator operator defined in Equation 4. Generator is composed of three operators: the locate operator, the crop operator, and the resize operator. All intermediate outputs of the two branches are passed to the same network. We have two groups of networks, as introduced in Section 3.2 and Figure 1b; the switch OR means that we choose one of the two groups for each experiment, and we conduct experiments on both groups to compare the transferability of ATTNs and convolutional layers. In image-scale learning, the networks take the two augmented images and output two representations x_q and x_{k+}, from which the contrastive loss L_cons (defined in Equation 2) is computed. Region-scale learning is a multi-task learning problem that predicts the predefined class of each region. The region-aware loss L_ra (defined in Equation 5) for the image X is simply the sum of the cross-entropy losses of the five regions. The final loss L is the weighted sum of the contrastive loss L_cons of image-scale learning and L_ra of region-scale learning. Note again that we omit negative pairs for clarity; the version of L considering negative pairs is defined in Equation 6.

Generally, the query encoder and the key encoder share the same network architecture but differ in how their weights are updated. The weights of the query encoder are updated by back-propagation. Different from the query encoder, the key encoder takes as input a large set of images $\{X_i \mid i = 1, \ldots, n\}$, which usually makes it intractable to update its weights by back-propagation. MoCo [26], [27] adopts a momentum update strategy for the key encoder,
$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q, \qquad (3)$$
where $m$ is a momentum parameter that usually takes a large value such as 0.999, and $\theta_k$ and $\theta_q$ denote the weights of the key encoder and the query encoder, respectively. Notably, we also use the two groups of networks as the query and key encoders to compare the transferability of ATTNs and convolutional layers.
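The momentum update of Eq. (3) is only a few lines in PyTorch; this sketch (names ours) assumes the two encoders have identically ordered parameters:

```python
import torch

@torch.no_grad()
def momentum_update(key_encoder, query_encoder, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q, applied parameter-wise
    for p_k, p_q in zip(key_encoder.parameters(), query_encoder.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
```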
5.3 Region-Scale Learning

In region-scale learning, we propose to learn distinguishing patterns for the five lobes. To this end, we carefully design a task to predict the positions of all five lobes. Given an input lung CT image $X$, we first generate five regions that cover the five lobes as
$$X_{ru}, X_{rm}, X_{rl}, X_{lu}, X_{ll} = \mathrm{Generator}(X), \qquad (4)$$
where $X_{ru}$, $X_{rm}$, $X_{rl}$, $X_{lu}$, and $X_{ll}$ denote the regions covering the right lung upper lobe, the right lung middle lobe, the right lung lower lobe, the left lung upper lobe, and the left lung lower lobe, respectively, and Generator denotes a composition of operators that generates the above five regions from an input lung CT image.

Generator is composed of three operators: locate, crop, and resize. The locate operator contains three steps. First, we compute a location tuple $(x_i, y_i, w_i, h_i)$ for each region $i \in \{ru, rm, rl, lu, ll\}$ in Figure 2, where $x_i$ and $y_i$ denote the coordinates of the center point of region $i$, and $w_i$ and $h_i$ denote the width and height of region $i$, respectively. After that, for a given lung CT image, we generate a boundary map that separates the lung from its peripheral tissue; an image that contains only the lung can then be obtained based on the boundary map. To generate a boundary map, we slide a small 2D kernel pixel by pixel over the input lung image; if a window contains more than one distinct pixel value, the center pixel of that window is marked as a boundary pixel. At last, we compute the five location tuples $(x_i, y_i, w_i, h_i)$, $i \in \{ru, rm, rl, lu, ll\}$, for the obtained lung image. This is possible because the position and spatial size of each region of the human lung are roughly fixed. We then use the crop operator to crop out the five regions based on the location tuples. Finally, the resize operator is applied such that each region is resized to the size of the original lung CT image.

We employ either of the above two groups of networks to generate a region representation for each region. In this way, the networks are forced to understand and learn inherent knowledge for each lobe of a lung. After that, classifiers are used to predict the positions of all five lobes. This is essentially a multi-task learning problem: the input is the five regions, each of which covers a lobe, and the correct class of a region is its predefined index. Formally, given an input lung CT image $X$, the region-aware loss $\mathcal{L}_{ra}$ for region-scale learning is computed as
$$\mathcal{L}_{ra} = \sum_{i \in \{ru, rm, rl, lu, ll\}} \mathrm{CE}(x_i, y_i), \qquad (5)$$
where $\mathrm{CE}(x_i, y_i)$ is the cross-entropy loss for region $i \in \{ru, rm, rl, lu, ll\}$, $x_i \in \mathbb{R}^5$ is the output of the classifier, and $y_i \in \mathbb{R}^5$ is a one-hot vector indicating the correct class of region $i$. Notably, the parameters are shared across all five input regions. The proposed region-scale learning task has two advantages. First, it learns specific patterns for each lobe, thereby extracting subtle information at the lobe level. Second, by treating each lobe as a center lobe, the other four lobes can be viewed as its context; the center lobe is thus highlighted, and doing well on this task requires a distinguishing representation of each lobe.

5.4 The Final Loss

As introduced in Section 5.2, an input sample for our self-supervised learning framework contains a CT image $X$ to form a positive pair and a set of different CT images $\{X_i \mid i = 1, \ldots, n\}$ to form negative pairs. Formally, the final loss $\mathcal{L}$ for this input sample is computed as the weighted sum of the contrastive loss of image-scale learning and the proposed region-aware loss of region-scale learning:
$$\mathcal{L} = \alpha_1 \mathcal{L}_{cons} + \alpha_2 \Big(\mathcal{L}_{ra} + \sum_{i=1}^{n} \mathcal{L}^i_{ra}\Big), \qquad (6)$$
where $\alpha_1$ and $\alpha_2$ are hyper-parameters, $\mathcal{L}_{cons}$ is defined in Equation 2, $\mathcal{L}_{ra}$ is the region-aware loss for the image $X$, and $\mathcal{L}^i_{ra}$ is the region-aware loss for an image $X_i$ from the image set $\{X_i \mid i = 1, \ldots, n\}$.

During back-propagation, the network is forced to learn representations at both the lung level and the lobe level, distinguishing the functionality of each lobe and capturing sufficient semantic information to achieve better representations for lung CT images. Note that we use the same network in the two branches; that is, the weights of the feature extractors are shared between both branches (see the sketch below).
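The following PyTorch sketch assembles these pieces. The paper does not publish the fixed location tuples, so the relative coordinates below are placeholders, as are all names; the Generator here implements only the crop-and-resize steps from given tuples, omitting the boundary-map computation.

```python
import torch
import torch.nn.functional as F

# Placeholder (cx, cy, w, h) tuples in coordinates relative to the image size;
# the paper computes these from the roughly fixed anatomy of the lung.
REGIONS = {
    "ru": (0.30, 0.22, 0.42, 0.38),  # right upper lobe
    "rm": (0.30, 0.52, 0.42, 0.26),  # right middle lobe
    "rl": (0.30, 0.78, 0.42, 0.36),  # right lower lobe
    "lu": (0.70, 0.30, 0.42, 0.52),  # left upper lobe
    "ll": (0.70, 0.76, 0.42, 0.40),  # left lower lobe
}

def generator(image):
    """Eq. (4) sketch: locate (fixed tuples), crop, and resize each lobe region
    back to the original spatial size. image: (c, H, W) tensor."""
    _, height, width = image.shape
    crops = {}
    for name, (cx, cy, w, h) in REGIONS.items():
        x0, x1 = int((cx - w / 2) * width), int((cx + w / 2) * width)
        y0, y1 = int((cy - h / 2) * height), int((cy + h / 2) * height)
        crop = image[:, y0:y1, x0:x1].unsqueeze(0)
        crops[name] = F.interpolate(crop, size=(height, width),
                                    mode="bilinear", align_corners=False)[0]
    return crops

CLASS_IDX = {name: i for i, name in enumerate(REGIONS)}  # predefined region labels

def region_aware_loss(encoder, classifier, image):
    """Eq. (5): sum of cross-entropy losses over the five lobe regions, with
    the encoder and classifier shared across all regions."""
    loss = image.new_zeros(())
    for name, crop in generator(image).items():
        logits = classifier(encoder(crop.unsqueeze(0)))  # (1, 5) class scores
        target = torch.tensor([CLASS_IDX[name]], device=logits.device)
        loss = loss + F.cross_entropy(logits, target)
    return loss

def total_loss(l_cons, l_ra_query, l_ra_negatives, alpha1, alpha2):
    """Eq. (6): weighted sum of the contrastive and region-aware losses."""
    return alpha1 * l_cons + alpha2 * (l_ra_query + sum(l_ra_negatives))
```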
Similar to supervised transfer learning, we apply each of the two groups of networks as the backbone network in the self-supervised learning framework and then transfer the learned knowledge to the target task. The transferability of ATTNs and convolutional layers under self-supervised learning is then explored.

5.5 Transferability Estimation

We use two approaches to estimate the transferability of learned representations. We denote a network that is not pretrained on a source task as $N_n$, and the same network that is pretrained on a source task as $N_p$. First, we directly finetune $N_n$ and $N_p$ on the target task and record the prediction performance. The metrics include accuracy, F1-score, and AUC. For any metric, the performance of $N_n$ and $N_p$ is denoted as $P_n$ and $P_p$, respectively. The divergence between $P_n$ and $P_p$ on each metric can then be used to estimate the transferability of the representations learned by the employed network on the source task.

In addition to evaluating transferability by running experiments on the target task, we employ LEEP [35] to directly estimate transferability based on the pretrained model and statistics of the target dataset. LEEP can only be applied to supervised transfer learning, where the source data has labels. Assume the labels of the source data are in a label set $\mathcal{Z}$, input instances of the target data are in the domain set $\mathcal{X} \subseteq \mathbb{R}^N$, and labels of the target data are in a label set $\mathcal{Y}$. Given a pretrained network $N$ and a target dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $n$ is the number of samples in the target set and $x_i \in \mathcal{X}$ is obtained by flattening an image into a vector, the LEEP score $L$ is computed as
$$L = \frac{1}{n} \sum_{i=1}^{n} \log \Big( \sum_{z \in \mathcal{Z}} \hat{P}(y_i \mid z)\, N(x_i)_z \Big), \qquad (7)$$
where $\hat{P}(y_i \mid z)$ is the empirical conditional distribution of the target label $y_i$ given the source label $z$, and $N(x_i)_z$ is the probability of source label $z$ in the output of the network when taking $x_i$ as input. The LEEP scores based on $N_n$ and $N_p$ with the same target dataset are denoted $L_n$ and $L_p$, respectively, and the divergence between $L_n$ and $L_p$ can be used to compare the transferability of $N_n$ and $N_p$ from the same source task.
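For concreteness, here is a small NumPy sketch of Eq. (7); it assumes the source-model softmax outputs over $\mathcal{Z}$ have already been collected for every target sample, and all names are ours.

```python
import numpy as np

def leep_score(source_probs, target_labels, num_target_classes):
    """Eq. (7) sketch. source_probs: (n, |Z|) source-model softmax outputs
    N(x_i) on the target inputs; target_labels: (n,) integer labels y_i."""
    n, num_source = source_probs.shape
    # empirical joint distribution P(y, z) over target and source labels
    joint = np.zeros((num_target_classes, num_source))
    for y, p in zip(target_labels, source_probs):
        joint[y] += p
    joint /= n
    cond = joint / np.maximum(joint.sum(axis=0), 1e-12)  # conditional P(y | z)
    # expected log-likelihood of the predictor sum_z P(y|z) N(x)_z
    eep = source_probs @ cond.T                          # (n, |Y|)
    return float(np.mean(np.log(eep[np.arange(n), target_labels] + 1e-12)))
```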
6 EXPERIMENTAL STUDIES

6.1 Datasets

We use two datasets to pretrain models and perform supervised transfer learning. First, we use ImageNet, which contains millions of natural images, to pretrain the models. Even though natural images follow a different distribution than CT images, pretraining on ImageNet enables models to learn general patterns that are shared by natural and medical images. After that, we use COVIDx [7], a collection of images from the medical domain, to extract similar patterns shared by medical images. We use COVID19-CT to perform self-supervised transfer learning and CT-based diagnosis of COVID-19. To the best of our knowledge, this dataset is the largest publicly available CT dataset for COVID-19.
COVIDx
The COVIDx dataset is a publicly available labeled dataset containing chest X-ray (CXR) images for COVID-19 detection. The dataset contains 16,898 images in total, among which 573 images are COVID-19 cases, 5,559 images are regular (non-COVID-19) pneumonia cases, and the remaining 8,066 are normal cases. The dataset is generated from five sources: the COVID-19 Image Data Collection [36], the COVID-19 Chest X-ray Dataset Initiative [37], the ActualMed COVID-19 Chest X-ray Dataset Initiative [38], the RSNA Pneumonia Detection Challenge dataset [39], and the COVID-19 Radiography Database [40]. Generally, CXR imaging is a low-cost, first-look technique compared with CT scanning. CXR images usually have lower quality than CT images but can be obtained much faster. Because they are easier to obtain, existing CXR datasets are much larger than the CT datasets for COVID-19 diagnosis. Nevertheless, the CXR and CT imaging techniques have much in common: CXR imaging uses a small amount of radiation that passes through the chest to take an image, and CT scanning is essentially a more detailed type of CXR that produces more comprehensive views of the chest. In this sense, the obtained CXR and CT images follow similar distributions and may share common patterns for image representation learning. Learning and transferring knowledge to the target dataset from a source dataset that is larger but follows a similar distribution is a vital step.
COVID19-CT
The COVID19-CT dataset is the largest publicly available CT dataset for COVID-19 diagnosis. It contains 349 images labeled COVID-19 positive and 397 images labeled COVID-19 negative. The dataset is originally split into training, validation, and test sets: there are 191 positive and 234 negative images in the training set, 60 positive and 58 negative images in the validation set, and 98 positive and 105 negative images in the test set. The images are of different spatial sizes, varying from 153 to 1853 pixels.
6.2 Experimental Setup

We employ two backbone networks, ResNet-50 and ResNet-101, in which purely convolutional layers are used to extract features from images. We then investigate two scenarios, adding 1 ATTN or 5 ATTNs to each backbone network. For the former, one ATTN is inserted in a single residual block; for the latter, 2 ATTNs are inserted in each of the res3 and res4 blocks, and another ATTN is inserted in the res5 block. This results in a total of six networks: each of ResNet-50 and ResNet-101 has three variants, namely the baseline, the baseline with 1 ATTN, and the baseline with 5 ATTNs. Notably, all ATTNs are added at the end of the corresponding residual blocks.

We first perform STL-N and pretrain all six networks on the ImageNet ILSVRC 2012 image classification dataset [41], which contains 1.2 million natural images for training, 50 thousand for validation, and another 50 thousand for testing, covering 1000 classes in total. We adopt the same data augmentation pipeline as in [32]: each image is rescaled, and a fixed-size patch is randomly cropped as a training sample; a horizontal flip is performed for each cropped patch with a probability of 0.5. We employ dropout [42] with a rate of 0.8 and a weight decay of 1e-4 to avoid over-fitting. To optimize the models, we employ the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and train the models for 90 epochs. The initial learning rate is set to 0.1 and decays by 0.1 every 30 epochs. We use 8 TITAN Xp GPUs and a batch size of 512 for training.

We then perform STL-M to pretrain the models on the COVIDx dataset. The data augmentation scheme is composed of several techniques, including translation, rotation, horizontal flip, and intensity shift; each technique is performed randomly with a probability of 0.5. SGD is used with a learning rate of 2e-4. Dropout with a keep rate of 0.5 is adopted to avoid over-fitting. The number of epochs is set to 30 and the batch size to 64 during this pretraining procedure.
TABLE 1
Top-1 and Top-5 accuracies (%) and performance improvements of all six networks on ImageNet. There are two columns for each metric: the value column provides the original results, and the improv. column shows the improvements over the baselines.

Model            Top-1            Top-5
                 Value  Improv.   Value  Improv.
R-50  Baseline   77.2   0         93.3   0
R-50  1 ATTN     77.9   0.7       93.9   0.6
R-50  5 ATTNs    78.3   1.1       94.1   0.8
R-101 Baseline   78.3   0         94.0   0
R-101 1 ATTN     79.2   0.9       94.4   0.4
R-101 5 ATTNs    79.5   1.2       94.6   0.6

During the SSTL-M, we follow the same data augmentation scheme as in [26]. Specifically, all images as well as the cropped regions are rescaled, and a patch of fixed spatial size is randomly cropped from each image; color jittering, horizontal flip, and grayscale conversion are then performed randomly, each with a probability of 0.5 (a sketch of this pipeline appears below). The temperature parameter $\tau$ in Equation 2 is set to 0.07, and the weights $\alpha_1$ and $\alpha_2$ in Equation 6 are both set to values between 0 and 1. SGD is employed as our optimizer with a weight decay of 1e-4 and a momentum of 0.9. Training is performed for 200 epochs. The initial learning rate is 0.3 and decays by 0.1 at the 120th and 160th epochs. The batch size is set to 256 on 8 TITAN Xp GPUs. When comparing our method with MoCo and SimCLR, we use the same hyperparameters as in their papers [26], [28].

After finishing the pretraining procedures, we use each pretrained model as a starting point and finetune it on the target dataset COVID19-CT for COVID-19 prediction. We use the same optimizer and hyperparameters as in the SSTL-M, and the data augmentation scheme during training is also the same. During inference, the center-cropped patch of the same fixed size is used for each image, and the other augmentation techniques are unchanged.
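As a concrete rendering of the SSTL-M augmentation just described, the following torchvision sketch could be used; the rescale size, crop size, and jitter strengths are placeholders, since the exact values are not recoverable from the source.

```python
from torchvision import transforms

CROP_SIZE = 224  # placeholder; the exact crop size is not given here

sstl_m_augmentation = transforms.Compose([
    transforms.Resize(256),            # placeholder rescale size
    transforms.RandomCrop(CROP_SIZE),  # random fixed-size patch
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.5),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomGrayscale(p=0.5),
    transforms.ToTensor(),
])
```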
6.3 Overall Performance

We first investigate how powerful the networks with ATTNs are on ImageNet. For convenience, we denote ResNet-50 as R-50 and ResNet-101 as R-101. Each of R-50 and R-101 has three variants, which we denote as baseline, 1 ATTN, and 5 ATTNs. We report the top-1 and top-5 accuracies on the validation set of ImageNet, as the test data has no labels, and also compute the performance improvements over the baselines. The results are reported in Table 1. We can observe that adding ATTNs improves performance by less than or around 1% on the ImageNet validation set.

We then conduct experiments to examine the overall performance of all the models on the target dataset. These models are all pretrained on the three source tasks STL-N, STL-M, and SSTL-M in order. The metrics include accuracy, F1-score, and AUC; for each metric, we also compute the performance improvements over the baselines. The results are reported in Table 2. We can see that adding ATTNs to the networks and then performing transfer learning significantly improves prediction performance. In terms of accuracy, adding 1 ATTN for transfer learning leads to an average improvement of 4.5%, and adding 5 ATTNs leads to an average improvement of 5.3% over the baseline ResNets. Similar results can be found for F1-score and AUC. These results indicate the effectiveness of integrating ATTNs into our proposed multi-stage transfer learning framework.

TABLE 2
Overall performance for COVID-19 CT image prediction of all six models pretrained with STL-N, STL-M, and SSTL-M, in terms of accuracy (%), F1-score (%), and AUC (%). There are two columns for each metric: the value column provides the original results, and the improv. column shows the improvements over the baselines.
Model            Accuracy         F1-Score         AUC
                 Value  Improv.   Value  Improv.   Value  Improv.
R-50  Baseline   88.2   0         88.7   0         90.3   0
R-50  1 ATTN     93.1   4.9       92.9   4.2       92.9   2.6
R-50  5 ATTNs    93.9   5.7       94.7   6.0       96.8   6.5
R-101 Baseline   89.3   0         88.8   0         90.7   0
R-101 1 ATTN     93.4   4.1       93.2   4.4       93.7   3.0
R-101 5 ATTNs    94.2   4.9       95.3   6.5       97.8   7.1
TABLE 3
Comparison among different self-supervised learning methods in terms of accuracy (%), F1-score (%), and AUC (%). All networks are first pretrained on the same supervised source tasks STL-N and STL-M, and then pretrained with different SSTL-M methods on the same CT data.
Model                      Accuracy  F1-Score  AUC
R-50  Baseline w MoCo      87.3      88.6      90.0
R-50  Baseline w SimCLR    87.7      87.9      90.1
R-50  Baseline w Ours      88.2      88.7      90.3
R-50  1 ATTN w MoCo        92.5      92.3      92.6
R-50  1 ATTN w SimCLR      92.6      91.7      92.1
R-50  1 ATTN w Ours        93.1      92.9      92.9
R-50  5 ATTNs w MoCo       93.6      94.3      96.1
R-50  5 ATTNs w SimCLR     93.5      94.0      95.9
R-50  5 ATTNs w Ours       93.9      94.7      96.8
R-101 Baseline w MoCo      88.6      88.3      90.6
R-101 Baseline w SimCLR    89.1      88.3      90.5
R-101 Baseline w Ours      89.3      88.8      90.7
R-101 1 ATTN w MoCo        92.5      93.0      93.7
R-101 1 ATTN w SimCLR      92.6      92.7      93.0
R-101 1 ATTN w Ours        93.4      93.2      93.7
R-101 5 ATTNs w MoCo       93.7      95.0      96.6
R-101 5 ATTNs w SimCLR     93.1      94.5      96.6
R-101 5 ATTNs w Ours       94.2      95.3      97.8
6.4 Comparison with Other Self-Supervised Learning Methods

We next investigate the effectiveness of our proposed self-supervised learning method. For both the R-50 and R-101 groups, we first conduct the same supervised transfer learning tasks STL-N and STL-M. After that, we apply the trained models to two state-of-the-art self-supervised learning frameworks, MoCo [26] and SimCLR [28], as well as to our method, denoted X w MoCo, X w SimCLR, and X w Ours, respectively, where X denotes the baseline, 1 ATTN, or 5 ATTNs. The experimental results are reported in Table 3. We can observe that models with our method consistently outperform models with MoCo or SimCLR on all three metrics. Specifically, considering all three networks together, our method outperforms MoCo and SimCLR by average margins of 0.6% and 0.5% in accuracy for the R-50 group, and by average margins of 0.7% and 0.6% in accuracy for the R-101 group. Similar results hold for F1-score and AUC. This indicates that by using a multi-scale learning framework and considering distinguishing patterns from local lobes, our method successfully extracts useful inherent patterns from the CT data, thereby improving the performance of CT-based COVID-19 diagnosis.
6.5 Benefits of Attention in Transfer Learning

We design experiments to explore whether ATTNs bring benefits to transfer learning. For both the R-50 and R-101 groups, we conduct two sets of experiments. First, we directly optimize the networks on the target dataset COVID19-CT without transfer learning, denoted X w/o TL, where X denotes the baseline, 1 ATTN, or 5 ATTNs. Second, we apply all the networks to our multi-stage transfer learning framework, performing all three stages of pretraining and then finetuning the pretrained models on the target dataset; we denote these models X w TL. Performance is evaluated on the test set of the COVID19-CT dataset. We then compute the improvements obtained by transfer learning for the two groups of networks; the results are reported in Table 4. We can observe that models with the proposed transfer learning framework consistently improve prediction performance on the target dataset, indicating that our multi-stage transfer learning framework successfully extracts important patterns shared between the source images and the target images.

More importantly, the table shows that networks with ATTNs achieve much larger performance improvements through transfer learning than the baseline ResNets. In terms of accuracy on ResNet-50, the improvement for the baseline is 4.5%, while the improvements for adding 1 ATTN and adding 5 ATTNs are 8.2% and 9.2%, respectively; the benefits induced by attention are thus 2.7% and 3.7%. Similar results can be observed for ResNet-101, where the benefits induced by attention are 2.5% and 2.6%. Consistent conclusions that attention helps transfer learning can be drawn for F1-score and AUC. For the two settings with ATTNs, the benefits brought by attention are 5.9% and 7.0% in F1-score for ResNet-50, and 4.7% and 5.9% for ResNet-101. In terms of AUC, attention helps with margins of 2.5% and 4.2% for ResNet-50, and 3.1% and 5.7% for ResNet-101. These results indicate that, compared with convolution, attention transfers more useful knowledge from the source tasks to the target task in transfer learning. By adding ATTNs, the network is capable of learning important common patterns of images in the pretraining stage, thereby leading to significant performance improvements on the target task.
TABLE 4
Results for COVID-19 CT image prediction of all six networks with and without transfer learning, in terms of accuracy (%), F1-score (%), and AUC (%). There are two columns for each metric: the value column provides the original results, and the improv. column shows the performance improvements of networks with transfer learning over networks without transfer learning.
Model                   Accuracy         F1-Score         AUC
                        Value  Improv.   Value  Improv.   Value  Improv.
R-50  Baseline w/o TL   83.7   0         84.1   0         84.6   0
R-50  Baseline w TL     88.2   4.5       88.7   4.6       90.3   5.7
R-50  1 ATTN w/o TL     84.9   0         82.4   0         85.7   0
R-50  1 ATTN w TL       93.1   8.2       92.9   10.5      92.9   7.2
R-50  5 ATTNs w/o TL    84.7   0         83.1   0         86.9   0
R-50  5 ATTNs w TL      93.9   9.2       94.7   11.6      96.8   9.9
R-101 Baseline w/o TL   83.8   0         83.9   0         86.2   0
R-101 Baseline w TL     89.3   5.5       88.8   4.9       90.7   4.5
R-101 1 ATTN w/o TL     85.4   0         83.6   0         86.1   0
R-101 1 ATTN w TL       93.4   8.0       93.2   9.6       93.7   7.6
R-101 5 ATTNs w/o TL    86.1   0         84.5   0         87.6   0
R-101 5 ATTNs w TL      94.2   8.1       95.3   10.8      97.8   10.2
TABLE 5
Results for COVID-19 CT image prediction of all six networks without transfer learning, with STL-N, and with STL-M, in terms of accuracy (%), F1-score (%), and AUC (%). There are two columns for each metric: the value column provides the original results, and the improv. column shows the performance improvements of networks with each of STL-N and STL-M over networks without transfer learning.
Model                    Accuracy         F1-Score         AUC
                         Value  Improv.   Value  Improv.   Value  Improv.
R-50  Baseline w/o TL    83.7   0         84.1   0         84.6   0
R-50  Baseline w STL-N   85.4   1.7       85.9   1.8       86.4   1.8
R-50  Baseline w STL-M   86.0   2.3       86.8   2.7       87.7   3.1
R-50  1 ATTN w/o TL      84.9   0         82.4   0         85.7   0
R-50  1 ATTN w STL-N     88.4   3.5       88.8   6.4       89.2   3.5
R-50  1 ATTN w STL-M     89.2   4.3       89.8   7.4       92.1   6.6
R-50  5 ATTNs w/o TL     84.7   0         83.1   0         86.9   0
R-50  5 ATTNs w STL-N    89.2   4.5       89.6   6.5       90.2   3.3
R-50  5 ATTNs w STL-M    88.7   4.0       88.8   5.7       91.2   4.3
R-101 Baseline w/o TL    83.8   0         83.9   0         86.2   0
R-101 Baseline w STL-N   85.8   2.0       86.9   3.0       88.1   1.9
R-101 Baseline w STL-M   86.1   2.3       86.8   2.9       88.5   2.3
R-101 1 ATTN w/o TL      85.4   0         83.6   0         86.1   0
R-101 1 ATTN w STL-N     88.3   2.9       89.2   5.6       90.1   4.0
R-101 1 ATTN w STL-M     89.5   3.1       89.7   6.1       91.8   5.7
R-101 5 ATTNs w/o TL     86.1   0         84.5   0         87.6   0
R-101 5 ATTNs w STL-N    89.4   3.3       89.6   5.1       92.2   4.6
R-101 5 ATTNs w STL-M    90.0   3.9       90.2   5.7       92.4   4.8
TABLE 6
LEEP scores for all six networks without transfer learning, with STL-N, and with both STL-N and STL-M.
Model                      R-50     R-101
Baseline w/o TL            -0.918   -0.913
Baseline w STL-N           -0.908   -0.909
Baseline w STL-N w STL-M   -0.894   -0.899
1 ATTN w/o TL              -0.917   -0.904
1 ATTN w STL-N             -0.902   -0.887
1 ATTN w STL-N w STL-M     -0.882   -0.872
5 ATTNs w/o TL             -0.915   -0.907
5 ATTNs w STL-N            -0.897   -0.880
5 ATTNs w STL-N w STL-M    -0.875   -0.865
6.6 Ablation Studies on Supervised Transfer Learning

In this section, we conduct an ablation study to examine the benefits attention brings to the two supervised transfer learning tasks STL-N and STL-M. Instead of performing the two tasks sequentially, we conduct pretraining on natural images and on medical images separately to explore how attention benefits transfer learning from different domains. Specifically, we denote the models as X w/o TL, X w STL-N, and X w STL-M, respectively, where X denotes the baseline, 1 ATTN, or 5 ATTNs. We compute the improvements for each stage of transfer learning for both the R-50 and R-101 groups, and the results are reported in Table 5. We can observe that attention consistently benefits both STL-N and STL-M. Specifically, on ResNet-50, adding 1 ATTN benefits STL-N by margins of 1.8% in accuracy, 4.6% in F1-score, and 1.7% in AUC, and benefits STL-M by margins of 2.0% in accuracy, 4.7% in F1-score, and 3.5% in AUC.
Fig. 3. LEEP scores for both the R-50 and R-101 groups without transfer learning, with STL-N, and with both STL-N and STL-M.
On ResNet-50, adding 5 ATTNs benefits STL-N by margins of 2.8% in accuracy, 4.7% in F1-score, and 1.5% in AUC, and benefits STL-M by margins of 1.7% in accuracy, 3.0% in F1-score, and 1.3% in AUC. Similar results can be computed for ResNet-101: attention consistently benefits both STL-N and STL-M on all three metrics. These results indicate that attention helps transfer much more useful knowledge than convolution in supervised transfer learning, regardless of whether the source data follows a distribution different from or similar to that of the target data.
6.7 LEEP Score Analysis

In this section, we compute LEEP scores to examine the functionality of attention in transfer learning. Instead of optimizing parameters and making predictions on the target dataset, we directly compute LEEP scores based on the source models and statistics of the target data. As LEEP scores can only be computed for supervised transfer learning, we conduct experiments on STL-N and STL-M. We add STL-N and STL-M in order, which results in the models X w/o TL, X w STL-N, and X w STL-N w STL-M, respectively, where X denotes the baseline, 1 ATTN, or 5 ATTNs. The results are reported in Table 6 and shown in Figure 3. The curve for each network setting in Figure 3 is composed of two line segments, and the slope of each segment reflects the improvement in the LEEP score obtained by adding the corresponding stage of transfer learning. We can observe from the figure that both adding 1 ATTN and adding 5 ATTNs achieve larger improvements than the baseline ResNets for both stages. This again demonstrates that attention helps transfer learning regardless of the divergence between the distributions of the source datasets and the target dataset.
6.8 Qualitative Results

In addition to the quantitative results in the sections above, we provide qualitative results to show the capability of our proposed multi-stage attentive transfer learning framework to detect important regions in CT images for COVID-19 diagnosis. We use ResNet-50 as the baseline and add 5 ATTNs to it. Both networks are pretrained on the three source tasks and finetuned on the target task. We use Grad-CAM [43] to visualize the feature maps of the last convolutional layer right before the global average pooling. For the network with 5 ATTNs, we visualize the last ATTN, inserted in res5: we simply average the attention score maps of all pixels to obtain the final attention map (a sketch of this step is given below). As shown in Figure 4, after performing transfer learning, an ATTN can successfully detect the regions that are severely infected by the virus, whereas the convolutional layers fail in some cases, highlighting uninfected regions. This again demonstrates the effectiveness of using attention in our proposed multi-stage transfer learning framework.
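A minimal sketch of that averaging step follows, under our reading of Eq. (1) that column j of Softmax(K^T Q) holds the attention map of output position j; all names are ours.

```python
def attention_heatmap(attn, h, w):
    """Average the h*w per-pixel attention maps into one (h, w) heatmap.
    attn: the (h*w, h*w) matrix Softmax(K^T Q) from the last ATTN."""
    heat = attn.mean(dim=1).reshape(h, w)  # average over all pixels' maps
    # normalize to [0, 1] for display as a heatmap overlay
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-12)
```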
7 CONCLUSION

We propose a unified transfer learning framework and examine how attention facilitates transfer learning for improving COVID-19 diagnosis. We first design a multi-stage transfer learning framework consisting of supervised transfer learning from natural images (STL-N), supervised transfer learning from medical images (STL-M), and self-supervised transfer learning from medical images (SSTL-M). This framework allows transferring knowledge from data of different domains, including large-scale labeled natural images, large-scale labeled medical images, and the same CT images as the target task. As existing self-supervised learning methods usually generate poor results on tasks related to medical images, we propose a novel self-supervised learning method based on an understanding of the substructures of the human lung. The method is integrated as the last stage of our transfer learning framework, and self-supervised transfer learning is performed to reuse the complex patterns learned from the same CT images. Experimental results show that our method outperforms several state-of-the-art baseline methods. For the networks used in our transfer learning framework, we integrate self-attention layers into ResNets and apply them to the proposed framework. Experimental results demonstrate that attention has higher transferability than convolution. To the best of our knowledge, this is the first work to compare the transferability of attention and convolution when the source tasks and data are the same in transfer learning.
Fig. 4. Comparison of visualization results between a convolutional layer and an ATTN. The lowest pixel values are shown in blue and the highest in red. Column 1 shows the original CT images, all COVID-19 positive. Columns 2 and 3 show visualization results for the last convolutional layer in ResNet-50: column 2 shows the generated heatmaps and column 3 the original images overlaid with the heatmaps. Columns 4 and 5 show visualization results for the last ATTN in ResNet-50 with 5 ATTNs: column 4 shows the generated attention heatmaps and column 5 the original images overlaid with the attention heatmaps.

ACKNOWLEDGMENTS
This work was supported by National Science Foundation grants DBI-1922969 and IIS-1908220.

REFERENCES

[1] Y.-W. Tang, J. E. Schmitz, D. H. Persing, and C. W. Stratton, "Laboratory diagnosis of COVID-19: current issues and challenges," Journal of Clinical Microbiology, vol. 58, no. 6, 2020.
[2] B. Udugama, P. Kadhiresan, H. N. Kozlowski, A. Malekjahani, M. Osborne, V. Y. Li, H. Chen, S. Mubareka, J. B. Gubbay, and W. C. Chan, "Diagnosing COVID-19: the disease and tools for detection," ACS Nano, vol. 14, no. 4, pp. 3822–3835, 2020.
[3] X. Xie, Z. Zhong, W. Zhao, C. Zheng, F. Wang, and J. Liu, "Chest CT for typical 2019-nCoV pneumonia: relationship to negative RT-PCR testing," Radiology, p. 200343, 2020.
[4] F. Zhou, T. Yu, R. Du, G. Fan, Y. Liu, Z. Liu, J. Xiang, Y. Wang, B. Song, X. Gu et al., "Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study," The Lancet, 2020.
[5] A. Bernheim, X. Mei, M. Huang, Y. Yang, Z. A. Fayad, N. Zhang, K. Diao, B. Lin, X. Zhu, K. Li et al., "Chest CT findings in coronavirus disease-19 (COVID-19): relationship to duration of infection," Radiology, p. 200463, 2020.
[6] Y. Li and L. Xia, "Coronavirus disease 2019 (COVID-19): role of chest CT in diagnosis and management," American Journal of Roentgenology, vol. 214, no. 6, pp. 1280–1286, 2020.
[7] L. Wang and A. Wong, "COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images," arXiv preprint arXiv:2003.09871, 2020.
[8] O. Gozes, M. Frid-Adar, H. Greenspan, P. D. Browning, H. Zhang, W. Ji, A. Bernheim, and E. Siegel, "Rapid AI development cycle for the coronavirus (COVID-19) pandemic: initial results for automated detection & patient monitoring using deep learning CT image analysis," arXiv preprint arXiv:2003.05037, 2020.
[9] A. Alimadadi, S. Aryal, I. Manandhar, P. B. Munroe, B. Joe, and X. Cheng, "Artificial intelligence and machine learning to fight COVID-19," 2020.
[10] L. Wynants, B. Van Calster, M. M. Bonten, G. S. Collins, T. P. Debray, M. De Vos, M. C. Haller, G. Heinze, K. G. Moons, R. D. Riley et al., "Prediction models for diagnosis and prognosis of COVID-19 infection: systematic review and critical appraisal," BMJ, vol. 369, 2020.
[11] S. Wang, B. Kang, J. Ma, X. Zeng, M. Xiao, J. Guo, M. Cai, J. Yang, Y. Li, X. Meng et al., "A deep learning algorithm using CT images to screen for corona virus disease (COVID-19)," medRxiv, 2020.
[12] X. He, X. Yang, S. Zhang, J. Zhao, Y. Zhang, E. Xing, and P. Xie, "Sample-efficient deep learning for COVID-19 diagnosis based on CT scans," medRxiv, 2020.
[13] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[14] K. Weiss, T. M. Khoshgoftaar, and D. Wang, "A survey of transfer learning," Journal of Big Data, vol. 3, no. 1, p. 9, 2016.
[15] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[16] C. Doersch, A. Gupta, and A. A. Efros, "Unsupervised visual representation learning by context prediction," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
[17] S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," arXiv preprint arXiv:1803.07728, 2018.
[18] H. Lee, S. J. Hwang, and J. Shin, "Rethinking data augmentation: self-supervision and self-distillation," arXiv preprint arXiv:1910.05872, 2019.
[19] A. Kolesnikov, X. Zhai, and L. Beyer, "Revisiting self-supervised visual representation learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1920–1929.
[20] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer, "S4L: self-supervised semi-supervised learning," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1476–1485.
[21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[22] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[23] M. Raghu, C. Zhang, J. Kleinberg, and S. Bengio, "Transfusion: understanding transfer learning for medical imaging," in Advances in Neural Information Processing Systems, 2019, pp. 3347–3357.
[24] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, "Dermatologist-level classification of skin cancer with deep neural networks," Nature, vol. 542, no. 7639, pp. 115–118, 2017.
[25] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox, "Discriminative unsupervised feature learning with exemplar convolutional neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1734–1747, 2015.
[26] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[27] X. Chen, H. Fan, R. Girshick, and K. He, "Improved baselines with momentum contrastive learning," arXiv preprint arXiv:2003.04297, 2020.
[28] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," arXiv preprint arXiv:2002.05709, 2020.
[29] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 6000–6010.
[31] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
[32] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[33] L. Chen, P. Bentley, K. Mori, K. Misawa, M. Fujiwara, and D. Rueckert, "Self-supervised learning for medical image analysis using image context restoration," Medical Image Analysis, vol. 58, p. 101539, 2019.
[34] R. Drake, A. W. Vogl, and A. W. Mitchell, Gray's Anatomy for Students E-Book, 3rd ed. Elsevier Health Sciences, 2009, pp. 167–174.
[35] C. V. Nguyen, T. Hassner, C. Archambeau, and M. Seeger, "LEEP: a new measure to evaluate transferability of learned representations," arXiv preprint arXiv:2002.12462, 2020.
[36] J. P. Cohen, P. Morrison, and L. Dao, "COVID-19 image data collection," arXiv preprint arXiv:2003.11597, 2020.
[37] I. Lütkebohle, "BWorld robot control software," http://aiweb.techfak.uni-bielefeld.de/content/bworld-robot-control-software, 2008.
[38] A. Chung, "ActualMed COVID-19 chest x-ray data initiative," https://github.com/agchung/Actualmed-COVID-chestxray-dataset, 2020.
[39] Radiological Society of North America, "RSNA pneumonia detection challenge," 2019.
[40] Radiological Society of North America, "COVID-19 radiography database," 2019.
[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: a large-scale hierarchical image database," in CVPR09, 2009.
[42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[43] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2017.