A novel multiple instance learning framework for COVID-19 severity assessment via data augmentation and self-supervised learning
Zekun Li†, Wei Zhao†, Feng Shi, Lei Qi, Xingzhi Xie, Ying Wei, Zhongxiang Ding, Yang Gao, Shangjie Wu, Jun Liu⋆, Yinghuan Shi⋆, Dinggang Shen⋆
Abstract—How to quickly and accurately assess the severity level of COVID-19 is an essential problem, when millions of people are suffering from the pandemic around the world. Currently, chest CT is regarded as a popular and informative imaging tool for COVID-19 diagnosis. However, we observe that there are two issues, weak annotation and insufficient data, that may obstruct automatic COVID-19 severity assessment with CT images. To address these challenges, we propose a novel three-component method, i.e., 1) a deep multiple instance learning component with instance-level attention to jointly classify the bag and also weigh the instances, 2) a bag-level data augmentation component to generate virtual bags by reorganizing high-confidence instances, and 3) a self-supervised pretext component to aid the learning process. We have systematically evaluated our method on the CT images of 229 COVID-19 cases, including 50 severe and 179 non-severe cases. Our method achieves an average accuracy of 95.8%, with 93.6% sensitivity and 96.4% specificity, outperforming previous works.
Index Terms—COVID-19, Chest CT, Multiple instance learning, Data augmentation, Self-supervised learning
I. INTRODUCTION

Recently, a new coronavirus, named by the World Health Organization (WHO) as COVID-19, has been rapidly spreading worldwide. As of 23 October 2020, there had been more than forty million confirmed COVID-19 cases globally. In view of its emergency and severity, WHO has declared the COVID-19 outbreak a pandemic.

Due to the rapid spread, long incubation period and severe respiratory symptoms of COVID-19, clinical systems around the world are under tremendous pressure in multiple aspects. In this context, how to quickly and accurately diagnose COVID-19 and assess its severity has become an important prerequisite for clinical treatment.

At present, for the diagnosis of COVID-19, traditional reverse transcription polymerase chain reaction (RT-PCR) is widely employed worldwide as the gold standard. However, due to its high false negative rate, repeated testing might be needed to achieve an accurate diagnosis of COVID-19.

† Zekun Li and Wei Zhao are the co-first authors.
⋆ Asterisk denotes co-corresponding authors.
Zekun Li, Yinghuan Shi, and Yang Gao are with the State Key Laboratory for Novel Software Technology, Nanjing University, China. They are also with the National Institute of Healthcare Data Science, Nanjing University, Nanjing, 210046, China.
Lei Qi is with the School of Computer Science and Artificial Intelligence, Southeast University, Nanjing, 210018, China.
Ying Wei, Feng Shi, and Dinggang Shen are with the Department of Research and Development, Shanghai United Imaging Intelligence Co., Ltd., Shanghai, 201807, China. Dinggang Shen is also with the School of Biomedical Engineering, ShanghaiTech University, Shanghai, China, and the Department of Artificial Intelligence, Korea University, Seoul 02841, Republic of Korea.
Wei Zhao, Xingzhi Xie, and Jun Liu are with the Department of Radiology, The Second Xiangya Hospital, Central South University, Hunan, 410011, China. Jun Liu is also with the Department of Radiology Quality Control Center, Changsha, 410011, China.
Zhongxiang Ding is with the Department of Radiology, Hangzhou First People's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China.
Shangjie Wu is with the Department of Pulmonary and Critical Care Medicine, The Second Xiangya Hospital, Central South University, Changsha, 410011, China.
Fig. 1. Examples of chest CT images with severe infection (left) and non-severe infection (right) of COVID-19. The yellow arrows indicate representative infection regions.

Chest computed tomography (CT) has been an imaging tool frequently used for diagnosing other diseases, and because it is fast and easy to operate, it has become a widely used diagnostic tool for COVID-19 in China. However, manual diagnosis from CT images is not only laborious but also prone to the influence of subjective factors, e.g., fatigue and carelessness.

CT images are rather informative, containing a number of discriminative imaging biomarkers, so they are useful in assessing the severity of COVID-19. Based on the observation of Tang et al. [1], the CT images of severe cases usually have larger volumes of consolidation regions and ground-glass opacity regions than those of non-severe cases. Therefore, several computer-aided methods [1], [2], [3] have been recently proposed. However, we notice that current studies have neglected two important issues.

• Weak Annotation. Usually, it is rather time-consuming for physicians to precisely delineate the infection regions manually. Therefore, only the image-level annotation (i.e., label) indicating the class of a case (i.e., severe or non-severe) is available, which can be regarded as a weakly-supervised learning setting.
This inspires us to develop a model that works under a weakly-supervised setting (merely with image-level annotation).

• Insufficient Data. According to current studies, it remains difficult to collect and label a large set of COVID-19 data. Besides, given the prevalence rate, the number of non-severe cases is much larger than that of severe cases, bringing about a significant class imbalance issue, which raises the challenge of learning a stable model while avoiding overfitting.
This motivates us to seek ways to ease the imbalance between the different classes and make the most of the insufficient data.
Therefore, aiming to achieve fast and accurate COVID-19 severity assessment with CT images, we propose a novel weakly supervised learning method via multiple-instance data augmentation and self-supervised learning. Our method is designed to solve the problems of weak annotation and insufficient data in a unified framework.

On one side, the concept of weak annotation comes from the weakly supervised learning paradigm, whose goal is to develop learning models under three types of supervision: inexact supervision, incomplete supervision, and inaccurate supervision. In the severity assessment task, weak annotation is one type of inexact supervision, where only the image-level label is provided by physicians whereas the region-level label is unavailable. Formally, we model this weak annotation task under the multiple instance setting: we divide a CT image into several patches (i.e., unannotated instances), making it a bag consisting of multiple instances. Following the multiple instance learning setting, an image indicating severe or non-severe infection is considered a positive or negative bag, respectively.

On the other side, the problem of insufficient data greatly challenges the robustness and stability of a learned model. We notice that in current studies on COVID-19, sample sizes are often small. To tackle this challenge, we are motivated by two major aspects: 1) to complement the original data by generating additional "virtual" data using a data augmentation technique, and 2) to leverage the patch-level information to benefit the learning process, since the quantity of patches is much larger than that of images. In particular, 1) we develop a simple yet effective multiple-instance data augmentation method to generate virtual bags that enrich the original data and guide a stable training process; 2) along with the bag-level labels for supervised training, there is also abundant unsupervised information that we can mine from the plentiful unannotated instances, so we apply self-supervised learning, in the form of patch location tasks, to exploit characteristic information of the patches.

In this paper, we propose a method consisting of three major components (see Fig. 2). Specifically, 1) we build a deep multiple instance learning (MIL) model with instance-level attention to jointly classify the bag and weigh the instances in each bag, so as to find the positive key instances (i.e., instances with high confidence for the class "severe"); 2) we develop an instance-level augmentation technique to generate virtual positive bags by sampling from these key instances, which helps to ease the problem of class imbalance and strengthen the learning process; 3) we introduce an auxiliary self-supervised loss to render the extracted features more discriminative, by including characteristic information of the patches. With the extra information extracted from the unannotated instances, the performance of the MIL model can be further improved. These three components are logically integrated in a unified framework: 1) the three components are alternately updated to benefit each other in an end-to-end manner; 2) data augmentation alleviates the label imbalance issue in training the MIL model, while the trained MIL model guides data augmentation to produce more meaningful bags; 3) the self-supervised pretext task helps the MIL model become location-aware, which is ignored in the traditional MIL setting.
In our evaluation, we extensively validate the efficacy of our three components. In the following sections, we first introduce related literature (Section II), then present the technical details of the proposed method (Section III), and finally report the qualitative and quantitative experimental results (Section IV) before drawing a conclusion (Section VI).

II. RELATED WORK
We would like to review related work on four aspects: 1) COVID-19 severity assessment, 2) multiple instance learning, 3) data augmentation, and 4) self-supervised learning.
A. COVID-19 Severity Assessment
Along with diagnosis, severity assessment is another important factor for treatment planning. So far, there have been a few relevant attempts at predicting the severity of COVID-19 with CT images. Tang et al. [1] proposed a random forest (RF)-based model to assess the severity of COVID-19 based on 63 quantitative features, e.g., the infection volume/ratio of the whole lung and the volume of ground-glass opacity (GGO) regions. Besides, the importance of each quantitative feature is calculated from the RF model. Yang et al. [2] proposed to evaluate the value of the chest computed tomography severity score (CT-SS) in differentiating clinical forms of COVID-19. The CT-SS was defined by summing up individual scores from 20 lung regions; these scores were assigned for each region based on parenchymal opacification. Li et al. [4] proposed a similar method depending on the total severity score (TSS), which was reached by summing the five lobe scores. Shan et al. [3] developed a deep learning based segmentation system, namely "VB-Net", to automatically segment and quantify infection regions in CT scans of COVID-19 patients. To accelerate the manual delineation of CT scans for training, a human-involved-model-iterations (HIMI) strategy was adopted to assist radiologists in refining the automatic annotation of each training case. All of the above existing methods depend on manually defined scores or manual segmentations given by experts. Chao et al. [5] and Chassagnon et al. [6] further extended the problem to patient outcome prediction, combining both imaging and non-imaging (clinical and biological) data.

Fig. 2. An overview of our method. The CT images are cropped into patches, which are then packed into MIL bags. In the k-th epoch of the training process, the data for supervised training consists of real bags (i.e., training CT images) and virtual bags generated in the (k-1)-th epoch. Besides, real bags are also used for the auxiliary self-supervised learning task (while virtual bags are not). After the training stage, the trained MIL model takes the testing CT images (also modeled as MIL bags) as input to predict their labels (i.e., severe or non-severe).
B. Multiple Instance Learning
Multiple instance learning is one paradigm of weakly supervised learning, which belongs to "inexact supervision" [7]. Existing MIL methods for image classification can be roughly divided into two categories, i.e., instance-level methods and bag-level methods. Instance-level methods assume that all instances in a bag contribute equally to the prediction of the bag label [8]; under this assumption, the prediction of the bag-level label is conducted by aggregating (e.g., voting or pooling) the predictions of the instance-level labels in each bag. However, this type of approach [9], [10] suffers from a major limitation: the label of each instance is usually predicted separately, without considering other instances (even those in the same bag), rendering the label easily disrupted by incorrect instance-level predictions. In contrast, by considering the class information of all instances, bag-level methods usually achieve higher accuracy and better time efficiency, as shown in [8]. In this sense, the promising properties of MIL fit our weakly supervised image classification task quite well, as indicated by many current studies working in this direction [11], [12], [10].

As is known, it is usually infeasible to obtain pixel-level annotations for medical images, because this demands enormous time and perfect accuracy from clinical experts. Therefore, there has been great interest in applying MIL methods to medical imaging [13]. Quellec et al. [14] attempted to divide the medical image into small-sized patches that can be considered as a bag with a single label. Sirinukunwattana et al. [15] further extended this application to computational histopathology, where patches correspond to cells to indicate malignant changes. Ilse et al. [16] proposed an attention-based method that aims to incorporate interpretability into the MIL approach while increasing its flexibility at the same time. Han et al. [17] innovatively incorporated an automated deep 3D instance generator into the attention-based MIL method for accurate screening of COVID-19. There are also other MIL approaches used in medical image analysis tasks, e.g., Gaussian processes [18] and a two-stage approach with neural networks and an expectation maximization (EM) algorithm to determine the classes of the instances [19].
C. Data Augmentation
Data augmentation is a data-space solution to the problem of limited data. To increase the amount and the diversity of data, there has been great interest in data augmentation recently, since many applications, e.g., medical image analysis, might not always have sufficient labeled training data. So far, a number of techniques have been developed to enhance the size and quality of training sets to build better deep learning models.

One type of data augmentation method is designed by performing basic image processing operations. For example, Taylor and Nitschke [20] provided a comparative study of the effectiveness of geometric transformations (e.g., flipping, rotating, and cropping) and that of color space transformations (e.g., color jittering, edge enhancement and PCA). Zhang et al. [21] proposed mixup, which trains the learning model on virtual examples constructed by linear interpolation of two random examples from the training set. Zhong et al. [22] developed random erasing, inspired by the mechanisms of dropout regularization. Similarly, DeVries and Taylor [23] proposed a method named Cutout regularization.

Note that there are also several attempts at learning-based data augmentation. Frid-Adar et al. [24] tested the effectiveness of using DCGANs to generate liver lesion medical images. Applying meta-learning concepts from neural architecture search (NAS) to data augmentation, several methods such as Neural Augmentation [25], Smart Augmentation [26], and AutoAugment [27] were further developed in recent literature.
Unfortunately, under the MIL setting, the labels of instances are not available during training, which means previous data augmentation methods cannot be directly borrowed. In order to relieve the data scarcity problem in COVID-19 severity assessment, we have to develop a novel augmentation technique that works for our MIL setting.
D. Self-supervised Learning
In many recent studies of unsupervised learning, a common approach is to define an annotation-free pretext task that provides a surrogate supervision signal for feature learning. By solving such pretext tasks, the trained model is expected to extract high-level semantic features that are useful for other downstream tasks. So far, a large number of pretext tasks for self-supervised learning have been designed. For example, Larsson et al. [28] and Zhang et al. [29] predicted the colors of images after removing their original color information. Doersch et al. [30] and Noroozi and Favaro [31] predicted the relative positions of different image patches in the same image. Gidaris et al. [32] predicted the random rotation applied to an image. Pathak et al. [33] predicted the missing central part of an image by building the prediction model with context information. He et al. [34] presented a contrastive learning method, called Momentum Contrast (MoCo), which outperformed its supervised pre-training counterpart in several vision tasks. Zhou et al. [35] proposed a set of models trained by a robust, scalable self-supervised learning framework, called Models Genesis, for medical image analysis tasks. What these works have in common is that they are all utilized to obtain well pre-trained networks on unannotated images.

Unlike the works above aiming at pre-trained models, Chen et al. [36] aimed to improve the performance of generative adversarial networks by leveraging the supervision of a rotation prediction task. Similarly, Gidaris et al. [37] used self-supervision as an auxiliary task and brought significant improvements to few-shot learning.
Remark. As discussed above, both the weak supervision and the data scarcity in COVID-19 severity assessment pose considerable challenges to our work. Our intuition thus includes the following two steps: 1) we find that the weakly supervised prediction of COVID-19 severity naturally agrees with the setting of MIL; 2) under the MIL setting, we try to solve the challenge of data scarcity by considering the relation between bag and instance. In a nutshell, confronting the double challenges of weak supervision and data scarcity, our solution for COVID-19 severity assessment is novel to the best of our knowledge.

III. METHOD
In this section, we first analyze the problem of COVID-19 severity assessment and provide an overview of our method, then present and discuss in detail the three major components, i.e., bag-level prediction, instance-level augmentation and the auxiliary self-supervised loss.
A. Problem Analysis
In this part, we first analyze the main challenges in COVID-19 severity assessment caused by weak supervision and data scarcity, and then provide the corresponding countermeasures.

For the annotation of CT images, image-level labels come directly from the diagnosis results of the corresponding patients, guided by the
Diagnosis and Treatment Protocol for COVID-19 (Trial Version 7) from the National Health Commission of the People's Republic of China. In this sense, the infection regions in the CT images of COVID-19 patients remain unknown even when an image has already been labeled. This poses a great challenge to the utilization of traditional supervised learning models. To address this challenge, we introduce the multiple instance learning (MIL) framework, a typical weakly-supervised learning paradigm, to deal with image-level prediction without knowing any region-level annotation.

In the MIL setting, each image can be regarded as a bag, and the regions inside this image are regarded as instances in this bag. In our case, chest CT images are processed as bags. To be more specific, each CT image consists of hundreds of slices that show different cross sections of the lung regions. Moreover, each slice can be further cropped into several non-overlapping patches, and the patches from the same CT image make up a bag. Note that the label of a bag (i.e., the bag-level label) depends on the information provided by physicians on the original CT images. These notions are illustrated in Fig. 3. In this work, the MIL bags with the label "severe" are called positive bags, whereas those with the label "non-severe" are called negative bags. It is also worth mentioning that the instances (i.e., patches) related to the infection regions in positive bags come without any annotated information during training.
Fig. 3. A brief illustration of the notions of CT slices, instances (patches) and bags. A CT image contains CT slices, and the slices are cropped into non-overlapping patches, which are considered as instances. The patches from the same CT image make up a MIL bag, with a bag-level label "severe" or "non-severe".
Another challenge is data scarcity, which usually makes a stable learning model hard to realize, especially when the number of severe CT images is very limited. To address this issue, we first adopt a data augmentation technique during training by generating virtual samples. Though data augmentation has demonstrated its effectiveness in several other learning tasks [38], [39], previous augmentation techniques cannot be directly applied to our MIL setting, because each bag consists of several instances and the instance-level labels are unknown. In our work, we notice that, compared to other instances, some instances usually play a much more important role in determining the label of a bag; we name these key instances. This observation drives us to develop a novel instance-level augmentation technique that generates "virtual" bags by gathering these key instances.

In addition to the data augmentation technique, we also leverage self-supervised learning, a popular unsupervised paradigm in recent studies. We notice that, although the supervised information of bag-level labels is limited due to data scarcity, there is a wealth of unsupervised information hidden in the unannotated instances. Thanks to self-supervision, the patch-wise location can be further exploited so that the network can extract stronger instance features. As a result, the bag-level features of positive and negative samples can be further differentiated, improving the performance of the MIL model.

In summary, to address the aforementioned challenges, our method has three major components: 1) bag-level prediction, 2) instance-level augmentation and 3) an auxiliary loss based on self-supervised learning. In particular, we build a deep multiple instance learning model with an attention mechanism to predict the bag-level label (i.e., severe or non-severe). Instances with higher attention weights, which are expected to have a larger influence on the bag-level label, are regarded as key instances, while the other instances are considered regular instances. According to the learned attention, we randomly sample key instances and regular instances to generate virtual bags that enrich the current training samples. In the training stage, we incorporate self-supervision into the MIL model by adding an auxiliary self-supervised loss.
B. Bag-level Prediction
The first component of our method, bag-level prediction, aims to predict the label of a CT image as either severe or non-severe. A chest CT image consisting of hundreds of slices can be divided into smaller patches, making the CT image itself a bag with a single label (severe or non-severe) and the patches its instances. In this work, $Y = 1$ indicates that the image is labeled as a severe case, while $Y = 0$ indicates that it is non-severe. Since these patches are non-overlapping, we assume there is no dependency or sequential relationship among the instances within a bag. Furthermore, $K$ denotes the number of instances in a bag, and we assume $K$ can vary from bag to bag.

The framework of this component is shown in Fig. 4. We use a convolutional neural network [40] to extract the feature embedding $h_k$ of each instance $x_k$, where $h_k \in \mathbb{R}^M$ and $M$ is the dimensionality of the instance features. Suppose $H = \{h_1, \dots, h_K\}$ is a bag of $K$ embeddings; the embedding of bag $X$ is calculated by the attention-based MIL pooling proposed by Ilse et al. [16]:

$$z = \sum_{k=1}^{K} a_k h_k, \quad (1)$$

where

$$a_k = \frac{\exp\left(w^\top \tanh\left(V h_k^\top\right)\right)}{\sum_{j=1}^{K} \exp\left(w^\top \tanh\left(V h_j^\top\right)\right)}, \quad (2)$$

and $w \in \mathbb{R}^L$ and $V \in \mathbb{R}^{L \times M}$ are the parameters to learn. The element-wise hyperbolic tangent non-linearity $\tanh(\cdot)$ is utilized to include both negative and positive values for proper gradient flow. Finally, we use a fully connected layer to decide the label according to the bag feature. The categorical cross-entropy loss used to optimize the MIL model is defined as follows:

$$\mathcal{L}_{\mathrm{MIL}} = -\frac{1}{N_b} \sum_{i=1}^{N_b} \sum_{c=0}^{1} \delta(Y_i = c)\, \log P(Y_i = c), \quad (3)$$

where $N_b$ is the number of bags, $\delta(Y_i = c)$ is the indicator function (i.e., $\delta(Y_i = c) = 1$ when $Y_i = c$ and 0 otherwise), and $P(Y_i = c)$ denotes the predicted probability.

It is worth mentioning that large weights correspond to key instances with relatively high confidence, which are most relevant to the bag-level label. This means that the MIL model can not only provide the final diagnostic result, but can also help physicians identify possible severe infection regions, which is of great clinical significance for COVID-19 severity assessment.

With the trained MIL model, we are able to automatically assess the severity of the disease with CT images. In the testing stage, we divide the CT image into non-overlapping patches in the same way as in training. Along with the assessment, the model also outputs the attention weights, which help find the regions relevant to severe infection.
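To make the pooling of Eqs. (1)-(2) concrete, the following is a minimal PyTorch sketch of the attention-based MIL pooling; the module name, tensor shapes and the usage example are our own choices, not taken from any released code of this paper.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Attention-based MIL pooling of Ilse et al. [16], Eqs. (1)-(2)."""
    def __init__(self, feat_dim=512, attn_dim=128):
        super().__init__()
        self.V = nn.Linear(feat_dim, attn_dim, bias=False)  # V in Eq. (2)
        self.w = nn.Linear(attn_dim, 1, bias=False)         # w in Eq. (2)

    def forward(self, H):
        # H: (K, M) instance embeddings of one bag
        scores = self.w(torch.tanh(self.V(H)))  # (K, 1) unnormalized attention
        a = torch.softmax(scores, dim=0)        # Eq. (2): weights sum to 1 over K
        z = (a * H).sum(dim=0)                  # Eq. (1): weighted bag embedding
        return z, a.squeeze(-1)

# Usage sketch: a bag of 217 instances with 512-dimensional features.
H = torch.randn(217, 512)
z, attn = AttentionMILPooling()(H)  # bag feature (512,) and weights (217,)
```

A downstream fc-1 layer with a Sigmoid on z then yields the bag-level probability used in Eq. (3).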
C. Instance-level Augmentation

For severity assessment, scarce data significantly deteriorates the overall performance. What is more, class imbalance, i.e., the number of non-severe cases being much larger than that of severe cases, is also harmful for learning a stable model.

To confront these problems, we propose a novel data augmentation technique that generates "virtual bags" from the original bags to enrich the training process. Rethinking the attention mechanism in multiple instance learning, we notice that a patch with a higher attention response usually has a stronger relation to its current class label. In positive bags, there are some instances with significantly higher weights, as shown in Fig. 5. We call them (positive) key instances, and the other instances regular instances. However, according to the experiments, in negative bags all instances have similarly low weights, roughly confirming the rule in traditional MIL. Therefore, we only take positive bags and the corresponding key instances in them into consideration.

Fig. 4. The framework of the deep MIL model. Firstly, the instance features are extracted. Secondly, the attention weights of the instance features are determined by the network. Then, the MIL pooling layer combines the instance features to generate a bag feature. Finally, the bag feature is mapped by a fully connected (FC) layer to decide the label.

Fig. 5. An example of key instances in a positive bag. It indicates that the patches are likely to be related to the severe infection regions. Note that we rescaled the attention weights of the patches in the same slice using $a'_k = a_k / \sum_i a_i$.

In the training process, when a positive bag is correctly predicted to be severe, the instances with the top-$\lfloor \alpha K \rfloor$ highest weights in this bag are appended to the list of key instances, where $K$ is the number of instances in the bag. In the meantime, other instances are treated as regular instances. Considering time cost and memory usage, we only append the instances with the top-$\lfloor \gamma K \rfloor$ lowest weights to the list of regular instances. In our observation, we notice that different positive bags may have different proportions of positive key instances. Therefore, in order to prevent regular instances from being mistaken for key instances, the parameter $\alpha$ should be set to a relatively small value, acting as a strict rule for judging key instances. In practice, the value is set no larger than each positive bag's actual proportion. For the generation of virtual positive bags, assuming the average number of instances per bag is $\bar{K}$, we randomly sample $\lfloor \alpha \bar{K} \rfloor$ key instances and $\lfloor (1 - \alpha) \bar{K} \rfloor$ regular instances, and then pack them into a virtual bag. Among the parameters above, $K$ and $\bar{K}$ are easy to obtain, while $\alpha$ and $\gamma$ need to be set before training. The process of generating virtual bags is visualized in Fig. 6.

With the help of the attention mechanism, this process can be plugged into the training phase. In each training epoch, the model generates virtual bags based on the attention weights, and these virtual bags are further used as part of the training data in the next epoch.
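The virtual-bag generation described above can be summarized by the following sketch. It assumes the attention weights of a correctly classified positive bag are already available; the function names, the plain-list bookkeeping and the shuffling step are our own simplifications, not the paper's exact implementation.

```python
import math
import random

def update_instance_lists(bag, weights, key_list, regular_list,
                          alpha=0.025, gamma=0.2):
    """For a correctly predicted positive bag, store the top-floor(alpha*K)
    weighted instances as key instances and the bottom-floor(gamma*K)
    weighted instances as regular instances."""
    K = len(bag)
    order = sorted(range(K), key=lambda i: weights[i], reverse=True)
    key_list.extend(bag[i] for i in order[:math.floor(alpha * K)])
    regular_list.extend(bag[i] for i in order[K - math.floor(gamma * K):])

def make_virtual_bag(key_list, regular_list, K_mean, alpha=0.025):
    """Pack randomly sampled key and regular instances into one
    virtual positive (severe) bag; assumes both lists are large enough."""
    n_key = math.floor(alpha * K_mean)
    n_reg = math.floor((1 - alpha) * K_mean)
    patches = random.sample(key_list, n_key) + random.sample(regular_list, n_reg)
    random.shuffle(patches)
    return patches, 1  # label 1: severe
```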
D. Auxiliary Self-supervised Loss

We also incorporate self-supervision into the MIL model by adding an auxiliary self-supervised loss. By optimizing the self-supervised loss, the model learns to exploit more information from the unannotated instances.

In this work, we consider the following two pretext tasks for the self-supervised loss: 1) to predict the relative location of two patches from the same slice, which is a seminal task in self-supervised learning as originally proposed in [30]; and 2) to predict the absolute location of a single patch, which is similar to the former task but more suitable under the MIL problem setting.
Fig. 6. A sketch map of generating virtual bags. For positive bags, instances with high weights are appended to the list of key instances, while instances with low weights are appended to the list of regular instances. Virtual bags are generated by randomly sampling key instances and regular instances.

Fig. 7. A CT slice is divided into 12 patches. For the relative patch location task, we are able to create 16 pairs of patches: $(a_1, a_2), \dots, (a_8, a_9)$ and $(b_1, b_2), \dots, (b_8, b_9)$. For the absolute patch location task, we directly predict each patch's location among $l_1, \dots, l_{12}$.

SSL Task 1: Relative Location Prediction.
Predicting the relative location of a pair of patches from the same image is a seminal task in self-supervised learning. More specifically, given a pair of patches, the task is to predict the location of the second patch with regard to the first one, among eight possible positions, e.g., "on the bottom left" or "on the top right". This task is particularly suitable for the MIL setting, because there are many pairs of patches in a bag whose relative location we can predict. To be more specific, for one slice, we are able to create 16 pairs of patches: $(a_1, a_2), \dots, (a_8, a_9)$ and $(b_1, b_2), \dots, (b_8, b_9)$, as shown in Fig. 7. We extract the representation of each patch and then generate pair features by concatenation. We train a fully connected network $G^r_\phi(\cdot, \cdot)$ with parameters $\phi$ to predict the relative patch location of each pair.

The self-supervised loss of this relative location prediction task is defined as:

$$\mathcal{L}^r_{\mathrm{SSL}} = \frac{1}{N_s} \sum_{s=1}^{N_s} \sum_{i=1}^{16} \mathcal{L}_{\mathrm{CE}}\left(G^r_\phi(p_i), \mathrm{rloc}(p_i)\right), \quad (4)$$

where $N_s$ is the number of slices and $p_i$ stands for a pair of patches, specifically $p_1, \dots, p_8$ for $(a_1, a_2), \dots, (a_8, a_9)$ and $p_9, \dots, p_{16}$ for $(b_1, b_2), \dots, (b_8, b_9)$; there are 16 pairs in total. $\mathcal{L}_{\mathrm{CE}}(\cdot, \cdot)$ is the cross-entropy loss function and $\mathrm{rloc}(p_i)$ is the ground truth of the relative patch location.
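A hedged sketch of the loss in Eq. (4) follows. Since the exact pair layout of Fig. 7 is not reproduced here, the enumeration of the 16 pairs is left to the caller as a list of patch-index pairs, and the two-layer head size is our assumption.

```python
import torch
import torch.nn as nn

rel_head = nn.Sequential(            # G^r_phi on concatenated pair features
    nn.Linear(2 * 512, 256), nn.ReLU(), nn.Linear(256, 8))
cross_entropy = nn.CrossEntropyLoss()

def relative_location_loss(feats, pairs, rloc):
    """feats: (12, 512) patch features of one slice.
    pairs: 16 (i, j) patch-index pairs as in Fig. 7.
    rloc: ground-truth relative positions, each in [0, 8)."""
    first = feats[[i for i, _ in pairs]]    # (16, 512) first patch of each pair
    second = feats[[j for _, j in pairs]]   # (16, 512) second patch of each pair
    logits = rel_head(torch.cat([first, second], dim=1))
    return cross_entropy(logits, torch.tensor(rloc))
```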
SSL Task 2: Absolute Location Prediction.

Under the MIL setting of COVID-19 severity assessment, the CT slices containing two lungs are quite similar. This makes us realize that the task can be designed in a more straightforward way, i.e., to predict the absolute location of a single patch in an entire CT slice, or more specifically, to predict the location of a patch among the 12 possible positions $l_1, \dots, l_{12}$ shown in Fig. 7. We also train a fully-connected network $G^a_\phi(\cdot)$ with parameters $\phi$ to predict the absolute patch location.

The self-supervised loss of this absolute location prediction task is defined as:

$$\mathcal{L}^a_{\mathrm{SSL}} = \frac{1}{N_s} \sum_{s=1}^{N_s} \sum_{i=1}^{12} \mathcal{L}_{\mathrm{CE}}\left(G^a_\phi(x_i), \mathrm{aloc}(x_i)\right), \quad (5)$$

where $N_s$ is the number of slices, $x_i$ stands for the patch in position $l_i$, and $\mathrm{aloc}(x_i) = l_i$ is the ground truth of the absolute patch location. There are 12 patches per slice.

Formally, let $\mathcal{L}_{\mathrm{SSL}}$ be either kind of self-supervised loss; the total loss of the training stage can be written as

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{MIL}} + \mu \mathcal{L}_{\mathrm{SSL}}, \quad (6)$$

where $\mathcal{L}_{\mathrm{MIL}}$ stands for the loss of the MIL model (i.e., the bag-level prediction task), as defined previously. The positive hyperparameter $\mu$ controls the weight of the self-supervised loss. By optimizing the self-supervised loss, the instance feature extractor can learn more informative features, further improving the performance of the MIL model. Note that only real bags are used for the self-supervised learning tasks.
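Below is a sketch of the absolute-location pretext loss (Eq. (5)) and its combination with the MIL loss (Eq. (6)). The two-FC-layer head follows the implementation details given later, but its hidden width is our assumption; µ = 0.3 as in our experiments.

```python
import torch
import torch.nn as nn

abs_head = nn.Sequential(            # G^a_phi on a single patch feature
    nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 12))
cross_entropy = nn.CrossEntropyLoss()

def absolute_location_loss(slice_feats):
    """slice_feats: (12, 512) features of one slice's patches, ordered so
    that the i-th row sits at absolute location l_i (Fig. 7)."""
    targets = torch.arange(12)                   # ground truth l_1, ..., l_12
    return cross_entropy(abs_head(slice_feats), targets)

def total_loss(mil_loss, ssl_loss, mu=0.3):
    """Eq. (6): weighted sum of the MIL loss and the self-supervised loss."""
    return mil_loss + mu * ssl_loss
```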
IV. EXPERIMENT

In this section, we report the quantitative and qualitative results of our method. First, we present the details of our COVID-19 dataset and the data preprocessing. Then, we discuss the experimental setup of our evaluation and provide the implementation details of our method. After that, we conduct ablation studies and compare our method with existing methods, while also analyzing the interpretability of our method. Finally, we discuss some of the choices we have made in network structures, parameter values and pretext tasks.
A. Dataset
We collected a dataset containing the chest CT images of 229 patients with confirmed COVID-19. For each patient, the severity of COVID-19 is determined according to the
Diagnosis and Treatment Protocol for COVID-19 (Trial Version 7) issued by the National Health Commission of the People's Republic of China. The severity includes four types: mild, common, severe and critical. We categorize patients into two groups: a non-severe group (mild and common) and a severe group (severe and critical), because the number of patients of the mild or critical type is extremely small. Among these patients, 179 are non-severe cases while 50 are severe cases. The categories of the patients are used as the image-level labels of their corresponding CT images. Moreover, the gender distribution of the patients is shown in Table I and their age distribution is shown in Fig. 8.
TABLE I
THE GENDER DISTRIBUTION IN OUR STUDY

Gender   Severe   Non-severe   Total
Male     32       92           124
Female   18       87           105
Total    50       179          229

Fig. 8. The age distribution of the patients in our dataset.
All chest CT images were acquired at the Second Xiangya Hospital of Central South University and its collaborating hospitals with different types of CT scanners, including Anke (ANATOM 16HD), GE (Bright Speed S), Hitachi (ECLOS), Philips (Ingenuity CT iDOSE4) and Siemens (Somatom Perspective). The scanning parameters are as follows: 120 kVp, 100-200 mAs, pitch 0.75-1.5 and collimation 1-5 mm.
B. Data Preprocessing
Though the previous section briefly introduced how the CT images are processed as MIL bags, we give more details here for the implementation. The CT images are originally stored in DICOM files and converted to PNG format for further processing, with each PNG file corresponding to one CT slice. After the CT images have been sliced, the slices with few lung tissues are removed. For each remaining slice, we locate the bounding box of the two lungs and crop the region containing them. The cropped region is then resized to a fixed size and divided into 12 equal non-overlapping patches. To remove inter-subject variation, we further perform min-max normalization on each patch individually.

Eventually, we obtained a dataset consisting of 229 bags, 179 of which are negative (i.e., non-severe cases) and 50 of which are positive (i.e., severe cases). There are 49,632 instances (patches) in total, with around 217 instances per bag on average.
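The patch extraction just described can be sketched as follows. The resize resolution and the lung bounding-box input are placeholders (the exact resolution used is not stated here), and a 3 x 4 grid of 12 patches is assumed.

```python
import numpy as np
import cv2

def slice_to_instances(slice_img, lung_box, out_h=192, out_w=256):
    """Crop the lung region of one CT slice, resize it, split it into an
    assumed 3x4 grid of 12 patches, and min-max normalize each patch."""
    x0, y0, x1, y1 = lung_box                       # bounding box of both lungs
    lungs = cv2.resize(slice_img[y0:y1, x0:x1].astype(np.float32),
                       (out_w, out_h))              # placeholder resolution
    ph, pw = out_h // 3, out_w // 4
    patches = [lungs[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(3) for c in range(4)]
    # per-patch min-max normalization to remove inter-subject variation
    return [(p - p.min()) / (p.max() - p.min() + 1e-8) for p in patches]
```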
C. Experimental Setup

We employ standard 10-fold cross-validation, in which each sample is tested at least once. Each experiment is performed 5 times, and the average (± one standard deviation) is reported to avoid possible data-split bias.

In our experiments, we use the following metrics to evaluate the performance: accuracy, sensitivity (true positive rate, TPR), specificity (true negative rate, TNR), F1-score and the area under the receiver operating characteristic curve (AUC). Specifically, these metrics are defined as:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \quad (7)$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \quad (8)$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}, \quad (9)$$

$$\mathrm{F1\text{-}score} = \frac{2\,TP}{2\,TP + FP + FN}, \quad (10)$$

where TP, FP, TN, and FN represent the numbers of True Positives, False Positives, True Negatives and False Negatives, respectively.
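The metrics of Eqs. (7)-(10) are direct functions of the confusion-matrix counts; a one-to-one transcription:

```python
def classification_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # Eq. (7)
    sensitivity = tp / (tp + fn)                 # Eq. (8), true positive rate
    specificity = tn / (tn + fp)                 # Eq. (9), true negative rate
    f1 = 2 * tp / (2 * tp + fp + fn)             # Eq. (10)
    return accuracy, sensitivity, specificity, f1
```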
D. Implementation Details

In this part, we provide the implementation details of the three components of our method.
1) Bag-level Prediction
For bag-level prediction, we construct a deep attention-based MIL model. To keep consistency with the previous work [16], we choose LeNet [40] as the instance feature extractor, and the dimensionality of the features is 512. In the attention mechanism, the parameter L is set to 128. A fully connected (FC) layer with a Sigmoid function works as a linear classifier, and the classification threshold is set to 0.5. All layers are initialized according to [41] and biases are set to zero. The network architecture is shown in Table II, where conv(5,1,0)-36 denotes a kernel size of 5, a stride of 1, a padding of 0 and 36 output channels. The model is trained with the Adam optimization algorithm [42]. The hyperparameters of the optimization procedure are given in Table III.
2) Instance-level Augmentation
In virtual-bag generation, the value of the parameter α is very important. As mentioned in Section III-C, in our setting α should be smaller than the actual proportion of key instances, to prevent regular instances from being mistaken for key instances. We therefore set α to 0.025, because it shows the greatest accuracy on the validation set.
TABLE II
THE DETAILS OF OUR NETWORK ARCHITECTURE
Layer   Type
1       conv(5,1,0)-36 + ReLU
2       maxpool(2,2)
3       conv(5,1,0)-36 + ReLU
4       maxpool(2,2)
5       conv(5,1,0)-48 + ReLU
6       maxpool(2,2)
7       fc-512 + ReLU
8       MIL-attention-128
9       fc-1 + Sigmoid
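The feature-extraction part of Table II (layers 1-7) can be written in PyTorch as below; the flattened dimension fed to the fc-512 layer depends on the patch resolution, which we leave as a constructor argument since it is not fixed here.

```python
import torch
import torch.nn as nn

class InstanceEncoder(nn.Module):
    """LeNet-style instance feature extractor following Table II (layers 1-7)."""
    def __init__(self, fc_in):   # fc_in = 48 * h' * w' after the conv stack
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 36, kernel_size=5, stride=1, padding=0), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(36, 36, kernel_size=5, stride=1, padding=0), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(36, 48, kernel_size=5, stride=1, padding=0), nn.ReLU(),
            nn.MaxPool2d(2, 2))
        self.fc = nn.Sequential(nn.Linear(fc_in, 512), nn.ReLU())

    def forward(self, x):        # x: (K, 1, H, W) patches of one bag
        return self.fc(self.conv(x).flatten(1))   # (K, 512) instance features
```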
TABLE III
THE SETTING OF PARAMETERS IN OUR EXPERIMENTS
(The hyperparameters of the Adam optimizer, including β1 and β2, and their values.)

In contrast, γ just needs to be relatively small, to make sure that no key instance incorrectly gets into the list of regular instances. In our experiments, γ is fixed at 0.2, which shows good performance according to our evaluation. Besides, at the beginning of the training phase the MIL model is under-fitting, so it cannot yet provide very accurate weights. Therefore, in our experiments, the model starts to generate virtual bags only after the first few epochs, and then keeps doing so until the last epoch.
3) Self-supervised Loss
For the relative patch location task, given two patches, the network $G^r_\phi(\cdot,\cdot)$ takes the concatenation of their feature vectors as input to two fully connected layers. For the absolute patch location task, another network $G^a_\phi(\cdot)$ consisting of two fully connected layers takes the feature of a single patch as input and predicts its location. For both tasks, we set µ to 0.3. We also optimize the auxiliary loss with the Adam algorithm [42]; the hyperparameters of the optimization algorithm are consistent with those shown in Table III.

E. Ablation Study
To evaluate the effectiveness of the different components of the proposed method, we have conducted ablation studies with the following configurations:

• (A) MIL Only: our method without data augmentation and self-supervised learning;
• (B) MIL + Augmentation: our method without self-supervised learning;
• (C) MIL + Self-supervision: our method without data augmentation;
• (D) MIL + Both: our method.

The values of the hyperparameters involved have already been described above. For self-supervised learning, we choose the absolute patch location task as the pretext task, because experimental evidence shows that it outperforms the relative patch location task in our problem setting (see the discussion below for details).

The results of all these configurations are presented in Table IV. A statistical comparison (i.e., a two-sample t-test) on the AUC metric is conducted and the p-values are reported below. Comparing (B) with (A), we find that the proposed data augmentation technique significantly improves the overall performance of the MIL model (p < 0.05), especially on the sensitivity criterion important for COVID-19 diagnosis. Similarly, the comparison between (C) and (A) shows that the auxiliary self-supervision loss also results in a performance gain (p < 0.05). Comparing (D) with the other configurations (A, B and C), the MIL model incorporating both data augmentation and self-supervised learning achieves the best performance (all p < 0.05).

Fig. 9. Visualization of the bag-level features extracted in different configurations. The left corresponds to (A), while the right corresponds to (D). Red and green points stand for severe and non-severe cases, respectively.
The visualization in Fig. 9 indicates that our method can learn more discriminative feature representations than the original MIL model.
F. Comparison with Existing Methods
We have compared our method with the existing works by Tang et al. [1] and Yang et al. [2], which share the same problem setting as ours. Regarding dataset size, compared with the chest CT images of 179 patients in Tang et al.'s work and those of 102 patients in Yang et al.'s work, our work includes a larger dataset of 229 patients. In terms of data annotation, our method works under the weak annotation setting, with only image-level labels (severe or non-severe) available, whereas their works need additional manual annotation besides image-level labels: Tang et al.'s work depends on 63 quantitative features calculated from accurate segmentation results of infection regions, and the segmentation network needs manual delineations for training; Yang et al.'s work depends on manually defined severity scores of lung regions provided by chest radiologists.

Table V displays the comparison between our proposed method and the existing methods. Because their data and code are not accessible, the results in the first two lines are directly reported from their papers. The third line shows that our MIL model by itself already achieves better performance in terms of the accuracy and AUC metrics. As shown in the last line, our proposed method with data augmentation and self-supervised learning shows superior performance on a larger dataset when compared with these two state-of-the-art methods.
G. Efficiency of MIL Method
We implemented our experiments on one Nvidia GeForce RTX 2080 Ti GPU. In the 10-fold cross-validation, for each data split, training our MIL model (with data augmentation and the auxiliary self-supervised loss) on 206 samples takes only a short time on average, and testing on 23 samples takes less than 1 second. This shows that the proposed method is quite efficient in computation.

H. Interpretability of MIL Method
Along with a predicted label, the MIL model also outputs attention weights for each patch. Although the model is not designed to accurately segment the lesions, it can still help to indicate the regions relevant to severe infection. In Fig. 10, we show that the attention weights can be useful for finding severe infection regions.
I. Method Design Details
We now discuss some of the choices we made in designing our method. Experiments on the validation set show that the values of the parameters α and µ, as well as the selection of the pretext task, can affect the performance of our method.
1) Bag-level Prediction
For the instance feature extractor, we do not use the deeper ResNet [43] or DenseNet [44], because our experiments show that deeper networks cannot significantly improve the performance, but rather increase the time consumed. For MIL attention pooling, we test the following dimensions (L): 64, 128 and 256. The differences in dimension only result in minor changes in the model's performance.
2) Instance-level Augmentation
For the data augmentation technique, we evaluate the performance for different values of α; the results are illustrated in Fig. 11. Our experiments show that different values of γ do not bring great variation in the model's performance.
3) Self-supervised Loss
We have evaluated the performance of the two pretext tasks, and the results are shown in Table VI. According to the results, utilizing the absolute patch location task achieves better performance than utilizing the relative patch location task (p < 0.05). The reason could be that different lung regions play different roles in COVID-19 severity assessment, as shown by an existing study [1], while the absolute patch location task may help to extract both high-level semantics and spatial information. For the absolute patch location task, we further conduct experiments to find the best value of µ, and the results are illustrated in Fig. 12.

There exist some other pretext tasks in self-supervised learning, such as colorization [45], denoising [46], image restoration [47] and so on. However, we argue that patch location prediction is the most applicable for our method, because it is a patch-oriented pretext task.
TABLE IV
THE RESULTS OF THE ABLATION STUDY
Rows: (A) MIL Only, (B) MIL + Augmentation, (C) MIL + Self-supervision, (D) MIL + Both. Columns: Accuracy, Sensitivity, Specificity, F1-Score and AUC, each reported as mean ± standard deviation over five runs; configuration (D) achieves the best overall performance.

TABLE V
COMPARISONS BETWEEN OUR PROPOSED METHOD AND THE EXISTING APPROACHES. NOTE THAT OUR METHOD REQUIRES NO ADDITIONAL ANNOTATION OTHER THAN IMAGE-LEVEL LABELS, WHILE TANG et al.'S WORK REQUIRES ACCURATE SEGMENTATION OF INFECTION REGIONS AND YANG et al.'S WORK DEPENDS ON MANUALLY DEFINED SEVERITY SCORES.

Method              Accuracy   Sensitivity   Specificity   F1-Score   AUC
Tang et al.'s [1]   0.875      0.933         0.745         -          0.910
Yang et al.'s [2]   0.833      0.940         -             -          0.892
MIL Only (Ours)     (mean ± std over five runs; higher accuracy and AUC than [1], [2])
MIL + Both (Ours)   0.958      0.936         0.964

Fig. 10. Visualization of the attention mechanism in our method. Presented above are some examples of the (slice-wise rescaled) attention weights of patches. Severe infection regions identified by experts are marked with yellow boxes. It can be seen that the patches with high weights are probably relevant to severe infection.
TABLE VI
PERFORMANCE OF DIFFERENT PRETEXT TASKS
Rows: Baseline, Relative Patch Location, Absolute Patch Location. Columns: Accuracy, Sensitivity, Specificity, F1-Score and AUC, each reported as mean ± standard deviation; the absolute patch location task achieves the best results.

Fig. 11. Performance of different α. The best α is 0.025.

Fig. 12. Performance of different µ. The best µ is 0.3.

Considering that the patches have already been defined and cropped in the MIL setting, no additional image transformation is required. Besides, in our preliminary experiments, we noted that other pretext tasks might not be applicable to the severity assessment of COVID-19, due to its strong spatial relations and low CT contrast.

V. DISCUSSION
For future directions of our current study, we consider the following four aspects to improve our work:

• Data preprocessing.
We will consider a better way of data preprocessing. In the current study, we mainly focus on designing the learning method, and thus implement the data preprocessing in a simple way. In our future work, we will incorporate automatic segmentation methods to obtain more fine-grained patches for further performance improvement. In the meantime, we can also remove a large number of irrelevant patches to further improve the effectiveness and efficiency of our method.

• Self-supervised representation.
In this study, the efficacy of self-supervised learning has been evaluated. In our future work, we will further incorporate more advanced methods of self-supervised contrastive learning. In this way, we can exploit informative representations in an unsupervised manner.

• Longitudinal information.
Longitudinal information could benefit the prediction of the changing trend of the disease for better severity assessment. In our future work, we will incorporate longitudinal CT scans for severity assessment, to provide better treatment and follow-up of COVID-19 patients.

• Manual delineation.
Since manual annotation is laborious, we will investigate semi-supervised learning models to further alleviate the required amount of annotation.

For possible clinical applications, we believe that our proposed method has great potential. First, by training with a small number of weakly-annotated CT images, our proposed method can predict the severity of COVID-19 with high accuracy. Second, our proposed method provides a powerful feature extractor for CT images. Specifically, the bag features are actually features of the CT image and can act as imaging attributes, which can be combined with clinical/biological attributes for other tasks (e.g., patient outcome prediction). Moreover, our proposed method can be extended to other problems in which the challenges of weak annotation and insufficient data also exist, beyond COVID-19 severity assessment.

VI. CONCLUSION
In this paper, we investigate the challenging clinical task of quickly and accurately predicting the severity level of COVID-19. We observe two issues that may obstruct COVID-19 severity assessment: weak annotation and insufficient data. To meet these challenges, we develop a deep attention-based MIL method combined with data augmentation and self-supervised learning. Experimental results successfully demonstrate the effectiveness of our proposed components, including 1) the MIL model for bag-level prediction, 2) the instance-level augmentation technique that generates virtual positive bags, and 3) the auxiliary self-supervised loss for extracting more discriminative features. Moreover, our approach shows remarkably better performance when compared with the existing methods.

ACKNOWLEDGMENT
The work was supported by the National Key Research and Development Program of China (2019YFC0118300) and the National Natural Science Foundation of China (61673203, 81927808). The work was also supported by the Key Emergency Project of Pneumonia Epidemic of Novel Coronavirus Infection (2020SK3006), the Emergency Project of Prevention and Control for COVID-19 of Central South University (160260005) and a Foundation from the Changsha Scientific and Technical Bureau, China (kq2001001).
REFERENCES

[1] Z. Tang, W. Zhao, X. Xie, Z. Zhong, F. Shi, J. Liu, and D. Shen, "Severity assessment of coronavirus disease 2019 (covid-19) using quantitative features from chest ct images," arXiv preprint arXiv:2003.11988, 2020.
[2] R. Yang, X. Li, H. Liu, Y. Zhen, X. Zhang, Q. Xiong, Y. Luo, C. Gao, and W. Zeng, "Chest ct severity score: an imaging tool for assessing severe covid-19," Radiology: Cardiothoracic Imaging, vol. 2, no. 2, p. e200047, 2020.
[3] F. Shan, Y. Gao, J. Wang, W. Shi, N. Shi, M. Han, Z. Xue, D. Shen, and Y. Shi, "Abnormal lung quantification in chest ct images of covid-19 patients with deep learning and its application to severity prediction," Medical Physics, 2020.
[4] K. Li, Y. Fang, W. Li, C. Pan, P. Qin, Y. Zhong, X. Liu, M. Huang, Y. Liao, and S. Li, "Ct image visual quantitative evaluation and clinical classification of coronavirus disease (covid-19)," European Radiology, pp. 1–10, 2020.
[5] H. Chao, X. Fang, J. Zhang, F. Homayounieh, C. D. Arru, S. R. Digumarthy, R. Babaei, H. K. Mobin, I. Mohseni, L. Saba, et al., "Integrative analysis for covid-19 patient outcome prediction," Medical Image Analysis, vol. 67, p. 101844, 2020.
[6] G. Chassagnon, M. Vakalopoulou, E. Battistella, S. Christodoulidis, T.-N. Hoang-Thi, S. Dangeard, E. Deutsch, F. Andre, E. Guillo, N. Halm, et al., "Ai-driven quantification, staging and outcome prediction of covid-19 pneumonia," Medical Image Analysis, vol. 67, p. 101860, 2020.
[7] Z.-H. Zhou, "A brief introduction to weakly supervised learning," National Science Review, vol. 5, no. 1, pp. 44–53, 2018.
[8] J. Amores, "Multiple instance classification: Review, taxonomy and comparative study," Artificial Intelligence, vol. 201, pp. 81–105, 2013.
[9] P. O. Pinheiro and R. Collobert, "From image-level to pixel-level labeling with convolutional networks," in CVPR, pp. 1713–1721, 2015.
[10] J. Wu, Y. Yu, C. Huang, and K. Yu, "Deep multiple instance learning for image classification and auto-annotation," in CVPR, pp. 3460–3469, 2015.
[11] B. Babenko, N. Verma, P. Dollár, and S. J. Belongie, "Multiple instance learning with manifold bags," in ICML, 2011.
[12] M. Sun, T. X. Han, M.-C. Liu, and A. Khodayari-Rostamabad, "Multiple instance learning convolutional neural networks for object recognition," in ICPR, pp. 3270–3275, IEEE, 2016.
[13] V. Cheplygina, M. de Bruijne, and J. P. Pluim, "Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis," Medical Image Analysis, vol. 54, pp. 280–296, 2019.
[14] G. Quellec, G. Cazuguel, B. Cochener, and M. Lamard, "Multiple-instance learning for medical image and video analysis," IEEE Reviews in Biomedical Engineering, vol. 10, pp. 213–234, 2017.
[15] K. Sirinukunwattana, S. E. A. Raza, Y.-W. Tsang, D. R. Snead, I. A. Cree, and N. M. Rajpoot, "Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1196–1206, 2016.
[16] M. Ilse, J. Tomczak, and M. Welling, "Attention-based deep multiple instance learning," in International Conference on Machine Learning, pp. 2127–2136, 2018.
[17] Z. Han, B. Wei, Y. Hong, T. Li, J. Cong, X. Zhu, H. Wei, and W. Zhang, "Accurate screening of covid-19 using attention-based deep 3d multiple instance learning," IEEE Transactions on Medical Imaging, vol. 39, no. 8, pp. 2584–2594, 2020.
[18] M. Kandemir, M. Haussmann, F. Diego, K. T. Rajamani, J. Van Der Laak, and F. A. Hamprecht, "Variational weakly supervised gaussian processes," in BMVC, pp. 71.1–71.12, 2016.
[19] L. Hou, D. Samaras, T. M. Kurc, Y. Gao, J. E. Davis, and J. H. Saltz, "Patch-based convolutional neural network for whole slide tissue image classification," in CVPR, pp. 2424–2433, 2016.
[20] L. Taylor and G. Nitschke, "Improving deep learning with generic data augmentation," in IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1542–1547, IEEE, 2018.
[21] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," arXiv preprint arXiv:1710.09412, 2017.
[22] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, "Random erasing data augmentation," in AAAI, pp. 13001–13008, 2020.
[23] T. DeVries and G. W. Taylor, "Improved regularization of convolutional neural networks with cutout," arXiv preprint arXiv:1708.04552, 2017.
[24] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, "Synthetic data augmentation using gan for improved liver lesion classification," in IEEE International Symposium on Biomedical Imaging (ISBI), pp. 289–293, IEEE, 2018.
[25] L. Perez and J. Wang, "The effectiveness of data augmentation in image classification using deep learning," arXiv preprint arXiv:1712.04621, 2017.
[26] J. Lemley, S. Bazrafkan, and P. Corcoran, "Smart augmentation learning an optimal data augmentation strategy," IEEE Access, vol. 5, pp. 5858–5869, 2017.
[27] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, "Autoaugment: Learning augmentation strategies from data," in CVPR, pp. 113–123, 2019.
[28] G. Larsson, M. Maire, and G. Shakhnarovich, "Learning representations for automatic colorization," in European Conference on Computer Vision, pp. 577–593, Springer, 2016.
[29] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in European Conference on Computer Vision, pp. 649–666, Springer, 2016.
[30] C. Doersch, A. Gupta, and A. A. Efros, "Unsupervised visual representation learning by context prediction," in ICCV, pp. 1422–1430, 2015.
[31] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in European Conference on Computer Vision, pp. 69–84, Springer, 2016.
[32] S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," arXiv preprint arXiv:1803.07728, 2018.
[33] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in CVPR, pp. 2536–2544, 2016.
[34] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.
[35] Z. Zhou, V. Sodha, J. Pang, M. B. Gotway, and J. Liang, "Models genesis," Medical Image Analysis, vol. 67, p. 101840, 2021.
[36] T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby, "Self-supervised generative adversarial networks," arXiv preprint arXiv:1811.11212, 2018.
[37] S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord, "Boosting few-shot visual learning with self-supervision," in ICCV, pp. 8059–8068, 2019.
[38] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," Journal of Big Data, vol. 6, no. 1, p. 60, 2019.
[39] T. Qin, W. Li, Y. Shi, and Y. Gao, "Unsupervised few-shot learning via distribution shift-based augmentation," arXiv preprint arXiv:2004.05805, 2020.
[40] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[41] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
[42] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[43] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, pp. 770–778, 2016.
[44] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, pp. 4700–4708, 2017.
[45] T. Ross, D. Zimmerer, A. Vemuri, F. Isensee, M. Wiesenfarth, S. Bodenstedt, F. Both, P. Kessler, M. Wagner, B. Müller, et al., "Exploiting the potential of unlabeled endoscopic video data with self-supervised learning," International Journal of Computer Assisted Radiology and Surgery, vol. 13, no. 6, pp. 925–933, 2018.
[46] V. Alex, K. Vaidhya, S. Thirunavukkarasu, C. Kesavadas, and G. Krishnamurthi, "Semisupervised learning using denoising autoencoders for brain lesion detection and segmentation," Journal of Medical Imaging, vol. 4, no. 4, p. 041311, 2017.
[47] L. Chen, P. Bentley, K. Mori, K. Misawa, M. Fujiwara, and D. Rueckert, "Self-supervised learning for medical image analysis using image context restoration," Medical Image Analysis, vol. 58, p. 101539, 2019.