Big Self-Supervised Models Advance Medical Image Classification

Shekoofeh Azizi*, Basil Mustafa*, Fiona Ryan†, Zachary Beaver, Jan Freyberg, Jonathan Deaton, Aaron Loh, Alan Karthikesalingam, Simon Kornblith, Ting Chen, Vivek Natarajan, Mohammad Norouzi
{shekazizi, skornblith, iamtingchen, natviv, mnorouzi}@google.com
Google Research and Health

*Work done as part of the Google AI Residency Program. †Former intern at Google; currently at Georgia Institute of Technology.
Abstract
Self-supervised pretraining followed by supervised fine-tuning has seen success in image recognition, especially when labeled examples are scarce, but has received limited attention in medical image analysis. This paper studies the effectiveness of self-supervised learning as a pretraining strategy for medical image classification. We conduct experiments on two distinct tasks: dermatology skin condition classification from digital camera images and multi-label chest X-ray classification, and demonstrate that self-supervised learning on ImageNet, followed by additional self-supervised learning on unlabeled domain-specific medical images, significantly improves the accuracy of medical image classifiers. We introduce a novel Multi-Instance Contrastive Learning (MICLe) method that uses multiple images of the underlying pathology per patient case, when available, to construct more informative positive pairs for self-supervised learning. Combining our contributions, we achieve an improvement of 6.7% in top-1 accuracy and an improvement of 1.1% in mean AUC on dermatology and chest X-ray classification respectively, outperforming strong supervised baselines pretrained on ImageNet. In addition, we show that big self-supervised models are robust to distribution shift and can learn efficiently with a small number of labeled medical images.
1. Introduction
Learning from limited labeled data is a fundamental problem in machine learning, and it is especially crucial for medical image analysis because annotating medical images is time-consuming and expensive. Two common approaches to learning from limited labeled data are: (1) supervised pretraining on a large labeled dataset such as ImageNet, and (2) self-supervised pretraining using contrastive learning (e.g., [16, 7, 8]) on unlabeled data. After pretraining, supervised fine-tuning on the target labeled dataset of interest is used. While ImageNet pretraining is ubiquitous in medical image analysis [45, 31, 30, 28, 15, 20], the use of self-supervised approaches has received limited attention. Self-supervised approaches are attractive because they enable the use of unlabeled domain-specific images during pretraining to learn more relevant representations.
Figure 1: Our approach comprises three steps: (1) self-supervised pretraining on unlabeled ImageNet using SimCLR [7]; (2) additional self-supervised pretraining using unlabeled medical images, where, if multiple images of each medical condition are available, a novel Multi-Instance Contrastive Learning (MICLe) is used to construct more informative positive pairs based on different images; (3) supervised fine-tuning on labeled medical images. Note that unlike step (1), steps (2) and (3) are task and dataset specific.
Figure 2: Comparison of supervised and self-supervised pretraining, followed by supervised fine-tuning, using two architectures (ResNet-50 (4×) and ResNet-152 (2×)) on dermatology and chest X-ray classification. Self-supervised learning utilizes unlabeled domain-specific medical images and significantly outperforms supervised ImageNet pretraining.

This paper studies self-supervised learning for medical image analysis and conducts a fair comparison between self-supervised and supervised pretraining on two distinct medical image classification tasks: (1) dermatology skin condition classification from digital camera images, and (2) multi-label chest X-ray classification among five pathologies based on the CheXpert dataset [23]. We observe that self-supervised pretraining outperforms supervised pretraining, even when the full ImageNet dataset (14M images and 21.8K classes) is used for supervised pretraining. We attribute this finding to the domain shift and the discrepancy between the nature of the recognition tasks in ImageNet and in medical image classification. Self-supervised approaches bridge this domain gap by leveraging in-domain medical data for pretraining, and they also scale gracefully since they do not require any form of class label annotation.

An important component of our self-supervised learning framework is a novel Multi-Instance Contrastive Learning (MICLe) strategy that helps adapt contrastive learning to multiple images of the underlying pathology per patient case.
Such multi-instance data is often available in medical imaging datasets, e.g., frontal and lateral views of chest X-rays/mammograms, retinal fundus images from each eye, etc. Given multiple images of a patient case, we propose to construct a positive pair for self-supervised contrastive learning by drawing two crops from two distinct images of the same patient case. Such images may be taken from different viewing angles and show different body parts with the same underlying pathology. This presents a great opportunity for self-supervised learning algorithms to learn representations that are robust to changes of viewpoint, imaging conditions, and other confounding factors in a direct way. MICLe does not require class label information and only relies on different images of an underlying pathology, the type of which may be unknown.

Fig. 1 depicts the proposed self-supervised learning approach, and Fig. 2 summarizes the results. Our key findings and contributions include:

• We investigate the choice of datasets for self-supervised pretraining and find that pretraining on ImageNet is complementary to pretraining on unlabeled medical images, i.e., best results are achieved when both are combined.
• We propose Multi-Instance Contrastive Learning (MICLe) to leverage the potential availability of multiple images per medical condition. We find that MICLe significantly improves the accuracy of skin condition classification, yielding state-of-the-art results.
• Our careful empirical study on two distinct datasets suggests that self-supervised pretraining often outperforms supervised pretraining on ImageNet. We show that self-supervised pretraining is particularly effective for semi-supervised learning, i.e., when additional unlabeled examples are available for pretraining. In this setting, we are able to match the baseline performance using only 20% of the available labels for the dermatology task.
• We combine our contributions to achieve an improvement of 6.7% in top-1 accuracy on dermatology skin condition classification and an improvement of 1.1% in mean AUC on chest X-ray classification, outperforming strong supervised baselines pretrained on ImageNet.
• We demonstrate that self-supervised models are robust and generalize better than baselines when subjected to shifted test sets, without fine-tuning. Such behavior is desirable for deployment in a real-world clinical setting.
2. Related Work
Transfer Learning for Medical Image Analysis.
Despite differences in image statistics, scale, and task-relevant features, transfer learning from natural images is commonly used in medical image analysis [28, 30, 31, 45], and multiple empirical studies show that it improves performance [1, 15, 20]. However, a detailed investigation of this strategy by Raghu et al. [36] indicates that it does not always improve performance in medical imaging contexts. They do show, however, that transfer learning from ImageNet can speed up convergence and is particularly helpful when the medical image training data is limited. Importantly, that study used relatively small architectures and found pronounced improvements with small amounts of data, especially when using their largest architecture, ResNet-50 (1×) [18]. Transfer learning from in-domain data can help alleviate the domain mismatch issue; for example, [6, 20, 25, 13] report performance improvements when pretraining on labeled data in the same domain. However, this approach is often infeasible for many medical tasks in which labeled data is expensive and time-consuming to obtain. Recent advances in self-supervised learning provide a promising alternative, enabling the use of unlabeled medical data that is often easier to procure.

Figure 3: An illustration of our self-supervised pretraining for medical image analysis. When a single image of a medical condition is available, we use standard data augmentation to generate two augmented views of the same image. When multiple images are available, we use two distinct images to directly create a positive pair of examples and adopt lightweight augmentations. We call the latter approach Multi-Instance Contrastive Learning (MICLe).

Self-supervised Learning. Initial work in self-supervised representation learning focused on the problem of learning embeddings without labels such that a low-capacity (commonly linear) classifier operating on these embeddings could achieve high classification accuracy [12, 14, 34, 48].
Contrastive self-supervised methods such as instance discrimination [44], CPC [21, 35], Deep InfoMax [22], Ye et al. [46], AMDIM [2], CMC [40], MoCo [17, 9], PIRL [32], and SimCLR [7, 8] were the first to achieve linear classification accuracy approaching that of end-to-end supervised training. Recently, these methods have been harnessed to achieve dramatic improvements in label efficiency for semi-supervised learning. Specifically, one can first pretrain in a task-agnostic, self-supervised fashion using all data, and then fine-tune on the labeled subset in a task-specific fashion with a standard supervised objective [7, 8, 21]. Chen et al. [8] show that this approach benefits substantially from large (high-capacity) models for pretraining and fine-tuning, but after a large model is trained, it can be distilled to a much smaller model with little loss in accuracy.

Our Multi-Instance Contrastive Learning approach is also related to previous work that uses multiple views for contrastive learning. Tschannen et al. [41] use the multiple views naturally arising from temporal variation in videos, employing noise-contrastive estimation to learn visual representations. Other work has used contrastive learning with views from multiple cameras [37].
Self-supervision for Medical Image Analysis.
Although self-supervised learning has only recently become viable on standard image classification datasets, it has already seen some application within the medical domain. While some works have attempted to design domain-specific pretext tasks [3, 39, 52, 51], other works concentrate on tailoring contrastive learning to medical data [11, 19, 26, 50]. Most closely related to our work, Sowrirajan et al. [38] explore the use of MoCo pretraining for semi-supervised classification on the CheXpert dataset. Several recent publications investigate semi-supervised learning for medical imaging tasks (e.g., [10, 27, 42, 49]). These methods are complementary to ours, and we believe combining self-training and self-supervised pretraining is an interesting avenue for future research (e.g., [8]).
3. Self-Supervised Pretraining
Our approach comprises the following steps. First, we perform self-supervised pretraining on unlabeled images using contrastive learning to learn visual representations. For contrastive learning, we use a combination of the unlabeled ImageNet dataset and task-specific medical images. Then, if multiple images of each medical condition are available, Multi-Instance Contrastive Learning (MICLe) is used for additional self-supervised pretraining. Finally, we perform supervised fine-tuning on labeled medical images. Figure 1 summarizes our proposed method.
3.1. SimCLR

To learn visual representations effectively from unlabeled images, we adopt SimCLR [7, 8], a recently proposed approach based on contrastive learning. SimCLR learns representations by maximizing agreement [4] between differently augmented views of the same data example via a contrastive loss in a hidden representation of neural nets.

Given a randomly sampled mini-batch of images, each image x_k is augmented twice using random crop, color distortion, and Gaussian blur, creating two views of the same example, x_{2k-1} and x_{2k}. The two images are encoded via an encoder network f(·) (a ResNet [18]) to generate representations h_{2k-1} and h_{2k}. The representations are then transformed again with a non-linear transformation network g(·) (an MLP projection head), yielding z_{2k-1} and z_{2k}, which are used for the contrastive loss.

With a mini-batch of encoded examples, the contrastive loss between a pair of positive examples i, j (augmented from the same image) is given as:

\ell_{i,j}^{\mathrm{NT\text{-}Xent}} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)},   (1)

where sim(·,·) is the cosine similarity between two vectors and τ is a temperature scalar.

3.2. Multi-Instance Contrastive Learning (MICLe)

In medical image analysis, it is common to utilize multiple images per patient to improve classification accuracy and robustness. Such images may be taken from different viewpoints or under different lighting conditions, providing complementary information for medical diagnosis. When multiple images of a medical condition are available as part of the training dataset, we propose to learn representations that are invariant not only to different augmentations of the same image, but also to different images of the same medical pathology. Accordingly, after pretraining with standard SimCLR on two augmented views of each image, we conduct another self-supervised learning stage, where positive pairs are constructed by drawing two crops from two different images of the same patient, as demonstrated in Fig. 3. In this case, the objective still takes the form of Eq. (1), but the images contributing to each positive pair are distinct. In standard SimCLR, to construct a mini-batch of 2N representations, one uses N images, each of which is augmented twice. In MICLe, we use a mini-batch of N pairs of related images, and since the images are distinct, we use lightweight data augmentation. Additional details regarding augmentation selection in MICLe are provided in Appendix B.1.2.

Leveraging multiple images of the same condition with the contrastive loss helps the model learn representations that are more robust to changes of viewpoint, lighting conditions, and other confounding factors. We find that multi-instance contrastive learning significantly improves accuracy and helps us achieve state-of-the-art results on the dermatology condition classification task.
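To make the objective concrete, below is a minimal PyTorch sketch of the NT-Xent loss in Eq. (1); it is an illustrative re-implementation assuming the batch layout described above (rows 2k and 2k+1 are the two views of example k), not the exact training code used in our experiments.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent loss of Eq. (1).

    z: (2N, D) projections, laid out so that rows 2k and 2k+1 are the
       two views of example k (0-indexed).
    """
    z = F.normalize(z, dim=1)        # cosine similarity = dot product of unit vectors
    sim = z @ z.t() / tau            # (2N, 2N) pairwise similarities
    n2 = z.shape[0]
    # Mask out self-similarity so it never appears in the denominator.
    sim.fill_diagonal_(float("-inf"))
    # The positive for row 2k is row 2k+1, and vice versa (XOR flips the last bit).
    pos_index = torch.arange(n2, device=z.device) ^ 1
    # Cross-entropy over each row computes -log softmax at the positive entry.
    return F.cross_entropy(sim, pos_index)
```

Under MICLe, the same loss applies unchanged; only the construction of the two views differs, with the paired rows drawn from two distinct images of the same patient rather than two augmentations of one image.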
4. Experiment Setup
We consider two popular medical imaging tasks for this study. The first task is in the dermatology domain and involves identifying skin conditions from digital camera images. The second task involves multi-label classification of chest X-rays among five pathologies. We chose these tasks because they embody many common characteristics of medical imaging tasks, such as imbalanced data and pathologies of interest restricted to small local patches. At the same time, they are also quite diverse in terms of the type of images, label space, and task setup. For example, the dermatology images are visually similar to natural images, whereas the chest X-rays are gray-scale and have standardized views. This, in turn, helps us probe the generality of our proposed methods.
Dermatology.
For the dermatology task, we follow the experiment setup and dataset of [28]. The dataset was collected and de-identified by a US-based teledermatology service, with images of skin conditions taken using consumer-grade digital cameras. The images are heterogeneous in nature and exhibit significant variation in pose, lighting, blur, and body parts. The background also contains various noise artifacts, like clothing and walls, which add to the challenge. The ground truth labels were aggregated from a panel of several US board-certified dermatologists who provided a differential diagnosis of the skin conditions in each case. In all, the dataset has cases from a total of 12,306 unique patients, and each case includes between one and six images. The data was further split into development and test sets, ensuring no patient overlap between the two. Cases with multiple skin conditions or poor-quality images were filtered out. The final train, validation, and test sets have a total of 15,340 cases, 1,190 cases, and 4,146 cases, respectively. There are 419 unique skin condition labels in the dataset. For the purpose of model development, we identified the 26 most common skin conditions and grouped the rest into an additional 'Other' class, leading to a final label space of 27 classes for the model. We refer to this as the
Derm dataset in the subsequent sections. We also use an additional de-identified Derm-External
Derm-External dataset collected in clinics in Australia to evaluate the gen-eralization performance of our proposed method under dis-tribution shift. This dataset is primarily focused on skincancers and the ground truth labels are obtained from biop-sies. The distribution shift in the labels make this a partic-ular challenging dataset to evaluate the zero-shot (i.e. with-out any additional fine-tuning) transfer performance of themodel. Additional details are provided in the Appendix A.1.For SimCLR pretraining, we combine the images fromDerm-train and Derm-External datasets, discarding the skincondition labels. We also had access to additional unlabeledimages from both these dataset sources leading to a total of454,295 images for self-supervised pretraining. We refer tothis as the
Derm-Unlabeled dataset. For MICLe pretraining, we only use the images coming from the 15,340 cases of the train split of the Derm dataset.

Chest X-rays.
CheXpert [23] is a large open-source dataset of de-identified chest radiograph (X-ray) images. The dataset consists of 224,316 chest radiographs from 65,240 unique patients. The ground truth labels were automatically extracted from radiology reports and correspond to a label space of 14 radiological observations. The validation set consists of 234 manually annotated chest X-rays. Given the small size of the validation set, and following the suggestion of [33, 36], for the downstream task evaluations we randomly re-split the training set into 67,429 training images, 22,240 validation images, and 33,745 test images. We train the model to predict the five pathologies used by Irvin and Rajpurkar et al. [23] in a multi-label classification task setting. For SimCLR pretraining in the chest X-ray domain, we only consider images coming from the train set of the CheXpert dataset, discarding the labels. We refer to this as the
CheXpert-Unlabeled dataset. Additional details are provided in Appendix A.2. In addition, we use the NIH chest X-ray dataset, which consists of 112,120 de-identified X-rays from 30,805 unique patients, to evaluate zero-shot transfer performance; additional details on this dataset can be found in [43].
To assess the effectiveness of self-supervised pretraining with big neural nets, as suggested in [7], we investigate ResNet-50 (1×), ResNet-50 (4×), and ResNet-152 (2×) architectures as our base encoder networks. Following SimCLR [7], two fully connected layers are used to map the output of the ResNets to a 128-dimensional embedding, which is used for contrastive learning. We also use the LARS optimizer [47] to stabilize training during pretraining. We perform SimCLR pretraining on the Derm-Unlabeled and CheXpert-Unlabeled datasets, both with and without initialization from ImageNet self-supervised pretrained weights. We indicate pretraining initialized with self-supervised ImageNet weights as ImageNet → Derm and ImageNet → CheXpert in the following sections.

Unless otherwise specified, for the dermatology pretraining task, due to the similarity of dermatology images to natural images, we use the same data augmentation used to generate positive pairs in SimCLR. This includes random color augmentation (strength = 1.0), crops with resize, Gaussian blur, and random flips. We find that a batch size of 512 and a learning rate of 0.3 work well in this setting. Using this protocol, all models were pretrained for up to 150,000 steps on the Derm-Unlabeled dataset.

For the CheXpert dataset, we pretrain with learning rate in { }, temperature in { }, and batch size in { }, and we select the model with the best performance on the downstream validation set. We also tested a range of possible augmentations and observe that the augmentations leading to the best performance on the validation set for this task are random cropping, random color jittering (strength = 0.5), rotation (up to 20 degrees), and horizontal flipping. Unlike the original set of augmentations proposed in SimCLR, we do not use Gaussian blur, because it could make it impossible to distinguish local texture variations and other areas of interest, thereby changing the underlying disease interpretation of the X-ray image. We leave a comprehensive investigation of the optimal augmentations to future work. Our best model on CheXpert was pretrained with a batch size of 1024 and a learning rate of 0.5, and we pretrain the models for up to 100,000 steps.

We perform MICLe pretraining only on the dermatology unlabeled dataset, as we did not have enough cases with multiple views in the CheXpert dataset to allow comprehensive training and evaluation of this approach. For MICLe pretraining, we initialize our model using SimCLR pretrained weights and then incorporate the multi-instance procedure explained in Section 3.2 to further learn a more comprehensive representation using multi-instance data. Due to memory limits caused by stacking up to six images per patient case, we train with a smaller batch size of 128 and a learning rate of 0.1 for 100,000 steps to stabilize the training; decreasing the learning rate for smaller batch sizes is suggested in [7]. The rest of the settings, including optimizer, weight decay, and warmup steps, are the same as in our previous pretraining protocol.
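As an illustration of the augmentation choices above, the dermatology SimCLR pipeline could be written with torchvision as follows. This is a minimal sketch under two assumptions not stated in the text: the SimCLR convention of scaling color jitter by the strength s (0.8·s for brightness/contrast/saturation, 0.2·s for hue, with random grayscale as part of the color augmentation) and a blur kernel of roughly 10% of the image size.

```python
from torchvision import transforms

SIMCLR_STRENGTH = 1.0  # color augmentation strength for the dermatology task

derm_simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),            # crops with resize
    transforms.RandomHorizontalFlip(),            # random flips
    transforms.RandomApply([transforms.ColorJitter(   # random color augmentation
        brightness=0.8 * SIMCLR_STRENGTH,
        contrast=0.8 * SIMCLR_STRENGTH,
        saturation=0.8 * SIMCLR_STRENGTH,
        hue=0.2 * SIMCLR_STRENGTH)], p=0.8),
    transforms.RandomGrayscale(p=0.2),            # part of SimCLR color augmentation
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),  # Gaussian blur
    transforms.ToTensor(),
])

# Each image is augmented twice to form the two views of a positive pair:
# view_1, view_2 = derm_simclr_augment(img), derm_simclr_augment(img)
```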
In all of our pretraining experiments, images are resized to 224 × 224 pixels. It takes approximately 12 hours to pretrain a ResNet-50 (1×) with batch size 512 for 100 epochs. Additional details about the selection of batch size, learning rate, and augmentations are provided in Appendix B.

We train the model end-to-end during fine-tuning, using the weights of the pretrained network as initialization for the downstream supervised task, following the approach described by Chen et al. [7, 8] in all our experiments. We train for 30,000 steps with a batch size of 256 using SGD with a momentum parameter of 0.9. For data augmentation during fine-tuning, we perform random color augmentation, crops with resize, blurring, rotation, and flips for the images in both tasks. We observe that this set of augmentations is critical for achieving the best performance during fine-tuning. We resize the Derm dataset images to 448 × 448 pixels and the CheXpert images to 224 × 224 pixels during this fine-tuning stage.

For every combination of pretraining strategy and downstream fine-tuning task, we perform an extensive hyperparameter search. We select the learning rate and weight decay after a grid search over seven logarithmically spaced learning rates and three logarithmically spaced values of weight decay, as well as no weight decay. For training from the supervised pretraining baseline, we follow the same protocol and observe that for all fine-tuning setups, 30,000 steps is sufficient to achieve optimal performance. For the supervised baselines, we compare against identical publicly available ResNet models pretrained on ImageNet with a standard cross-entropy loss. These models are trained with the same data augmentation as the self-supervised models (crops, strong color augmentation, and blur).

After identifying the best hyperparameters for fine-tuning on a given task/dataset, we select the model based on validation set performance and evaluate the chosen model multiple times (10 times for the chest X-ray task and 5 times for the dermatology task) on the test set to report task performance. Our primary metrics for the dermatology task are top-1 accuracy and area under the curve (AUC), following [28]. For the chest X-ray task, given the multi-label setup, we report the mean AUC averaged over the predictions for the five target pathologies, following [23]. Additional details about model selection, evaluation, and statistical significance testing are provided in Appendix B.1.1.
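For the chest X-ray metric, a minimal sketch of the mean-AUC computation with scikit-learn is shown below; the array layout and variable names are illustrative assumptions, with the five pathologies taken from Appendix A.2.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

PATHOLOGIES = ["atelectasis", "cardiomegaly", "consolidation",
               "edema", "pleural_effusion"]

def mean_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Mean AUC for the multi-label chest X-ray task.

    y_true:  (num_examples, 5) binary labels, one column per pathology.
    y_score: (num_examples, 5) predicted probabilities.
    """
    per_label = [roc_auc_score(y_true[:, j], y_score[:, j])
                 for j in range(len(PATHOLOGIES))]
    return float(np.mean(per_label))
```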
5. Experiments & Results
In this section, we investigate whether self-supervised pretraining with contrastive learning translates to better performance in models fine-tuned end-to-end on the selected medical image classification tasks. To this end, we first explore the choice of pretraining dataset for medical imaging tasks. Then, we evaluate the benefits of our proposed multi-instance contrastive learning (MICLe) for the dermatology condition classification task, and compare the proposed method against baselines and state-of-the-art supervised pretraining methods. Finally, we explore the label efficiency and transferability (under distribution shift) of self-supervised models in the medical image classification setting.
One important aspect of transfer learning via self-supervised pretraining is the choice of a proper unlabeled dataset. For this study, we use architectures of varying capacities (i.e., ResNet-50 (1×), ResNet-50 (4×), and ResNet-152 (2×)) as our base networks, and carefully investigate three possible scenarios for self-supervised pretraining in the medical context: (1) using the ImageNet dataset only (SimCLR; https://github.com/google-research/simclr); (2) using the task-specific unlabeled medical dataset (i.e., Derm or CheXpert); and (3) initializing the pretraining from the ImageNet self-supervised model but using the task-specific unlabeled dataset for pretraining, indicated here as ImageNet → Derm and ImageNet → CheXpert. Table 1 shows the performance of the dermatology skin condition and chest X-ray classification models, measured by top-1 accuracy (%) and area under the curve (AUC), across the different architectures and pretraining scenarios. Our results suggest that the best performance is achieved when both ImageNet and task-specific unlabeled data are used. Combining ImageNet and Derm unlabeled data for pretraining translates to an increase of over 1% in top-1 accuracy for dermatology classification over using only the ImageNet dataset for self-supervised transfer learning. This result suggests that pretraining on ImageNet is likely complementary to pretraining on unlabeled medical images. Moreover, we observe that larger models benefit much more from self-supervised pretraining, underscoring the importance of model capacity in this setting. As shown in Table 1, on CheXpert we once again observe that self-supervised pretraining with both ImageNet and in-domain CheXpert data is beneficial, outperforming self-supervised pretraining on ImageNet or CheXpert alone.

Next, we evaluate whether utilizing multi-instance contrastive learning (MICLe), leveraging the potential availability of multiple images per patient for a given pathology, is beneficial for self-supervised pretraining. Table 2 compares the performance of dermatology condition classification models fine-tuned on representations learned with and without MICLe pretraining. We observe that MICLe consistently improves the performance of dermatology classification over the original SimCLR method under different choices of pretraining dataset and base network architecture. Using MICLe for pretraining translates to an increase of over 1% in top-1 accuracy for dermatology classification over using the original SimCLR alone.

We further improve performance by training longer (1,000 epochs) with a larger batch size of 1,024, which provides more negative examples. We achieve a best-performing top-1 accuracy of just over 70% using the ResNet-152 (2×) architecture and MICLe pretraining, incorporating both the ImageNet and Derm datasets, on dermatology condition classification. Tables 3 and 4 compare the transfer learning performance of SimCLR and MICLe models with supervised baselines for dermatology and chest X-ray classification. These results show that after fine-tuning, our self-supervised models significantly outperform the supervised baseline when ImageNet pretraining is used (p < 0.05). We specifically observe an improvement of over 6.7% in top-1 accuracy in the dermatology task when using MICLe. On the chest X-ray task, the improvement is 1.1% in mean AUC without using MICLe.
Table 1: Performance of the dermatology skin condition and chest X-ray classification models, measured by top-1 accuracy (%) and area under the curve (AUC), across different architectures. Each model is fine-tuned using transfer learning from a model pretrained on ImageNet, on unlabeled medical data only, or on medical data initialized from an ImageNet-pretrained model (e.g., ImageNet → Derm). Bigger models yield better performance, and pretraining on ImageNet is complementary to pretraining on unlabeled medical images.

Architecture     | Dermatology Pretraining | Top-1 (%) | Chest X-ray Pretraining | Mean AUC
ResNet-50 (1×)   | ImageNet                | 62.58 ±   | ImageNet                |
ResNet-50 (1×)   | Derm                    |           | CheXpert                |
ResNet-50 (1×)   | ImageNet → Derm         | 63.44 ±   | ImageNet → CheXpert     | 0.7670 ±
ResNet-50 (4×)   | ImageNet                | 64.62 ±   | ImageNet                |
ResNet-50 (4×)   | Derm                    |           | CheXpert                |
ResNet-50 (4×)   | ImageNet → Derm         | 67.63 ±   | ImageNet → CheXpert     | 0.7687 ±
ResNet-152 (2×)  | ImageNet                | 66.38 ±   | ImageNet                |
ResNet-152 (2×)  | Derm                    |           | CheXpert                |
ResNet-152 (2×)  | ImageNet → Derm         | 68.30 ±   | ImageNet → CheXpert     | 0.7689 ±
Table 2: Evaluation of multi-instance contrastive learning (MICLe) on dermatology condition classification. Our results suggest that MICLe consistently improves the accuracy of skin condition classification over SimCLR across different datasets and architectures.

Model            | Pretraining Dataset | MICLe | Top-1 Accuracy (%)
ResNet-50 (4×)   | Derm                | No    | 66.93 ±
ResNet-50 (4×)   | Derm                | Yes   |
ResNet-50 (4×)   | ImageNet → Derm     | No    | 67.63 ±
ResNet-50 (4×)   | ImageNet → Derm     | Yes   | 68.81 ±
ResNet-152 (2×)  | Derm                | No    |
ResNet-152 (2×)  | Derm                | Yes   |
ResNet-152 (2×)  | ImageNet → Derm     | No    | 68.30 ±
ResNet-152 (2×)  | ImageNet → Derm     | Yes   | 68.43 ±

Though using ImageNet pretrained models is still the norm, recent advances have been made by supervised pretraining on large-scale (often noisy) natural image datasets [24, 29], improving transfer performance on downstream tasks. We therefore also evaluate a supervised baseline from Kolesnikov et al. [24]: a ResNet-101 (3×) pretrained on ImageNet-21k, called Big Transfer (BiT). This model contains additional architectural tweaks to boost transfer performance and was trained on a significantly larger dataset (14M images labelled with one or more of 21k classes, v.s. the 1M images in ImageNet), which provides us with a strong supervised baseline; the model is publicly available at https://github.com/google-research/big_transfer. ResNet-101 (3×) has 382M trainable parameters and is thus comparable to ResNet-152 (2×) with 233M trainable parameters. We observe that the MICLe model is better than this BiT model for the dermatology classification task, improving by 1.6% in top-1 accuracy.
Table 3: Comparison of the best self-supervised models v.s. supervised pretraining baselines on dermatology classification.

Architecture     | Method     | Pretraining Dataset | Top-1 Accuracy (%)
ResNet-152 (2×)  | Supervised | ImageNet            | 63.36 ±
ResNet-101 (3×)  | BiT [24]   | ImageNet-21k        | 68.45 ±
ResNet-152 (2×)  | SimCLR     | ImageNet            | 66.38 ±
ResNet-152 (2×)  | SimCLR     | ImageNet → Derm     | 69.43 ±
ResNet-152 (2×)  | MICLe      | ImageNet → Derm     |
Table 4: Comparison of the best self-supervised models v.s. supervised pretraining baselines on chest X-ray classification.

Architecture     | Method     | Pretraining Dataset  | Mean AUC
ResNet-152 (2×)  | Supervised | ImageNet             | 0.7625 ±
ResNet-101 (3×)  | BiT [24]   | ImageNet-21k         | 0.7720 ±
ResNet-152 (2×)  | SimCLR     | ImageNet             | 0.7671 ±
ResNet-152 (2×)  | SimCLR     | CheXpert             | 0.7702 ±
ResNet-152 (2×)  | SimCLR     | ImageNet → CheXpert  |

For the chest X-ray task, the self-supervised model is better by about 0.1% in mean AUC. We surmise that with additional in-domain unlabeled data (we only use the CheXpert dataset for pretraining), self-supervised pretraining could surpass the BiT baseline by a larger margin. At the same time, these two approaches are complementary, but we leave further exploration in this direction to future work.

We conduct further experiments to evaluate the robustness of self-supervised pretrained models to distribution shift. For this purpose, we use the models after pretraining and end-to-end fine-tuning (i.e., on CheXpert and Derm) to make predictions on an additional shifted dataset without any further fine-tuning (zero-shot transfer learning). We use the Derm-External and NIH chest X-ray datasets as our target shifted datasets.
Figure 4: Evaluation of models on distribution-shifted datasets (top: Derm → Derm-External, comparing MICLe ImageNet+Derm, SimCLR ImageNet+Derm, SimCLR ImageNet, and supervised ImageNet; bottom: CheXpert → NIH chest X-ray, comparing SimCLR ImageNet+CheXpert, SimCLR CheXpert, SimCLR ImageNet, and supervised ImageNet). Self-supervised pretraining using both ImageNet and the target-domain data significantly improves robustness to distribution shift.
Our results generally suggest that self-supervised pretrained models generalize better under distribution shift. For the chest X-ray task, we note that self-supervised pretraining with either ImageNet or CheXpert data improves generalization, but stacking them both yields further gains. We also note that when only ImageNet is used for self-supervised pretraining, the model performs worse in this setting than when in-domain data is used for pretraining. Further, we find that the performance improvement on the distribution-shifted dataset due to self-supervised pretraining (using both ImageNet and CheXpert data) is more pronounced than the original improvement on the CheXpert dataset. This is a very valuable finding, as generalization under distribution shift is of paramount importance to clinical applications. On the dermatology task, we observe similar trends, suggesting that the robustness of the self-supervised representations is consistent across tasks.
To investigate the label efficiency of the selected self-supervised models, following the previously described fine-tuning protocol, we fine-tune our models on different fractions of the labeled training data. We also conduct baseline fine-tuning experiments with supervised ImageNet pretrained models. We use label fractions ranging from 10% to 90% for both the Derm and CheXpert training datasets. Fine-tuning experiments on label fractions are repeated multiple times using the best hyperparameters and averaged.
Figure 5: Top-1 accuracy for dermatology condition classification for MICLe, SimCLR, and supervised models (ResNet-50 (4×) and ResNet-152 (2×) panels) under different unlabeled pretraining datasets and varied label fractions.

Figure 5 shows how performance varies with the available label fraction for the dermatology task. First, we observe that pretraining using self-supervised models can significantly help with label efficiency for medical image classification: at all label fractions, the self-supervised models outperform the supervised baseline. Moreover, these results suggest that MICLe yields proportionally larger gains when fine-tuning with fewer labeled examples. In fact, MICLe is able to match the baseline using only 20% of the training data for ResNet-50 (4×) and 30% of the training data for ResNet-152 (2×). Results on the CheXpert dataset are included in Appendix B.2, where we observe similar but less striking trends.
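A minimal sketch of how a label fraction could be drawn for these experiments is shown below; the paper does not specify the exact sampling procedure, so the uniform case-level subsampling here is an assumption.

```python
import numpy as np

def sample_label_fraction(num_examples: int, fraction: float,
                          seed: int = 0) -> np.ndarray:
    """Return indices of a random subset of the labeled training set.

    fraction: fraction of labels kept, e.g. 0.1 to 0.9 as in Figure 5.
    """
    rng = np.random.default_rng(seed)
    num_kept = int(round(fraction * num_examples))
    return rng.choice(num_examples, size=num_kept, replace=False)

# Example: fine-tune on 20% of the Derm training cases.
# kept = sample_label_fraction(num_examples=15340, fraction=0.2)
```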
6. Conclusion
Supervised pretraining on natural image datasets such as ImageNet is commonly used to improve medical image classification. This paper investigates an alternative strategy based on self-supervised pretraining on unlabeled natural and medical images, and finds that self-supervised pretraining significantly outperforms supervised pretraining. The paper proposes the use of multiple images per medical case to enhance data augmentation for self-supervised learning, which boosts the performance of image classifiers even further. Self-supervised pretraining is much more scalable than supervised pretraining since class label annotation is not required. A natural next step for this line of research is to investigate the limits of self-supervised pretraining by considering massive unlabeled medical image datasets. Another research direction concerns the transfer of self-supervised learning from one imaging modality and task to another. We hope this paper will help popularize the use of self-supervised approaches in medical image analysis, yielding improvements in label efficiency across the medical field.
Acknowledgement
We would like to thank Yuan Liu for valuable feedback on the manuscript. We are also grateful to Jim Winkens, Megan Wilson, Umesh Telang, Patricia Macwilliams, Greg Corrado, Dale Webster, and our collaborators at DermPath AI for their support of this work.
References

[1] Laith Alzubaidi, Mohammed A Fadhel, Omran Al-Shamma, Jinglan Zhang, J Santamaría, Ye Duan, and Sameer R Oleiwi. Towards a better understanding of transfer learning for medical imaging: a case study. Applied Sciences, 10(13):4523, 2020.
[2] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pages 15535–15545, 2019.
[3] Wenjia Bai, Chen Chen, Giacomo Tarroni, Jinming Duan, Florian Guitton, Steffen E Petersen, Yike Guo, Paul M Matthews, and Daniel Rueckert. Self-supervised learning for cardiac MR image segmentation by anatomical position prediction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 541–549. Springer, 2019.
[4] Suzanna Becker and Geoffrey E Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356):161–163, 1992.
[5] Liang Chen, Paul Bentley, Kensaku Mori, Kazunari Misawa, Michitaka Fujiwara, and Daniel Rueckert. Self-supervised learning for medical image analysis using image context restoration. Medical Image Analysis, 58:101539, 2019.
[6] Sihong Chen, Kai Ma, and Yefeng Zheng. Med3D: Transfer learning for 3D medical image analysis, 2019.
[7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[8] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020.
[9] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[10] Veronika Cheplygina, Marleen de Bruijne, and Josien PW Pluim. Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Medical Image Analysis, 54:280–296, 2019.
[11] Ozan Ciga, Anne L Martel, and Tony Xu. Self supervised contrastive learning for digital histopathology. arXiv preprint arXiv:2011.13971, 2020.
[12] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
[13] Robin Geyer, Luca Corinzia, and Viktor Wegmayr. Transfer learning by adaptive merging of multiple models. In M. Jorge Cardoso, Aasa Feragen, Ben Glocker, Ender Konukoglu, Ipek Oguz, Gozde Unal, and Tom Vercauteren, editors, Proceedings of Machine Learning Research. PMLR, 2019.
[14] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
[15] Mara Graziani, Vincent Andrearczyk, and Henning Müller. Visualizing and interpreting feature reuse of pretrained CNNs for histopathology. In MVIP 2019: Irish Machine Vision and Image Processing Conference Proceedings. Irish Pattern Recognition and Classification Society, 2019.
[16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
[17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[19] Xuehai He, Xingyi Yang, Shanghang Zhang, Jinyu Zhao, Yichen Zhang, Eric Xing, and Pengtao Xie. Sample-efficient deep learning for COVID-19 diagnosis based on CT scans. medRxiv, 2020.
[20] Michal Heker and Hayit Greenspan. Joint liver lesion segmentation and classification via transfer learning. arXiv preprint arXiv:2004.12352, 2020.
[21] Olivier J Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
[22] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. 2019.
[23] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 590–597, 2019.
[24] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (BiT): General visual representation learning. arXiv preprint arXiv:1912.11370, 2019.
[25] Gaobo Liang and Lixin Zheng. A transfer learning method with deep residual network for pediatric pneumonia diagnosis. Computer Methods and Programs in Biomedicine, 187:104964, 2020.
[26] Jingyu Liu, Gangming Zhao, Yu Fei, Ming Zhang, Yizhou Wang, and Yizhou Yu. Align, attend and locate: Chest X-ray diagnosis via contrast induced attention network with limited supervision. In Proceedings of the IEEE International Conference on Computer Vision, pages 10632–10641, 2019.
[27] Quande Liu, Lequan Yu, Luyang Luo, Qi Dou, and Pheng Ann Heng. Semi-supervised medical image classification with relation-driven self-ensembling model. IEEE Transactions on Medical Imaging, 2020.
[28] Yuan Liu, Ayush Jain, Clara Eng, David H Way, Kang Lee, Peggy Bui, Kimberly Kanada, Guilherme de Oliveira Marinho, Jessica Gallegos, Sara Gabriele, et al. A deep learning system for differential diagnosis of skin diseases. Nature Medicine, pages 1–9, 2020.
[29] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018.
[30] Scott Mayer McKinney, Marcin Sieniek, Varun Godbole, Jonathan Godwin, Natasha Antropova, Hutan Ashrafian, Trevor Back, Mary Chesus, Greg C Corrado, Ara Darzi, et al. International evaluation of an AI system for breast cancer screening. Nature, 577(7788):89–94, 2020.
[31] Afonso Menegola, Michel Fornaciali, Ramon Pires, Flávia Vasques Bittencourt, Sandra Avila, and Eduardo Valle. Knowledge transfer for melanoma screening with deep learning. pages 297–300. IEEE, 2017.
[32] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717, 2020.
[33] Behnam Neyshabur, Hanie Sedghi, and Chiyuan Zhang. What is being transferred in transfer learning? Advances in Neural Information Processing Systems, 33, 2020.
[34] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
[35] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[36] Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. Transfusion: Understanding transfer learning for medical imaging. In Advances in Neural Information Processing Systems, pages 3347–3357, 2019.
[37] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. pages 1134–1141. IEEE, 2018.
[38] Hari Sowrirajan, Jingbo Yang, Andrew Y Ng, and Pranav Rajpurkar. MoCo pretraining improves representation and transferability of chest X-ray models. arXiv preprint arXiv:2010.05352, 2020.
[39] Hannah Spitzer, Kai Kiwitz, Katrin Amunts, Stefan Harmeling, and Timo Dickscheid. Improving cytoarchitectonic segmentation of human brain areas with self-supervised siamese networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 663–671. Springer, 2018.
[40] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
[41] Michael Tschannen, Josip Djolonga, Marvin Ritter, Aravindh Mahendran, Neil Houlsby, Sylvain Gelly, and Mario Lucic. Self-supervised learning of video-induced visual invariances. IEEE Computer Society, 2020.
[42] Dong Wang, Yuan Zhang, Kexin Zhang, and Liwei Wang. FocalMix: Semi-supervised learning for 3D medical image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3951–3960, 2020.
[43] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2097–2106, 2017.
[44] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
[45] Huidong Xie, Hongming Shan, Wenxiang Cong, Xiaohua Zhang, Shaohua Liu, Ruola Ning, and Ge Wang. Dual network architecture for few-view CT trained on ImageNet data and transferred for medical imaging. In Developments in X-Ray Tomography XII, volume 11113, page 111130V. International Society for Optics and Photonics, 2019.
[46] Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6210–6219, 2019.
[47] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
[48] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[49] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747, 2020.
[50] Hong-Yu Zhou, Shuang Yu, Cheng Bian, Yifan Hu, Kai Ma, and Yefeng Zheng. Comparing to learn: Surpassing ImageNet pretraining on radiographs by comparing image representations. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 398–407. Springer, 2020.
[51] Jiuwen Zhu, Yuexiang Li, Yifan Hu, Kai Ma, S Kevin Zhou, and Yefeng Zheng. Rubik's cube+: A self-supervised feature learning framework for 3D medical image analysis. Medical Image Analysis, page 101746, 2020.
[52] Xinrui Zhuang, Yuexiang Li, Yifan Hu, Kai Ma, Yujiu Yang, and Yefeng Zheng. Self-supervised feature learning for 3D medical images by playing a Rubik's cube. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 420–428. Springer, 2019.

A. Datasets

A.1. Dermatology
Dermatology dataset details.
As in actual clinical settings, the distribution of skin conditions is heavily skewed in the Derm dataset, ranging from skin conditions that make up more than 10% of the training data, like acne, eczema, and psoriasis, to those making up less than 1%, like lentigo, melanoma, and stasis dermatitis [28]. To ensure that there was sufficient data to develop and evaluate the dermatology skin condition classifier, we filtered the 419 conditions to the top 26 with the highest prevalence based on the training set. Specifically, this ensured that for each of these conditions there were at least 100 cases in the training dataset. The remaining conditions were aggregated into an "Other" category (which comprised 21% of the cases in the test dataset). The 26 target skin conditions are as follows: Acne, Actinic keratosis, Allergic contact dermatitis, Alopecia areata, Androgenetic alopecia, Basal cell carcinoma, Cyst, Eczema, Folliculitis, Hidradenitis, Lentigo, Melanocytic nevus, Melanoma, Post-inflammatory hyperpigmentation, Psoriasis, Squamous cell carcinoma / squamous cell carcinoma in situ (SCC/SCCIS), Seborrheic keratosis, Scar condition, Seborrheic dermatitis, Skin tag, Stasis dermatitis, Tinea, Tinea versicolor, Urticaria, Verruca vulgaris, and Vitiligo.

Figure A.1 shows example images from the Derm dataset. Figure A.2 shows examples of images belonging to the same patient, taken from different viewpoints and/or of different body parts under different lighting conditions. In the Multi-Instance Contrastive Learning (MICLe) method, when multiple images of a medical condition from a given patient are available, we use two randomly selected images from all of the images that belong to this patient to directly create a positive pair of examples for contrastive learning.
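A minimal sketch of this pair-selection step is shown below; the function and argument names are illustrative assumptions.

```python
import random

def micle_positive_pair(patient_images: list, augment) -> tuple:
    """Form a MICLe positive pair for one patient case (Appendix A.1).

    patient_images: all images belonging to one patient case (1 to 6 images).
    augment: the (lightweight) augmentation applied to each selected image.
    """
    if len(patient_images) >= 2:
        # Two randomly selected, distinct images of the same condition.
        img_a, img_b = random.sample(patient_images, 2)
    else:
        # Fall back to standard SimCLR: two augmented views of the same image.
        img_a = img_b = patient_images[0]
    return augment(img_a), augment(img_b)
```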
Figure A.1: Example images from the Derm dataset. The Derm dataset includes 26 classes, ranging from skin conditions with greater than 10% prevalence, like acne, eczema, and psoriasis, to those with sub-1% prevalence, like lentigo, melanoma, and stasis dermatitis.
Derm-External dataset details.
The dataset used for evaluating the out-of-distribution generalization performance of the model on the dermatology task was collected by a chain of skin cancer clinics in Australia and New Zealand. Compared to the in-distribution dermatology dataset, this dataset has a much higher prevalence of skin cancers such as melanoma, basal cell carcinoma, and actinic keratosis. It includes 8,563 de-identified multi-image cases, which we use for the purpose of evaluating the generalization of the model under distribution shift.
A.2. CheXpert
Dataset split details.
For the CheXpert dataset [23] and the task of chest X-ray interpretation, we set up the learning task to diagnose five different thoracic pathologies: atelectasis, cardiomegaly, consolidation, edema, and pleural effusion. The CheXpert dataset's default split contains a training set of more than 200k images and a very small validation set of only 200 images. This extreme size difference is mainly because the training set is constructed using an algorithmic labeler based on free-text radiology reports, while the validation set is manually labeled by board-certified radiologists. Similar to the findings of Neyshabur et al. [33, 36], we realized that, due to the small size of the validation set and the discrepancy between the label collection procedures of the training and validation sets, high variance across studies is plausible. This variance implies that high performance on subsets of the training set would not correlate well with performance on the validation set, consequently complicating model selection from the hyperparameter sweep. Following the suggestion of Neyshabur et al. [33], in order to facilitate a robust comparison of our method to standard approaches, we define a custom subset of the training data as the validation set: we randomly re-split the full training set into 67,429 training, 22,240 validation, and 33,745 test images, respectively. This means the performance of our models is not comparable to the results reported in [23] and on the corresponding competition leaderboard (https://stanfordmlgroup.github.io/competitions/chexpert/) for this specific dataset; nonetheless, we believe the relative performance of the models is representative, informative, and comparable with [33, 36]. Figure A.3 shows example images from the CheXpert dataset, which includes both frontal and lateral radiographs.
Figure A.2: Examples of images belonging to the same patient, taken from different viewpoints and/or of different body parts under different lighting conditions. Each category, marked with a dashed line, belongs to a single patient and represents a single medical condition. In MICLe, when multiple images of a medical condition from the same patient are available, we use two randomly selected images from the patient to directly create a positive pair of examples, and then apply augmentation. When a single image of a medical condition is available, we use standard data augmentation to generate two augmented views of the same image.
CheXpert data augmentation. Due to the less versatile nature of the CheXpert dataset (see Fig. A.3), we used fairly strong data augmentation in order to prevent overfitting and improve final performance. At training time, the following preprocessing was applied: (1) random rotation by an angle δ ∼ U(−20°, 20°); (2) random crop to 224 × 224 pixels; (3) random left-right flip with probability 50%; (4) linear rescaling of the value range from [0, 255] to [0, 1], followed by random additive brightness modulation and random multiplicative contrast modulation. Random additive brightness modulation adds a uniformly sampled offset δ to all channels. Random multiplicative contrast modulation multiplies the per-channel standard deviation by a uniformly sampled factor s. After these steps, we re-clip values to the range [0, 1].
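A minimal NumPy/SciPy sketch of this preprocessing is shown below. The ±20° rotation limit follows Section 4; the brightness and contrast ranges are placeholders for the values elided in the text above.

```python
import numpy as np
from scipy.ndimage import rotate

MAX_ROTATION_DEG = 20.0       # per the rotation range stated in Section 4
BRIGHTNESS_DELTA = 0.1        # placeholder for the elided brightness range
CONTRAST_RANGE = (0.8, 1.2)   # placeholder for the elided contrast range

def preprocess_cxr(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Training-time preprocessing for a chest X-ray, per Appendix A.2.

    image: (H, W, C) uint8 array, assumed at least 224 pixels on each side.
    """
    # (1) Random rotation.
    angle = rng.uniform(-MAX_ROTATION_DEG, MAX_ROTATION_DEG)
    image = rotate(image, angle, reshape=False, mode="nearest")
    # (2) Random 224x224 crop.
    h, w = image.shape[:2]
    top = rng.integers(0, h - 224 + 1)
    left = rng.integers(0, w - 224 + 1)
    image = image[top:top + 224, left:left + 224]
    # (3) Random left-right flip with probability 50%.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # (4) Rescale to [0, 1], then brightness/contrast modulation.
    x = image.astype(np.float32) / 255.0
    x = x + rng.uniform(-BRIGHTNESS_DELTA, BRIGHTNESS_DELTA)   # additive brightness
    mean = x.mean(axis=(0, 1), keepdims=True)
    x = mean + (x - mean) * rng.uniform(*CONTRAST_RANGE)       # scales per-channel std
    return np.clip(x, 0.0, 1.0)                                # re-clip to [0, 1]
```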
Figure A.3: Example images from the CheXpert dataset. The chest X-ray images are less diverse than the ImageNet and Derm dataset examples. The CheXpert task is to predict the probability of different observations from multi-view chest radiographs, where we look for small local variations in examples using frontal and lateral radiographs.

B. Additional Results and Experiments

B.1. Dermatology Classification
B.1.1 Evaluation Details and Statistical Significance Testing
To evaluate the dermatology condition classification model's performance, we compared its predicted differential diagnosis with the majority-voted reference standard differential diagnosis (ground-truth label) using top-k accuracy and average top-k sensitivity. The top-k accuracy measures how frequently the top k predictions match any of the primary diagnoses in the ground truth. The top-k sensitivity measures this for each of the 26 conditions separately, and the final average top-k sensitivity is the average across the 26 conditions; averaging across the 26 conditions avoids biasing towards more common conditions. We use both the top-1 and top-3 metrics in this paper.

In addition to our previous results comparing MICLe and SimCLR models against the supervised baselines, the nonparametric bootstrap is used to estimate the variability around model performance and to investigate any significant improvement from using self-supervised pretrained models. Unlike the previous studies, which use confidence intervals obtained from multiple separate runs, for statistical significance testing we select the best fine-tuned model for each architecture and compute the difference in top-1 and top-3 accuracies on bootstrap replicas of the test set. Given the predictions of two models, we generate 1,000 bootstrap replicates of the test set and compute the difference in the target performance metric (top-k accuracy and AUC) for both models on each replicate. This produces a distribution for each comparison, and we use the 95% bootstrap percentile intervals to assess significance at the p = 0.05 level.

Table B.1 shows the comparison of the best self-supervised models v.s. supervised pretraining on dermatology classification. Our results suggest that MICLe models significantly (p < 0.05) outperform their SimCLR counterparts and the BiT [24] supervised model with the ResNet-101 (3×) architecture on both top-1 and top-3 accuracy. The BiT model contains additional architectural tweaks to boost transfer performance and was trained on a significantly larger dataset of 14M images labelled with one or more of 21k classes (v.s. the 1M images in ImageNet), which provides a strong supervised baseline.

Table B.1: Comparison of the best self-supervised models v.s. supervised pretraining on dermatology classification. For significance testing, we use bootstrapping to generate the confidence intervals. Our results show that the best MICLe model can significantly outperform BiT [24], which is a very strong supervised pretraining baseline trained on ImageNet-21k.

Architecture     | Method                        | Top-1 Accuracy | Top-3 Accuracy
ResNet-152 (2×)  | MICLe ImageNet → Derm (ours)  | 0.7037 ±       |
ResNet-152 (2×)  | SimCLR ImageNet → Derm [7]    | 0.6970 ±       |
ResNet-50 (4×)   | MICLe ImageNet → Derm (ours)  | 0.7019 ±       |
ResNet-50 (4×)   | SimCLR ImageNet → Derm [7]    | 0.6975 ±       |
ResNet-101 (3×)  | BiT Supervised [24]           | 0.6845 ±       |
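A minimal sketch of the bootstrap procedure described above is shown below; variable names are illustrative assumptions.

```python
import numpy as np

def bootstrap_accuracy_difference(correct_a: np.ndarray,
                                  correct_b: np.ndarray,
                                  num_replicates: int = 1000,
                                  seed: int = 0):
    """Nonparametric bootstrap for the top-k accuracy difference of two models.

    correct_a, correct_b: boolean arrays over the same test cases, where entry
    i is True if that model's top-k predictions contain a ground-truth
    diagnosis. Returns the observed difference and its 95% bootstrap
    percentile interval.
    """
    rng = np.random.default_rng(seed)
    n = len(correct_a)
    diffs = np.empty(num_replicates)
    for r in range(num_replicates):
        idx = rng.integers(0, n, size=n)   # resample test cases with replacement
        diffs[r] = correct_a[idx].mean() - correct_b[idx].mean()
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    observed = correct_a.mean() - correct_b.mean()
    return observed, (lo, hi)   # significant at p = 0.05 if the interval excludes 0
```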
B.1.2 Augmentation Selection for the Multi-Instance Contrastive (MICLe) Method

To systematically study the impact of data augmentation on the performance of our multi-instance contrastive learning framework, we consider two augmentation scenarios: (1) the standard SimCLR augmentation, which includes random color augmentation, crops with resize, Gaussian blur, and random flips; and (2) a partial, lightweight augmentation based on random cropping alone, relying only on the pair-selection step to create positive pairs. To understand the importance of augmentation composition in MICLe, we pretrain models under each augmentation scenario and investigate the performance of the fine-tuned models on the dermatology classification task. As the results in Table B.2 suggest, MICLe under partial augmentation often outperforms full augmentation; however, the difference is not significant. We leave a comprehensive investigation of the optimal augmentations to future work.
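The two scenarios can be sketched with standard torchvision transforms, as below. The augmentation strengths shown are typical SimCLR defaults rather than the exact values used in our experiments, and micle_positive_pair is a hypothetical helper illustrating how positive pairs are drawn from multiple images of the same patient:

```python
import random
from torchvision import transforms

# (1) Full SimCLR-style augmentation: crops with resize, flips, color
# augmentation, and Gaussian blur. Jitter/blur strengths below are
# common SimCLR defaults, not necessarily the values used in the paper.
full_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])

# (2) Partial, lightweight augmentation: random cropping only.
partial_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

def micle_positive_pair(patient_images, augment):
    """Draws a MICLe positive pair from the list of a patient's images.

    Sampling with replacement means a patient with a single image falls
    back to two augmented views of that image, as in standard SimCLR.
    """
    image_a, image_b = random.choices(patient_images, k=2)
    return augment(image_a), augment(image_b)
```

Under the partial scenario, the two views of a positive pair still differ substantially because they typically come from two distinct images of the same underlying pathology, which is what makes cropping alone a viable augmentation.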
B.1.3 Benefits of Longer Training
Figure B.4 shows the impact of longer training when models are pretrained for different numbers of epochs/steps. As suggested by Chen et al. [5, 8], training longer also provides more negative examples, improving the results. In this study we use a fixed batch size of 1024, and we find that with more training epochs/steps, the gap between the performance of ImageNet-initialized models and models pretrained only on medical images narrows, suggesting that ImageNet initialization facilitates convergence: starting from ImageNet weights, fewer pretraining steps are needed to reach a given accuracy.
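The connection between batch size, training length, and the number of negatives is visible directly in the contrastive objective: in a batch of N images (2N augmented views), each view is contrasted against 2(N − 1) negatives, so at a fixed batch size, more steps expose the model to more negatives in total. A compact PyTorch sketch of the SimCLR NT-Xent loss follows (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.1):
    """NT-Xent loss over 2N projected embeddings `z`, where rows i and
    i + N form positive pairs (N = batch size; z.shape[0] must be even).
    Every other embedding in the batch serves as a negative."""
    z = F.normalize(z, dim=1)
    n = z.shape[0] // 2
    sim = z @ z.t() / temperature         # (2N, 2N) cosine similarities
    sim.fill_diagonal_(float('-inf'))     # a view is never its own positive
    # The positive for row i sits at column (i + n) % (2 * n).
    targets = torch.arange(2 * n, device=z.device).roll(n)
    return F.cross_entropy(sim, targets)
```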
Table B.2: Comparison of dermatology classification performance when fine-tuning on representations learned with MICLe from different unlabeled datasets, under standard (full) and partial augmentation. Our results suggest that MICLe under partial augmentation often outperforms full augmentation.
Architecture    | Method                 | Augmentation         | Top-1 Accuracy | Top-1 Sensitivity | AUC
ResNet-152 (2×) | MICLe Derm             | Full Augmentation    | 0.6697         | 0.5060            | 0.9562
ResNet-152 (2×) | MICLe Derm             | Partial Augmentation |                |                   |
ResNet-152 (2×) | MICLe ImageNet → Derm  | Full Augmentation    | 0.6928         | 0.5136            | 0.9634
ResNet-152 (2×) | MICLe ImageNet → Derm  | Partial Augmentation | 0.6889         | 0.5300            | 0.9620
ResNet-50 (4×)  | MICLe Derm             | Full Augmentation    | 0.6803         | 0.5032            | 0.9608
ResNet-50 (4×)  | MICLe Derm             | Partial Augmentation |                |                   |
ResNet-50 (4×)  | MICLe ImageNet → Derm  | Full Augmentation    | 0.6916         | 0.5159            | 0.9618
ResNet-50 (4×)  | MICLe ImageNet → Derm  | Partial Augmentation |                |                   |
Figure B.4:
Performance of dermatology condition classification models, measured by top-1 accuracy and AUC, for SimCLR pretrained on ImageNet → Derm and on Derm alone, across different architectures (ResNet-50 1×/4×, ResNet-152 2×) pretrained for 150,000 to 450,000 steps with a fixed batch size of 1024. Training longer provides more negative examples, improving performance. The results also suggest that ImageNet initialization facilitates convergence; however, the performance gap between ImageNet-initialized models and medical-image-only models narrows with longer training.
Figure B.5:
Label efficiency progress over longer training for dermatology condition classification. The models are pretrained using ImageNet → Derm SimCLR for 150K and 450K steps and fine-tuned with varied label fractions, with supervised ImageNet pretraining as the baseline.

Furthermore, Fig. B.5 shows how performance varies with the available label fraction for the dermatology task, for models pretrained for 150K and 450K steps using SimCLR on ImageNet → Derm. These results suggest that longer pretraining yields proportionally larger gains across label fractions, and that this gain is more pronounced for ResNet-152 (2×). In fact, for ResNet-152 (2×), longer self-supervised pretraining enables the model to match the baseline using less than 20% of the training data, versus
30% of the training data with 150K steps of pretraining.

B.1.4 Detailed Performance Results
Table B.3 shows additional results for the performance of dermatology condition classification models, measured by top-1 and top-3 accuracy and area under the curve (AUC), across different architectures. Each model is fine-tuned from a checkpoint pretrained on ImageNet, on unlabeled medical data only, or on medical data initialized from an ImageNet-pretrained model. Again, we observe that bigger models yield better accuracy, sensitivity, and AUC on this task.

As shown in Table B.3, we once again observe that self-supervised pretraining with both ImageNet and in-domain Derm data is beneficial, outperforming self-supervised pretraining on ImageNet or Derm data alone. Moreover, comparing the self-supervised models with the Random and Supervised pretraining baselines, we observe that the self-supervised models significantly outperform the baselines (p < 0.05), even with smaller models such as ResNet-50 (1×).

Table B.4 shows additional dermatology condition classification performance for models fine-tuned on representations learned from different unlabeled datasets, with and without Multi-Instance Contrastive Learning (MICLe). Our results suggest that MICLe consistently improves the performance of skin condition classification over SimCLR [5, 8]. Using the statistical significance test, we observe a significant improvement in top-1 accuracy with MICLe in each dataset setting (p < 0.05).

Table B.3: Performance of dermatology condition classification models measured by top-1 and top-3 accuracy and area under the curve (AUC) across different architectures. Models are pretrained for 150K steps, and each model is fine-tuned from a checkpoint pretrained on ImageNet, on unlabeled medical data only, or on medical data initialized from an ImageNet-pretrained model. We observe that bigger models yield better performance.
Architecture    | Method                 | Top-1 Accuracy | Top-3 Accuracy | Top-1 Sensitivity | Top-3 Sensitivity | AUC
ResNet-50 (1×)  | SimCLR ImageNet        | 0.6258         |                |                   |                   |
ResNet-50 (1×)  | SimCLR Derm            |                |                |                   |                   |
ResNet-50 (1×)  | SimCLR ImageNet → Derm | 0.6344         |                |                   |                   |
ResNet-50 (1×)  | Supervised ImageNet    |                |                |                   |                   |
ResNet-50 (1×)  | Random initialization  |                |                |                   |                   |
ResNet-50 (4×)  | SimCLR ImageNet        | 0.6462         |                |                   |                   |
ResNet-50 (4×)  | SimCLR Derm            |                |                |                   |                   |
ResNet-50 (4×)  | SimCLR ImageNet → Derm | 0.6761         |                |                   |                   |
ResNet-50 (4×)  | Supervised ImageNet    |                |                |                   |                   |
ResNet-50 (4×)  | Random initialization  |                |                |                   |                   |
ResNet-152 (2×) | SimCLR ImageNet        | 0.6638         |                |                   |                   |
ResNet-152 (2×) | SimCLR Derm            |                |                |                   |                   |
ResNet-152 (2×) | SimCLR ImageNet → Derm | 0.6830         |                |                   |                   |
ResNet-152 (2×) | Supervised ImageNet    |                |                |                   |                   |
ResNet-152 (2×) | Random initialization  |                |                |                   |                   |

Table B.4:
Dermatology condition classification performance measured by top-1 accuracy, top-3 accuracy, and AUC. Models are fine-tuned on representations learned from different unlabeled datasets, with and without Multi-Instance Contrastive Learning (MICLe). Our results suggest that MICLe consistently improves the accuracy of skin condition classification over SimCLR.
Architecture    | Method                 | Top-1 Accuracy | Top-3 Accuracy | Top-1 Sensitivity | Top-3 Sensitivity | AUC
ResNet-152 (2×) | MICLe Derm             | 0.6716         |                |                   |                   |
ResNet-152 (2×) | SimCLR Derm            |                |                |                   |                   |
ResNet-152 (2×) | MICLe ImageNet → Derm  | 0.6843         |                |                   |                   |
ResNet-152 (2×) | SimCLR ImageNet → Derm | 0.6830         |                |                   |                   |
ResNet-50 (4×)  | MICLe Derm             | 0.6755         |                |                   |                   |
ResNet-50 (4×)  | SimCLR Derm            |                |                |                   |                   |
ResNet-50 (4×)  | MICLe ImageNet → Derm  | 0.6881         |                |                   |                   |
ResNet-50 (4×)  | SimCLR ImageNet → Derm | 0.6761         |                |                   |                   |
Figure B.6:
The top-1 accuracy, top-3 accuracy, and AUC for dermatology condition classification for MICLe, SimCLR, and supervised models, under different unlabeled pretraining datasets and varied label fractions. Top: ResNet-50 (4×); bottom: ResNet-152 (2×).

B.1.5 Detailed Label Efficiency Results
Figure B.6 and Table B.5 provide additional results on the label efficiency of the selected self-supervised models for the dermatology task. These results back up our finding that self-supervised pretraining can significantly improve label efficiency for medical image classification: at every label fraction, the self-supervised models outperform the supervised baseline. We also observe that MICLe yields proportionally larger gains when fine-tuning with fewer labeled examples, and this holds consistently across top-1 and top-3 accuracy, sensitivity, and AUC for the dermatology classification task.

Table B.5:
Classification accuracy and sensitivity for the dermatology condition classification task, obtained by fine-tuning SimCLR and MICLe models on 10%, 50%, and 90% of the labeled data. As a reference, a ResNet-50 (4×) fine-tuned from the supervised ImageNet model using 100% of the labels achieves 62.36% top-1 and 88.86% top-3 accuracy. Values are shown at the 10% / 50% / 90% label fractions.

Architecture    | Method                 | Top-1 Accuracy           | Top-3 Accuracy           | Top-1 Sensitivity        | Top-3 Sensitivity
ResNet-152 (2×) | MICLe ImageNet → Derm  | 0.5802 / 0.6542 / 0.6631 | 0.8548 / 0.9037 / 0.9105 | 0.3839 / 0.4795 / 0.4947 | 0.6496 / 0.7567 / 0.7720
ResNet-152 (2×) | SimCLR ImageNet → Derm | 0.5439 / 0.6260 / 0.6353 | 0.8339 / 0.8916 / 0.9081 | 0.3446 / 0.4491 / 0.4786 | 0.6243 / 0.7269 / 0.7792
ResNet-152 (2×) | SimCLR Derm            | 0.5313 / 0.6296 / 0.6522 | 0.8216 / 0.8953 / 0.9034 | 0.3201 / 0.4710 / 0.4906 | 0.6036 / 0.7373 / 0.7557
ResNet-152 (2×) | Supervised ImageNet    | 0.4728 / 0.5950 / 0.6191 | 0.7997 / 0.8597 / 0.8845 | 0.2495 / 0.4303 / 0.4677 | 0.5452 / 0.7015 / 0.7326
ResNet-50 (4×)  | MICLe ImageNet → Derm  | 0.5884 / 0.6498 / 0.6712 | 0.8560 / 0.9076 / 0.9174 | 0.3841 / 0.4878 / 0.5120 | 0.6555 / 0.7554 / 0.7771
ResNet-50 (4×)  | SimCLR ImageNet → Derm | 0.5748 / 0.6358 / 0.6749 | 0.8523 / 0.9056 / 0.9174 | 0.3983 / 0.4889 / 0.5285 | 0.6585 / 0.7691 / 0.7902
ResNet-50 (4×)  | SimCLR Derm            | 0.5574 / 0.6331 / 0.6483 | 0.8466 / 0.8995 / 0.9142 | 0.3307 / 0.4387 / 0.4675 | 0.6233 / 0.7412 / 0.7728
ResNet-50 (4×)  | Supervised ImageNet    | 0.4760 / 0.5962 / 0.6174 | 0.7823 / 0.8680 / 0.8909 | 0.2529 / 0.4247 / 0.4677 | 0.5272 / 0.6925 / 0.7379
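How the label fractions are drawn is not spelled out above; one plausible implementation is per-class (stratified) subsampling, so that every condition remains represented even at the 10% fraction. A minimal sketch under that assumption:

```python
import numpy as np

def sample_label_fraction(labels, fraction, rng=None):
    """Keeps `fraction` of the labeled examples, sampled per class so that
    rare conditions stay represented at small fractions. Stratified
    sampling is an assumption here, not a documented detail."""
    if rng is None:
        rng = np.random.default_rng(0)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_keep = max(1, int(round(fraction * len(idx))))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Example: indices for the 10%, 50%, and 90% fine-tuning subsets.
labels = np.random.default_rng(1).integers(0, 26, size=10_000)
subsets = {f: sample_label_fraction(labels, f) for f in (0.1, 0.5, 0.9)}
```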
B.1.6 Subgroup Analysis
In another experiment, we investigated whether the performance gains from pretrained self-supervised representations are evenly distributed across subgroups of interest for the dermatology task; for deployment in clinical settings, it is important that model performance be similar across such subgroups.
Figure B.7:
Performance of the different models across skin-type subgroups for the dermatology classification task. Models pretrained using self-supervised learning perform much better on the rarer skin-type subgroups.

We specifically explore top-1 and top-3 accuracy across the skin types white, beige, brown, and dark brown. Figure B.7 shows the distribution of performance across these subgroups. We observe that while the performance of the baseline supervised pretrained model drops on the rarer skin types, with self-supervised pretraining the model performance is more even across the different skin types. This exploratory experiment suggests that the learnt representations are likely general and did not pick up spurious correlations during pretraining.
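The subgroup breakdown in Figure B.7 amounts to computing the metric separately on each subgroup's slice of the test set; a minimal sketch (the subgroup encoding is assumed, not prescribed):

```python
import numpy as np

def per_subgroup_topk_accuracy(probs, labels, subgroups, k=1):
    """Top-k accuracy per subgroup, e.g. skin types 'white', 'beige',
    'brown', and 'dark brown' for the dermatology task."""
    topk = np.argsort(-probs, axis=1)[:, :k]
    hits = np.array([labels[i] in topk[i] for i in range(len(labels))])
    return {g: float(hits[subgroups == g].mean()) for g in np.unique(subgroups)}
```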
B.2. Chest X-ray Classification
B.2.1 Detailed Performance Results
For the task of chest X-ray interpretation on the CheXpert dataset, we set up the learning task to detect five pathologies: atelectasis, cardiomegaly, consolidation, edema, and pleural effusion. Table B.6 shows the per-pathology AUC on the CheXpert dataset. We once again observe that self-supervised pretraining with both ImageNet and in-domain medical data is beneficial, outperforming self-supervised pretraining on ImageNet or CheXpert alone. Moreover, the distribution of AUC across pathologies suggests that transfer learning, with both self-supervised and supervised models, provides mixed performance gains on this specific dataset. These observations are aligned with the findings of [36]. Although the effect is less pronounced, we once again observe that bigger models yield better performance.
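Since each of the five findings is scored independently, the evaluation reduces to one binary AUC per pathology plus their mean; a minimal sketch using scikit-learn (the array shapes are our assumption about how the labels and sigmoid outputs are stored):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

PATHOLOGIES = ["Atelectasis", "Cardiomegaly", "Consolidation",
               "Edema", "Pleural Effusion"]

def chexpert_aucs(y_true, y_prob):
    """Per-pathology and mean AUC for the 5-way multi-label CheXpert task.

    y_true: (n_images, 5) binary labels; y_prob: (n_images, 5) predicted
    probabilities, one independent sigmoid output per pathology.
    """
    aucs = {name: roc_auc_score(y_true[:, j], y_prob[:, j])
            for j, name in enumerate(PATHOLOGIES)}
    aucs["Mean AUC"] = float(np.mean(list(aucs.values())))
    return aucs
```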
B.2.2 Detailed Label-efficiency Results
Figure B.8 and Fig. B.9 show how performance changes with the label fraction for the chest X-ray classification task. For the ResNet-50 (4×) architecture, self-supervised models consistently outperform the supervised baseline; this trend is less striking for the ResNet-152 (2×) models. We also observe that the label-efficiency improvement is less pronounced for chest X-ray classification than for dermatology classification. We believe that with additional in-domain unlabeled data (we only use the CheXpert dataset for pretraining), self-supervised pretraining for chest X-ray classification would improve further.

Table B.6: AUC for diagnosing each pathology on the CheXpert dataset. The distribution of AUC across pathologies suggests that transfer learning, with both self-supervised and supervised models, provides mixed performance gains on this specific dataset.
Architecture    | Method                     | Atelectasis | Cardiomegaly | Consolidation | Edema | Pleural Effusion
ResNet-50 (1×)  | SimCLR ImageNet → CheXpert | 0.6561      |              |               |       |
ResNet-50 (1×)  | SimCLR CheXpert            |             |              |               |       |
ResNet-50 (1×)  | SimCLR ImageNet            |             |              |               |       |
ResNet-50 (1×)  | Supervised ImageNet        |             |              |               |       |
ResNet-50 (4×)  | SimCLR ImageNet → CheXpert | 0.6679      |              |               |       |
ResNet-50 (4×)  | SimCLR CheXpert            |             |              |               |       |
ResNet-50 (4×)  | SimCLR ImageNet            |             |              |               |       |
ResNet-50 (4×)  | Supervised ImageNet        |             |              |               |       |
ResNet-152 (2×) | SimCLR ImageNet → CheXpert | 0.6666      |              |               |       |
ResNet-152 (2×) | SimCLR CheXpert            |             |              |               |       |
ResNet-152 (2×) | SimCLR ImageNet            |             |              |               |       |
ResNet-152 (2×) | Supervised ImageNet        |             |              |               |       |
Figure B.8:
Mean AUC for chest X-ray classification using self-supervised and supervised pretrained models over varied label fractions, for the ResNet-50 (4×) and ResNet-152 (2×) architectures.
Figure B.9:
AUC for diagnosing different pathologies (atelectasis, consolidation, and edema) on the CheXpert dataset over varied label fractions, for ResNet-50 (4×).