Self-supervised driven consistency training for annotation efficient histopathology image analysis
Chetan L. Srinidhi a,b,∗, Seung Wook Kim c, Fu-Der Chen d, Anne L. Martel a,b

a Physical Sciences, Sunnybrook Research Institute, Toronto, Canada
b Department of Medical Biophysics, University of Toronto, Canada
c Department of Computer Science, University of Toronto, Canada
d Department of Electrical & Computer Engineering, University of Toronto, Canada
Abstract
Training a neural network with a large labeled dataset is still a dominant paradigm in computational histopathology. However, obtaining such exhaustive manual annotations is often expensive, laborious, and prone to inter- and intra-observer variability. While recent self-supervised and semi-supervised methods can alleviate this need by learning unsupervised feature representations, they still struggle to generalize well to downstream tasks when the number of labeled instances is small. In this work, we overcome this challenge by leveraging both task-agnostic and task-specific unlabeled data based on two novel strategies: i) a self-supervised pretext task that harnesses the underlying multi-resolution contextual cues in histology whole-slide images to learn a powerful supervisory signal for unsupervised representation learning; ii) a new teacher-student semi-supervised consistency paradigm that learns to effectively transfer the pretrained representations to downstream tasks based on prediction consistency with the task-specific unlabeled data.

We carry out extensive validation experiments on three histopathology benchmark datasets across two classification and one regression based tasks, i.e., tumor metastasis detection, tissue type classification, and tumor cellularity quantification. Under limited-label data, the proposed method yields tangible improvements, coming close to or even outperforming other state-of-the-art self-supervised and supervised baselines. Furthermore, we empirically show that the idea of bootstrapping the self-supervised pretrained features is an effective way to improve the task-specific semi-supervised learning on standard benchmarks. Code and pretrained models will be made available at: https://github.com/srinidhiPY/SSL_CR_Histo

Keywords: Self-supervised learning, Consistency training, Semi-supervised learning, Limited annotations, Histology image analysis, Digital pathology.
1. Introduction
∗ Corresponding author. E-mail address: [email protected] (Chetan L. Srinidhi)

Deep neural network models have achieved tremendous success in obtaining state-of-the-art performance on various histology image analysis tasks ranging from disease grading and cancer classification to outcome prediction (Srinidhi et al., 2021; Bera et al., 2019; Litjens et al., 2017). The main success of these methods is attributed to the availability of large-scale open datasets with clean manual annotations. However, collecting such a large corpus of labeled data is often expensive, laborious and requires skillful domain expertise, notably in the histopathology domain (Madabhushi and Lee, 2016). Recently, self-supervised and semi-supervised approaches have become increasingly popular to alleviate the annotation burden by leveraging the readily available unlabeled data that can be trained with limited supervision. These methods have recently demonstrated promising results on various computer vision (Jing and Tian, 2020; Laine and Aila, 2016; Sohn et al., 2020) and medical image analysis tasks (Chen et al., 2019; Tellez et al., 2019b; Li et al., 2020c). In this paper, we focus on the self-supervised driven semi-supervised learning paradigm for histology image analysis by efficiently exploiting the underlying information present in unlabeled data, both in task-agnostic and task-specific ways.

The existing plethora of self-supervised learning (SSL) methods can be viewed as defining a surrogate task, i.e., a pretext task, which is formulated using only unlabeled examples and requires a high-level semantic understanding of the image to solve (Jing and Tian, 2020). The neural network model trained to solve this pretext task often learns useful visual representations that can be transferred to any downstream task to solve the task-specific problem. On the other hand, another important stream of work is based on semi-supervised learning (SmSL), which seeks to learn from both labeled and unlabeled examples, with limited manual annotations (Chapelle et al., 2010).
Among SmSL methods, the most recent and popular stream of approaches is based on consistency regularization (Laine and Aila, 2016; Sajjadi et al., 2016) and pseudo-labeling (hyun Lee, 2013; Sohn et al., 2020). The consistency enforcing strategy aims to constrain network predictions to be invariant to input or model weight perturbations, such as adding noise to the input data through different image augmentations (Xie et al., 2019), network dropout (Srivastava et al., 2014) and stochastic depth (Huang et al., 2016). The main idea is that the model should predict similar labels for both the input image and a perturbed (augmented) version of the same image. Approaches of this kind include temporal ensembling (Laine and Aila, 2016), mean teacher (Tarvainen and Valpola, 2017) and virtual adversarial training (Miyato et al., 2018). Alternatively, pseudo-labeling imputes artificial (pseudo) labels for unlabeled data, obtained from the class predictions of a model trained using labeled data alone (Sohn et al., 2020). The success of these SmSL approaches is attributed to the fact that the models implicitly learn to fit decision boundaries by grouping similar images to share similar labels, forming high-density clusters in the input feature space.

Despite significant advancements among SSL and SmSL approaches, they still suffer from some major limitations. Several SSL methods assume that optimizing the pretext task objective will invariably yield suitable downstream representations for the target task. However, many recent studies (Zoph et al., 2020; Yan et al., 2020; Goyal et al., 2019) have shown that SSL methods overfit to the pretraining objective and may not generalize well to the downstream task. On the other hand, methods based on SmSL approaches generally struggle to learn effectively when the number of labeled instances is scarce and the labels are noisy (Rebuffi et al., 2020).
This is a typical scenario in histopathology, where the number of manually labeled annotations is small and labels are often noisy (Shi et al., 2020). Furthermore, when the ratio of labeled to unlabeled samples is highly imbalanced, models trained solely on the consistency strategy have very low accuracy and higher entropy, which prevents them from achieving high-confidence scores (i.e., pseudo labels) on unlabeled data (Kim et al., 2020).

To address these shortcomings, several recent studies explored the feasibility of integrating the merits of both SSL and SmSL approaches to efficiently exploit the limited available labeled target data with abundant unlabeled data, to enhance the performance on downstream tasks (Zhai et al., 2019; Rebuffi et al., 2020). These approaches first aim to initialize a good latent representation of the data by formulating a pretext objective in a task-agnostic way, without using any labels. Later, these pretrained representations are effectively transferred to the downstream tasks by reinitializing these features via an SmSL approach in a task-specific way. The idea of bootstrapping features trained via an SSL algorithm has been shown to improve on an SmSL approach by preventing overfitting on the target domain (Zhai et al., 2019).

In this paper, we take inspiration from the above observations and propose a novel self-supervised driven semi-supervised learning framework for histopathology image analysis, which harnesses the unlabeled data in both a task-agnostic and a task-specific manner. To this end, we first present a simple yet effective self-supervised pretext task, namely Resolution Sequence Prediction (RSP), which leverages the multi-resolution contextual information present in the pyramidal nature of histology whole-slide images (WSIs). Our design choice is inspired by the way a pathologist searches for cancerous regions in a WSI.
Typically, a pathologist zooms in and out of each region, where the tissue is examined at high to low resolution to obtain the details of individual cells and their surroundings. In this work, we show that exploiting such meaningful multi-resolution contextual information provides a powerful surrogate supervisory signal for unsupervised representation learning. Second, we further develop a 'teacher-student' semi-supervised consistency paradigm by efficiently transferring the self-supervised pretrained representations to downstream tasks. Our approach can be viewed as a knowledge distillation method (Hinton et al., 2015), where the self-supervised teacher model learns to generate pseudo labels for the task-specific unlabeled data, which forces the student model to make predictions consistent with the teacher model. We experimentally show that initializing the student model with the SSL pretrained teacher model achieves robustness against noisy input data (i.e., noise is injected through various kinds of domain-specific augmentations) and helps learn faster than the teacher in practice. Our whole framework is trained in an end-to-end manner to seamlessly integrate the information present in labeled and unlabeled data in both task-specific and task-agnostic ways.

The major contributions of this paper are:

• We propose a novel self-supervised pretext task for generating unsupervised visual representations via predicting the resolution sequence ordering in the pyramidal structure of histology WSIs.

• We compare against state-of-the-art self-supervised pretraining methods based on generative and contrastive learning techniques: Variational Autoencoder (VAE) (Kingma and Welling, 2013) and Momentum Contrast (MoCo) (He et al., 2020), respectively.
• We present a new 'teacher-student' semi-supervised consistency paradigm by efficiently transferring the self-supervised pretrained representations to downstream tasks based on prediction consistency with the task-specific unlabeled data.

• We extensively validate our method on three benchmark datasets across two classification and one regression based histopathology tasks, i.e., tumor metastasis detection, tissue type classification, and tumor cellularity quantification. The proposed self-supervised method, along with consistency training, is shown to improve the performance on all three datasets, especially in the less annotated data regime.

The paper is organized as follows: we first briefly introduce the related work in Section 2. In Section 3, we present the details of our proposed methodology. Datasets and experimental results are described in Section 4. Finally, we discuss the key findings and limitations of our work in Section 6, followed by the conclusion in Section 7.
2. Related works
In this section, for brevity, we review only the recent developments in the self-supervised and semi-supervised representation learning literature that are closely relevant to our work.
Self-supervised (SSL) representation learning has recently gained momentum in many medical image analysis tasks for reducing the manual annotation burden. These approaches aim to construct different types of auxiliary pretext tasks, where the supervisory signals are generated from the data itself. Such pretraining of a convolutional neural network (CNN) designed to solve these pretext tasks results in useful feature representations that can be used to initialize a subsequent CNN on data with limited labels. The design of a pretext task is often based on domain-specific knowledge, like image context restoration (Chen et al., 2019), anatomical position prediction (Bai et al., 2019), 3D distance prediction (Spitzer et al., 2018), Rubik's cube recovery (Zhuang et al., 2019) and image intrinsic spatial offset prediction (Blendowski et al., 2019). For instance, Chen et al. (2019) proposed an image context restoration task for 2D fetal ultrasound image classification, CT abdominal multi-organ localization, and tumor segmentation in brain MR images. Blendowski et al. (2019) extend the context prediction task to 3D by designing image-intrinsic spatial offset relations to learn pretrained features. Similarly, Zhuang et al. (2019) extend the self-supervised approach to 3D volumetric medical images by solving a Rubik's cube recovery task for brain hemorrhage classification and tumor segmentation. Bai et al. (2019) proposed to learn anatomical position prediction as a supervisory signal for cardiac MR image segmentation. Spitzer et al. (2018) designed a pretext task based on 3D distance prediction between two sampled patches from the same subject for segmenting brain areas as the target task.
Many such pretext tasks are designed based on ad-hoc heuristics, limiting the generalizability of the learned representations.

An alternative stream of approaches is based on generative modeling (such as the VAE (Kingma and Welling, 2013), GAN-based models (Dumoulin et al., 2016; Donahue et al., 2016) and other variants), which implicitly learn representations by minimizing the reconstruction loss in the pixel space. Compared with discriminative ones, generative approaches are overly focused on pixel-level details, thus limiting their ability to model complex structures present in an image. Recently, a new family of discriminative methods has been proposed based on contrastive learning, which learns to enforce similarities in the latent space between similar/dissimilar pairs (He et al., 2020; Oord et al., 2018). In such methods, similarity is defined through maximising mutual information (Oord et al., 2018) or via different data transformations (Chen et al., 2020). For example, Lu et al. (2019) combined attention based multiple instance learning with contrastive predictive coding for weakly supervised histology classification. Chaitanya et al. (2020) extended the contrastive learning approach to segmentation of volumetric medical images by utilizing domain and problem-specific cues for efficient segmentation in three MRI datasets. Finally, Li et al. (2020b) proposed a patient feature-based softmax embedding to learn multi-modal SSL representations for diagnosing retinal disease.

The existing semi-supervised learning (SmSL) techniques can be broadly categorized into three groups: i) adversarial training-based (Zhang et al., 2017; Diaz-Pinto et al., 2019; Quiros et al., 2019); ii) graph-based (Shi et al., 2020; Javed et al., 2020; Aviles-Rivero et al., 2019); and iii) consistency-based (Li et al., 2020a; Zhou et al., 2020; Li et al., 2020c; Su et al., 2019; Liu et al., 2020) approaches.
Adversarial training based SmSL approaches learn a generative and a discriminative model simultaneously by forcing the discriminator to output class labels, instead of estimating the input probability distribution as in a normal generative adversarial network (GAN). For example, Zhang et al. (2017) proposed a segmentation and evaluation network, where the segmentation network is encouraged to obtain segmentation masks for unlabeled images, while the evaluation network is forced to distinguish the segmentation results from an annotated mask by assigning different scores. Diaz-Pinto et al. (2019) proposed a GAN framework for retinal image synthesis by utilizing both labeled and unlabeled data for training a glaucoma classifier. Meanwhile, Quiros et al. (2019) generated pathologically meaningful representations to synthesize high fidelity H&E breast cancer tissue images, which resemble real tissue. On the other hand, graph based methods construct a graph that establishes a semantic relationship between its neighbors and utilize the transduction of the graph to assign labels to unlabeled data via label propagation. As a typical example, Aviles-Rivero et al. (2019) proposed a graph-based SSL model for chest X-ray classification, where the pseudo labels for unlabeled data are generated using label propagation. In histology, Javed et al. (2020) introduced a graph-based community detection algorithm for identifying seven tissue phenotypes in WSIs. In more recent work, Shi et al. (2020) utilized a graph-based self-ensembling approach to create an ensemble target for each label prediction using an exponential moving average (EMA), and minimized the distance between the label prediction and its ensemble target via a consistency cost.
Such self-ensembling based approaches are shown to be robust to noisy labels compared to other graph-based methods.

The most recent line of work in SmSL is based on consistency regularization, which enforces the consistency of predictions under random perturbations such as data augmentations (French et al., 2017), stochastic regularization (Laine and Aila, 2016; Sajjadi et al., 2016), and adversarial perturbation (Miyato et al., 2018). More recently, Tarvainen and Valpola (2017) proposed the mean teacher (MT) framework that averages the model weights instead of the EMA of the label predictions to enhance the quality of consistency targets. These strategies were recently extended to several medical image analysis tasks. For instance, Li et al. (2020c) introduced a transformation consistent self-ensembling model for segmenting three medical datasets. Several extensions to MT have also been explored by enforcing prediction consistency in a region-based (Zhou et al., 2020), relation-based (Liu et al., 2020; Su et al., 2019) or cross-domain based (Li et al., 2020a) manner, subject to various domain-specific perturbations.
3. Methods
An overview of the proposed self-supervised driven consistency training approach is illustrated in Fig. 1. Our framework consists of three main stages: i) we pretrain a self-supervised model F_pre on an unlabeled set D_pre to obtain task-agnostic feature representations; ii) we fine-tune the SSL model on a limited amount of labeled data D_fl to obtain the task-specific features; iii) we further aim to improve the downstream performance on the target task by using both labeled D_fl and unlabeled D_fu data in a task-specific semi-supervised manner. Both teacher F_t and student F_s networks are initialized with the fine-tuned model F_ft for consistency training on the target task. The main objective is to optimize the student network, which learns to minimize the supervised loss on the labeled set (D_fl) and the consistency loss on the unlabeled set (D_fu). During consistency training, the teacher network is trained to predict the pseudo-label on a weakly augmented unlabeled image. A student network then tries to match this pseudo-label by making its prediction on a strongly augmented version of the same unlabeled image. We update only the student network weights during training while keeping the teacher network weights frozen, and we make the student the new teacher after every epoch and iterate until convergence.

We first start by describing the self-supervised representation learning framework based on three different pretext categories (Jing and Tian, 2020), namely: context-based, generative-based and contrastive-based methods. The pretext tasks are designed to solve the task-agnostic problem in a self-supervised manner, where the class labels to train the network are generated automatically from the data itself. These pretrained representations can be transferred to multiple downstream tasks by fine-tuning a network on the limited labeled training examples in a task-specific way.
In our work, we first start pretraining a convolutional neural network (convNet) F_pre on an unlabeled pretraining set D_pre to obtain generalized feature representations in a task-agnostic manner.

Let us denote the pretraining set as D_pre = {x_i}_{i=1}^M, consisting of M unlabeled training samples. In histopathology, the input x_i ∈ R^{H×W×3} denotes an RGB image patch sampled from a gigapixel WSI, with height (H) and width (W); and y_i ∈ C is the class label for x_i, with C = {0, 1} for classification or R for regression. Our goal is to learn a feature embedding F_θ(.) in an unsupervised manner that maps an unlabeled sample x_i to a low-dimensional embedding F_θ(x_i): R^{H×W×3} → R^d, with d being the feature dimension and F(.) denoting the neural network parameterized by θ.

Given a set of M training samples D_pre = {x_i}_{i=1}^M, the self-supervised training aims to optimize the following objective:

L_pre = min_θ (1/M) Σ_{i=1}^M loss(x_i, p_i),   (1)

where p_i are the pseudo labels generated automatically from the self-supervised pretext tasks. In this paper, we investigate several popular self-supervised pretraining paradigms for histopathology, including the generative-based Variational Autoencoder (VAE), the contrastive-based Momentum Contrast (MoCo), and finally, the proposed context-based
Resolution Sequence Prediction (RSP) framework. The details are presented next.
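The objective in Eq. (1) is simply an average pretext loss over automatically pseudo-labeled samples. As a minimal, framework-free sketch (the names `pretext_objective` and `pseudo_label_fn` are illustrative, not from the paper):

```python
def pretext_objective(samples, pseudo_label_fn, model, loss):
    """Empirical form of Eq. (1): average loss between the model's prediction
    on x_i and the pseudo-label p_i generated from the data itself."""
    total = 0.0
    for x in samples:
        p = pseudo_label_fn(x)  # e.g. a resolution-sequence class; no manual annotation
        total += loss(model(x), p)
    return total / len(samples)
```

Minimizing this quantity over the model parameters θ yields the task-agnostic representation; the three pretext paradigms considered here differ only in how the pseudo-labels and the loss are defined.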
Figure 1: Our self-supervised driven consistency training approach for histopathology image analysis. Our approach consists of three main stages: i) we pretrain a self-supervised (SSL) model F_pre on the unlabeled set D_pre to obtain task-agnostic feature representations; ii) we fine-tune the SSL model on a limited amount of labeled data D_fl to obtain the task-specific downstream features; iii) we further improve the downstream performance on the target task by using both the labeled (D_fl) and unlabeled (D_fu) sets in a task-specific semi-supervised manner. Both teacher F_t and student F_s networks are initialized with the fine-tuned model F_ft for consistency training on the target task. The main objective is to optimize the student network, which learns to minimize the supervised loss L_s on the labeled set D_fl and the consistency loss L_c on the unlabeled set D_fu. The consistency loss is measured between the pseudo labels produced by the teacher network on weakly augmented unlabeled input images and the labels predicted by the student network on strongly augmented unlabeled input images. Note: all the above networks (F_pre, F_ft, F_t, F_s) share the same backbone ResNet-18 architecture.

Our self-supervised design choice for the "Resolution Sequence Prediction (RSP)" task is inspired by how a pathologist examines a WSI during diagnosis for potential abnormalities. Typically, a pathologist switches multiple times between lower magnification levels for context and higher magnification levels for detail. Such multi-resolution, multi-field-of-view (FOV) analysis is possible due to the WSI's pyramidal nature, where multiple downsampled versions of the original image are stored in a pyramidal structure.

In this work, we exploit this multi-resolution nature of WSIs by proposing a novel self-supervised pretext task, which learns image representations by training convNets to predict the order of all possible sequences of resolution that can be created from the input multi-resolution patches. We argue that solving this resolution prediction task will allow a CNN to learn useful visual representations that inherently capture both contextual information (at lower magnification) and fine-grained details (at higher magnification levels).

Specifically, we create 3-tuples of randomly shuffled multi-resolution patches sampled from the input WSI, and formulate our resolution sequence prediction task as a multi-class classification problem. Formally, we construct a tuple of three concentric multi-resolution RGB image patches (S1, S2, S3) ∈ R^{P×P×3} extracted at three different magnification levels, such that the spatial resolution of S1 << S2 << S3 (measured in µm/px). We extract multiple concentric same-size patches (P × P × 3) such that the highest magnification patch (S1) lies inside the central square region of the other two (S2, S3) lower magnification patches. A sample set of multi-resolution concentric patches is shown in Fig. 2. These sets of patches form an input tuple to our self-supervised RSP framework. For brevity, we only consider a tuple of three input patches from a given WSI, for which there are 3! = 6 possible resolution sequence orderings, as illustrated in Fig. 2.

To achieve our goal, given an input multi-resolution sequence x ∈ (S1, S2, S3)_P ∈ R^{P×P×3} among the P possible permutations, we aim to train a siamese convNet model (Koch et al., 2015) F_pre to predict the label y ∈ R^P (i.e., the order of resolution sequences over the P ∈ {1, 2, ..., 6} possible classes), which is given by:

F_pre(x | θ) = {F^y_pre(x | θ)}_{y=1}^P,   (2)

where F^y_pre(x | θ) is the predicted class probability for the input sequence x with label y, and θ is the learnable parameter of the model F(.). Therefore, given a set of M training samples from the unlabeled set D_pre = {x_i}_{i=1}^M, the convNet model learns to solve the objective function defined in Eq. 1 by minimizing the categorical cross-entropy (CE) loss defined by:

loss(x_i, y_i) = −log(F^{y_i}_pre(x_i | θ)).   (3)

Figure 2: The resolution sequence prediction (RSP) pretext task that we propose for self-supervised representation learning. Given a tuple of three input multi-resolution patches, sampled from the 3! = 6 possible resolution sequences, the model F(.) is trained to predict the label y ∈ R^P corresponding to the order of the resolution sequence, where P ∈ {1, 2, ..., 6}. The framework consists of three stages: i) feature extraction; ii) pair-wise feature extraction; and iii) resolution sequence prediction.

The proposed RSP framework has three main stages: i) feature extraction; ii) pair-wise feature extraction; and iii) resolution sequence prediction. In the first stage, we adopt a siamese based architecture to obtain features for each input multi-resolution patch, where all three network branches share the same parameters. In our work, we adopt the commonly used ResNet-18 model to obtain the features h_i = F(x_i) after the global average pooling layer, where h_i ∈ R^d is a latent vector of dimension 512.

An additional crucial part of self-supervised pretraining is preparing the training data. To prevent the model from picking up on low-level cues and learning only trivial features, we make the sequence prediction task more difficult by applying various geometric transformations on the input data. The details of these geometric transformations are discussed thoroughly in Section 4.2.
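The construction of the 3! = 6 permutation classes for the pretext task can be sketched in a few stdlib lines; `make_rsp_example` is a hypothetical helper name, and the patch placeholders stand in for the actual image tensors:

```python
import itertools
import random

# All 3! = 6 orderings of the three magnification levels; the index of an
# ordering in this list serves as the class label for the pretext task.
RESOLUTION_ORDERS = list(itertools.permutations([0, 1, 2]))

def make_rsp_example(patches, rng=random):
    """Given the tuple of concentric multi-resolution patches [S1, S2, S3],
    return (shuffled_patches, class_label in 0..5)."""
    order = rng.choice(RESOLUTION_ORDERS)           # pick one of 6 sequences
    shuffled = [patches[i] for i in order]          # shuffle the tuple
    return shuffled, RESOLUTION_ORDERS.index(order) # label = permutation index
```

The siamese network then receives the shuffled tuple and is trained with cross-entropy (Eq. 3) against the permutation index, so no manual annotation is ever required.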
In the second stage, we perform pair-wise feature extraction on the extracted feature vectors h_i to capture the intrinsic relationship between the multi-resolution frames. Specifically, we concatenate the features of each pair of input patches (i.e., Concat(h1, h2), Concat(h2, h3), Concat(h1, h3)) to obtain a feature vector h_ij ∈ R^{d=1024}. Next, we use a multi-layer perceptron (MLP) with one hidden layer to obtain z_i = g(h_ij) = W_2 σ(W_1 h_ij), where σ denotes a ReLU and the bias is ignored for simplicity. Finally, in the third stage, the pair-wise features (z_i's) are concatenated and passed to a final linear layer that predicts over the P resolution sequence classes.

Momentum Contrast (MoCo) (He et al., 2020) is one of the most popular self-supervised models and even outperforms supervised baseline models. Given a data point x in a dataset, MoCo samples a positive pair k+ and N negative pairs k−_1, ..., k−_N. MoCo is trained with the infoNCE loss (Oord et al., 2018), defined as

L_infoNCE = − E_{x∼p(x)} [ log ( exp(F_q(x) · F_k(k+)/τ) / (exp(F_q(x) · F_k(k+)/τ) + Σ_{i=1}^N exp(F_q(x) · F_k(k−_i)/τ)) ) ],   (4)

where F_q and F_k are neural networks, and τ is a temperature hyperparameter. This is the log loss of a softmax classifier which minimizes the difference between the representation F_q(x) and its positive pair F_k(k+) while maximizing the differences between F_q(x) and the negative pairs F_k(k−_1, ..., k−_N). Note that minimizing L_infoNCE maximizes a lower bound on the mutual information between x and k+ (Oord et al., 2018). However, the bound is not tight for a small number of N; therefore, in practice, we need to use a large number of negative samples in each iteration. As this is not practical for computational efficiency, MoCo maintains a large queue of encoded data. At each training iteration, the entire mini-batch, consisting of a positive sample and negative samples, is inserted into the queue.
Therefore, we use the entire queue (except the positive sample) as the set of negatives for the infoNCE loss. One of the key observations made by MoCo is that this can be problematic if the encoder F changes too quickly, as this would cause a large discrepancy between the distribution of the samples in the queue and the new samples, and the classifier could easily decrease the loss. To solve this problem, MoCo uses two networks: the encoder F_q with parameters θ_q and the momentum encoder F_k with parameters θ_k. F_k is not trained with the infoNCE loss but is updated with momentum parameter m:

θ_k = m θ_k + (1 − m) θ_q,   (5)

after each training iteration. We use a queue size of 8192 and m of 0.999, and adopt multiple augmentation schemes. In each training iteration, for each data point x, we randomly i) jitter the brightness, contrast, saturation, and hue (jitter factor 0.6), ii) rotate by up to 360 degrees, iii) flip vertically & horizontally, and iv) crop with an area in the range 0.7 ∼ 1.0 of the original image.

The Variational autoencoder (VAE) (Kingma and Welling, 2013) is an unsupervised machine learning model that is often used for dimensionality reduction and image generation. The model contains an encoder and a decoder, with a latent space that has a dimension smaller than the input data. The reduction in dimension in the latent space helps extract the prominent information in the original data. Unlike vanilla autoencoders, the VAE assumes that the input data come from some latent distribution z ∼ N(0, I). The encoder estimates the mean (µ) and variance (σ) of the data in the latent space, and the decoder samples a point from the distribution for data reconstruction. The assumption of z following a normal distribution and the stochastic property of the latent vector force the model to create a continuous latent space, with similar data closer in the space. This resolves the model overfitting due to irregularities in the latent space often observed in conventional autoencoders. The learning rule of the VAE is to maximize the evidence lower bound (ELBO):

ELBO = E_{z∼q(z|x)} [log p(x|z)] − KL(q(z|x) || p(z)) ≤ log p(x),   (6)

where q is the approximate posterior distribution for p(z|x). The first term describes the reconstruction loss of the autoencoder model. The second term can be seen as a regularizer that forces the approximate latent distribution to be close to N(0, I). Standard stochastic gradient descent methods cannot be applied directly to the model because of the stochastic property of the latent vector. The solution, called the reparameterization trick, is to introduce a new random variable ε ∼ N(0, I) as a model input and set the latent vector to z = µ(x) + σ^{1/2}(x) · ε. This makes all model parameters deterministic.

For our VAE model, we use a ResNet-18 model to encode an input image of size 256 × 256 to a latent vector of size 512. Then, we use the generator from the BigGAN model (Brock et al., 2018) to reconstruct the latent vector back to the original image.
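A minimal sketch of the reparameterization trick and the closed-form KL regularizer of Eq. (6) for a diagonal Gaussian posterior (function names are illustrative; a real implementation would operate on framework tensors rather than Python lists):

```python
import math
import random

def reparameterize(mu, var, rng=random):
    """Reparameterization trick: z = mu(x) + var(x)^{1/2} * eps, eps ~ N(0, I),
    which makes the sampling step differentiable w.r.t. mu and var."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]

def kl_to_standard_normal(mu, var):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior --
    the regularizer term in the ELBO of Eq. (6)."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))
```

A full VAE would add the reconstruction term E_q[log p(x|z)] computed from the decoder output to complete the ELBO.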
The unsupervised learned representations are now transferred to the downstream task using limited labeled data D_fl in a task-specific way. It is common practice to fine-tune the entire pretrained network when the downstream data is large and similar to the original pretraining data. Hence, we choose to fine-tune all layers in the pretrained network F_pre by initializing with the pretrained weights to obtain task-specific representations: F_ft(x_i) = W_ft F_pre(x_i), where W_ft is the weight of the task-specific linear layer. Specifically, we fine-tune the entire network (all layers) with limited labels, along with a linear classifier or a regressor trained on top of the learned representations to obtain task-specific predictions.

The goal of consistency training (CR) is to obtain similar model predictions for differently augmented versions of the same input image (Laine and Aila, 2016; Sajjadi et al., 2016). We leverage this idea to further improve the task-specific (downstream) performance by using a second set of unlabeled data D_fu in a task-specific semi-supervised manner. In general, most existing SSL approaches utilize the entire task-specific training set D_f to fine-tune the pretrained model on the downstream tasks. The main objective of SSL is to develop universal feature representations that can solve a wide variety of tasks on many datasets. Although many recent pretraining approaches (Chen et al., 2020; He et al., 2020; Chen et al., 2019; Chaitanya et al., 2020) have shown tremendous success in both computer vision and medical imaging, they still fail to adapt to new target tasks. A recent study by Zoph et al. (2020) reveals that the value of pretraining diminishes with stronger data augmentation and with the use of more task-specific labeled data.
Further, the authors have shown that self-supervised pretraining is beneficial only when fine-tuned with a limited amount of labeled data, whereas model performance deteriorates with the use of a more extensive label set. This raises an important question: to what degree does SSL work, and how much labeled data do we need to fine-tune the pretrained SSL model?

In our work, we focus on answering the above question by performing a set of control experiments, varying the amount of labeled data in both low-data and high-data regimes on three different histology datasets. To this end, we provide an elegant solution based on teacher-student consistency training to improve the downstream performance by exploiting the unlabeled data in a task-specific semi-supervised manner.

Our teacher-student consistency training (shown in Fig. 1) has three main steps:

i) We initialize the fine-tuned model F_ft as both the teacher F_t and the student F_s network. The teacher model weights are frozen across all layers (the entire network) except the last linear layer (classifier/regressor), while the student model weights are frozen only up to the output of the global average pooling layer, with an MLP with one hidden layer and a linear classifier/regressor trained on top of the learned task-specific feature representations.

ii) We use the teacher network F_t to generate pseudo labels on the deliberately noised unlabeled data D_u^f. Next, a student network F_s is trained via both a standard supervised loss (on labeled data) and a consistency loss (on unlabeled data); i.e., the supervised loss is evaluated by comparing against the ground-truth labels (cross-entropy (CE) for the classification task / mean squared error (MSE) for the regression task), while the consistency loss (CE for classification / MSE for regression) is obtained by comparing against the pseudo labels (i.e., logits for regression / one-hot labels for classification) of the teacher model.
iii) We update the weights of only the student model and iterate these steps by treating the student as a new teacher after every epoch, to relabel the unlabeled data and train a new student. In this way, our teacher-student consistency approach propagates the label information to the unlabeled data by constraining the model predictions to be consistent on the unlabeled data under different data augmentations.

We start by describing our method in the context of the semi-supervised learning (SmSL) paradigm for the downstream task. Let us consider the training data D^f (fine-tuning set), in which N samples are labeled inputs, D_l^f = {x_i, y_i}_{i=1}^{N}, and µN are unlabeled inputs, D_u^f = {x_i}_{i=1}^{µN}, where µ is a hyperparameter that determines the relative ratio of D_l^f and D_u^f. In practice, we include all labeled instances of D_l^f as part of the unlabeled set, without using their labels, when constructing D_u^f. Further note that we use different batch sizes for the labeled and unlabeled data, such that µN >> N. Formally, we aim to minimize the following objective (total loss):

min_θ Σ_{i=1}^{N} L_s(F_s(x_i; θ_s), y_i) + λ L_c({x_i}_{i=1}^{µN}; F(·), θ_t, η_w, θ_s, η_s),   (7)

where L_s is the supervised loss measured against the labeled inputs and L_c is the consistency loss evaluated between the same unlabeled inputs under different data augmentations. The term λ is the weighting factor, empirically set to 1, that controls the trade-off between the supervised and consistency losses. F(·) denotes the ConvNet model parameterized by θ, where θ_t and θ_s are the weights of the teacher and student networks, respectively, while η_w and η_s represent the weak and strong data augmentations applied to the teacher and student models, respectively.

Earlier works on consistency training (Sohn et al., 2020; Xie et al., 2019; Tarvainen and Valpola, 2017; Liu et al., 2020; Li et al., 2020) mainly focused on improving the quality of the consistency targets (pseudo labels) using either of two strategies: i) careful selection of domain-specific data augmentations; or ii) selection of a better teacher model rather than a simple replication of the student network. However, these approaches have some limitations. First, the predicted pseudo labels for the unlabeled data may be incorrect, since the model itself is used to generate them. If a high weight is assigned to these pseudo labels, the quality of learning may be hampered by misclassification, and the model may suffer from confirmation bias (Arazo et al., 2020). Second, instead of using a converged model (such as a pretrained one) to generate pseudo labels with high confidence scores, the models are trained from scratch, leading to lower accuracy and high entropy.

In this work, we aim to overcome these limitations by providing a solution that leverages the advantages of both strategies in a simple, efficient manner.
The main difference between our approach and other existing consistency training methods is two-fold: i) we make use of the task-specific fine-tuned model to generate high-confidence (i.e., low-entropy) consistency targets, instead of relying on the model being trained; ii) we experimentally show that, by aggressively injecting noise through various domain-specific data augmentations, the student model is forced to work harder to maintain consistency with the pseudo-labels produced by the teacher model. This ensures that the student network does not merely replicate the teacher's knowledge.

More formally, we define the consistency loss L_c for the regression task as the distance between the prediction of the teacher network F_t (with weights θ_t and noise η_w) and the prediction of the student model F_s (with weights θ_s and noise η_s):

L_c^regression = Σ_{i=1}^{µN} E_{η_w, η_s} || F_t(x_i, θ_t, η_w) − F_s(x_i, θ_s, η_s) ||²,   (8)

where x_i denotes each unlabeled training sample. In contrast, for the classification task, the consistency loss is calculated via the standard cross-entropy (CE) loss, defined by:

L_c^classification = Σ_{i=1}^{µN} max(q_i) · H(arg max(q_i), q̂_i),   (9)

where q_i = p_t(y_i | η_w(x_i)) is the class probability predicted by the teacher network F_t for input x_i under weak augmentation (η_w), and q̂_i = p_s(y_i | η_s(x_i)) is the class probability predicted by the student network F_s for input x_i under strong augmentation (η_s). H(·) denotes the CE between two probability distributions, and arg max(q_i) is the pseudo-label produced by the teacher network on the weakly augmented unlabeled input image η_w(x_i). In this work, we leverage two kinds of augmentations: weak and strong. The weak augmentation includes a simple horizontal flip and cropping, while for the strong augmentation we use the RandAugment technique (Cubuk et al., 2020).
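A minimal, framework-free sketch of the two consistency losses, assuming the teacher and student outputs are already available as plain Python lists (in practice they come from F_t and F_s under weak and strong augmentation, respectively):

```python
import math

def consistency_loss_classification(teacher_probs, student_probs):
    """Eq. (9): confidence-weighted cross-entropy between the teacher's hard
    pseudo-label arg max(q_i) and the student's prediction q_hat_i."""
    loss = 0.0
    for q, q_hat in zip(teacher_probs, student_probs):
        confidence = max(q)               # max(q_i), the teacher's confidence
        pseudo = q.index(confidence)      # arg max(q_i), the hard pseudo-label
        loss += confidence * -math.log(q_hat[pseudo])
    return loss

def consistency_loss_regression(teacher_preds, student_preds):
    """Eq. (8): squared distance (MSE-style) between teacher and student
    outputs on the same unlabeled input under different augmentations."""
    return sum((t - s) ** 2 for t, s in zip(teacher_preds, student_preds))
```

Note that the classification loss is minimized when the student assigns high probability to the teacher's pseudo-label, and confident teacher predictions contribute more to the gradient.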
The complete list of data augmentations and their parameter settings is given in Section 4.2.

During training, we only update the weights of the student network while keeping the teacher network weights frozen. The weights are updated by learning an MLP (with one hidden layer) and a task-specific linear classifier/regressor on the output of the global average pooling layer, with the rest of the layer weights frozen for the student network. Fine-tuning only the last layers (i.e., the one-hidden-layer MLP and a linear classifier/regressor) of the student model improves the task-specific performance by using both labeled and unlabeled data in a task-specific way. This is because the effect of pretraining and most feature re-use happens in the lowest layers of the network, while fine-tuning the higher layers changes the representations so that they are well adapted to the downstream task. This observation was also shown to be consistent in a recent study by Raghu et al. (2019). After every epoch, we make the student the new teacher, F_t ← F_s, and iterate this process until the model converges. The pseudocode for our proposed consistency training is illustrated in Algorithm 1.
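The student's partial freezing can be sketched as a small helper that decides which parameter groups stay trainable; the layer-name prefixes below are hypothetical stand-ins for illustration, not the paper's identifiers:

```python
def student_freeze_plan(param_names, head_prefixes=("mlp.", "classifier.")):
    """Return which student parameters stay trainable: only the one-hidden-layer
    MLP and the linear classifier/regressor on top of the global-average-pooled
    features. Everything below (the backbone) is frozen."""
    return {name: any(name.startswith(p) for p in head_prefixes)
            for name in param_names}
```

In a deep learning framework, the same plan would typically be applied by setting `requires_grad` (or the equivalent flag) per parameter before building the optimizer.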
4. Experiments
We evaluate the efficacy of our method on one regression and two classification tasks on histopathology benchmark datasets, including BreastPathQ (Martel et al., 2019), Camelyon16 (Bejnordi et al., 2017), and Kather multiclass (Kather et al., 2019). Further, we also present extensive ablation experiments and compare against state-of-the-art SSL methods while varying the percentage of labeled data.

As baselines, we compare our SSL approach (i.e., RSP) with two other popular SSL methods, VAE (Kingma and Welling, 2013) and MoCo (He et al., 2020), as well as with a randomly initialized supervised baseline. To further evaluate our approach with task-specific consistency training, we fine-tune the same pretrained models a second time using different percentages of task-specific labeled data. In our experiments, we first initialize the teacher-student model with the fine-tuned SSL model trained on different percentages of labeled data: 10%, 25%, 50%, and 100% (depicted as "self-supervised pretraining and supervised fine-tuning" in Tables 3, 4, 5). Next, we train each of these fine-tuned models a second time using labeled and unlabeled samples, again varying the percentage of labeled data.

Algorithm 1: Consistency training pseudocode
Inputs: D_l^f = {x_i, y_i}_{i=1}^{N}; D_u^f = {x_i}_{i=1}^{µN}
  µ = ratio of unlabeled data
  λ = weighting factor for the consistency loss
  F_ft = fine-tuned model
  F_t = teacher model with parameters θ_t
  F_s = student model with parameters θ_s
  η_w = set of weak augmentations
  η_s = set of strong augmentations
  η = set of augmentations applied to labeled data

Initialize:
  F_t ← F_ft, with weights frozen across the entire network
  F_s ← F_ft, with weights frozen up to the output of the global average pooling layer; an MLP with one hidden layer + a linear classifier/regressor is trained on top

for t in [1, num_epochs] do
  for each minibatch B do
    z_i^S ← F_s(η(x_i)), for x_i ∈ B
    z_i^U ← F_t(η_w(x_i)), for x_i ∈ µB
    ẑ_i^U ← F_s(η_s(x_i)), for x_i ∈ µB
    q_i = p_t(y_i | η_w(x_i); θ_t)    // prediction computed by F_t
    q̂_i = p_s(y_i | η_s(x_i); θ_s)    // prediction computed by F_s
    L_s^classification = −(1/|B|) Σ_{i ∈ B} log z_i^S[y_i]    // supervised loss for classification
    L_s^regression = (1/|B|) Σ_{i ∈ B} || z_i^S − y_i ||²    // supervised loss for regression
    L_c^classification = (1/|µB|) Σ_{i ∈ µB} max(q_i) · H(arg max(q_i), q̂_i)    // consistency loss for classification
    L_c^regression = (1/|µB|) Σ_{i ∈ µB} || z_i^U − ẑ_i^U ||²    // consistency loss for regression
    loss ← L_s + λ L_c
    update θ_s using the optimizer
  end
  F_t ← F_s    // make the student the new teacher and go back to step 2
end
return θ_s

We report the final results of this second training stage (depicted as "consistency training (CR)" in Tables 3, 4, 5). Note: this experimental setting is kept standard across all three datasets for a fair evaluation.
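The outer loop of Algorithm 1 (pseudo-label, fit a student, promote it to teacher) can be sketched with the per-minibatch update abstracted away; `fit_student` below is a placeholder for the supervised-plus-consistency optimization step, not the paper's implementation:

```python
def teacher_student_rounds(init_model, fit_student, unlabeled, num_epochs):
    """Outer loop of Algorithm 1: the teacher pseudo-labels the unlabeled pool,
    a student is fit against those targets (plus the supervised loss, hidden
    inside fit_student), and the student becomes the new teacher each epoch."""
    teacher = init_model
    for _ in range(num_epochs):
        pseudo_labels = [teacher(x) for x in unlabeled]  # F_t on (weakly augmented) x
        student = fit_student(pseudo_labels)             # minimizes L_s + lambda * L_c
        teacher = student                                # F_t <- F_s
    return teacher
```

Because the teacher starts from the converged, task-specific fine-tuned model, the very first round of pseudo-labels is already low-entropy, which is the first of the two differences from prior consistency-training work noted above.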
The distribution of the number of WSIs and their corresponding patches across the three datasets used in our experiments is shown in Table 1. In this section, we briefly describe the three publicly available datasets, whereas the data-specific implementation details, such as the pretraining, fine-tuning, and test splits adopted in our experiments, are explained in their respective subsections.
BreastPathQ dataset:
This is a publicly available dataset consisting of 96 hematoxylin and eosin (H&E) stained WSIs of post-NAT-BRCA specimens (Martel et al., 2019; Peikari et al., 2017), scanned at 20× magnification (0.5 µm/px). A set of 2579 patches, each of dimension 512 × 512, was extracted from these WSIs.

Camelyon16 dataset:
We performed classification of breast cancer metastases at the image level on the dataset from the Camelyon16 challenge (Bejnordi et al., 2017). This dataset contains 399 H&E stained WSIs of breast lymph nodes, split into 270 for training and 129 for testing. The images were acquired from two different centers and scanned at 40× and 20× magnification.

Table 1: The total number of WSIs and patches used in each dataset to perform the experiments.
Dataset                         Pretrain              Fine-tune             Test
                                Train    Validation   Train    Validation
BreastPathQ        WSIs         69                    69                    25
                   Patches      10000    3000         2063     516          1121
Camelyon16         WSIs         60                    210                   129
                   Patches      62156    10000        306303   40000        –
Kather Multiclass  WSIs         –                     –                     –
                   Patches      –        –            80K      20K          7180
Kather multiclass dataset:
This dataset contains two subsets of patches covering nine tissue classes: adipose, background, debris, lymphocytes, mucus, smooth muscle, normal colon mucosa, cancer-associated stroma, and colorectal cancer epithelium (Kather et al., 2019). Of the two subsets, the training set consists of 100K image patches of H&E stained colorectal cancer images of size 224 × 224 at 0.5 µm/pixel spatial resolution. The second subset, used as the test set, contains 7180 image patches. In this dataset, only patches are made available, without access to the WSIs.

We perform all our experiments with ResNet-18 as the base feature embedding network, using the methods outlined in Section 3, on all three datasets. All experiments are performed on 4 NVIDIA Tesla V100 GPUs, and the entire framework is implemented in PyTorch. We first specify the implementation details common to all datasets; data-specific implementations are provided in Table 2.

For self-supervised pretraining: The model is trained for 250 epochs with a batch size of 64. We employ an (SGD with Nesterov momentum + Lookahead) optimizer (Zhang et al., 2019), with a momentum of 0.9, weight decay, and a constant learning rate of 0.01. For Lookahead, we set k = 5 and α = 0.5. The best pretrained model is chosen based on the lowest validation loss across the BreastPathQ, Camelyon16, and Kather datasets.

We use the domain-specific data augmentations recommended by Tellez et al. (2019a), including rotations, horizontal flips, scaling, additive Gaussian noise, brightness and contrast perturbations, shifts of the hue and saturation values in HSV color space, and perturbations in H&E color space. We also add random resized crops, blur, and affine transformations to this list. Specifically, we use a bounded rotation factor, a scaling factor between [0.8, 1.2], zero-mean additive Gaussian noise, an affine transformation with small translation, scale, and rotation limits, hue and saturation intensity ratios between [-0.1, 0.1] and [-1, 1], respectively, brightness and contrast intensity ratios between [-0.2, 0.2], blurring of the input image with a random-sized kernel in the range [3, 7], and random resizing and cropping of the image patch back to its original size. Finally, we perturb the intensity of the hematoxylin and eosin (HED) color channels with a factor of [-0.035, 0.035]. We apply these transformations in sequence, selecting them randomly in each mini-batch, to obtain a diverse set of training images.

For supervised fine-tuning: We fine-tune the entire pretrained SSL model (all layers), with a linear classifier or regressor trained on top of the learned representations, using limited labels (10%, 25%, 50%, and 100% of the labeled examples), to directly evaluate the performance of the RSP, VAE, and MoCo models. In particular, for RSP we entirely discard the last MLP with one hidden layer after pretraining and fine-tune a linear layer on top of the d = (256 × 3) = 768-dimensional embedding, followed by a softmax, to obtain task-specific predictions. For VAE and MoCo, we fine-tune a linear layer on the 512-d feature vector.

For fine-tuning, we use different sets of hyperparameters for the three datasets, provided in Table 2. Further, we include a simple set of augmentations (η, as depicted in Algorithm 1): rotation, scaling, and random resized crops. For scaling, we use a factor of [0.8, 1.2], and we randomly resize and crop the image patch back to its original size.

Table 2: List of hyperparameters used in our experiments across all three datasets.

Supervised fine-tuning          BreastPathQ             Camelyon16              Kather Multiclass
  Epochs                        90                      90                      90
  Batch size                    4                       16                      64
  Learning rate (lr)            0.0001                  0.0005                  0.00001
  Optimizer                     Adam                    SGD with Nesterov       Adam
                                                        momentum of 0.9
  Scheduler                     MultiStep, lr decayed at [30, 60] epochs by 0.1 (all datasets)
  Best-model selection          Lowest validation loss  Highest val. accuracy   Highest val. accuracy

Consistency training
  Epochs                        90                      90                      90
  Batch size                    4                       8                       8
  Ratio of unlabeled data (µ)   7                       7                       7
  Learning rate (lr)            0.0001                  0.0005                  0.00001
  Optimizer                     Adam                    SGD with Nesterov       Adam
                                                        momentum of 0.9
  Scheduler                     MultiStep, lr decayed at [30, 60] epochs by 0.1 (all datasets)
  Best-model selection          Lowest validation loss  Highest val. accuracy   Highest val. accuracy

For consistency training: We use a semi-supervised approach for consistency training, exploiting labeled and unlabeled examples in a task-specific manner. We adopt the same task-specific fine-tuned model as both the teacher and the student network, with the teacher network weights frozen (all layers), while training the student network's one-hidden-layer MLP and task-specific linear layer (classifier/regressor) on the output of the global average pooling layer (with the rest of the layer weights frozen). All hyperparameters related to consistency training are shown in Table 2. In our experiments, we initialize the teacher-student model with the fine-tuned SSL model trained on different percentages of labeled data (10%, 25%, 50%, and 100%). Next, we train each of these fine-tuned models a second time using labeled and unlabeled samples, again varying the percentage of labeled data, and report the final results.

In our work, we use two kinds of augmentations for consistency training: "weak" augmentations for the teacher network and "strong" augmentations for the student network. For the teacher network, we employ simple transformations, such as a horizontal flip and random cropping back to the original image size, as weak augmentations.
For the student network, we adopt a set of transformations similar to the pretraining stage, but with different hyperparameters that strengthen the augmentation severity; these are referred to as strong augmentations. The following augmentations differ in their parameters from the pretraining stage: an affine transformation with a translation limit of [0.01, 0.1], a scale limit of [0.51, 0.60], and a rotation of 90°; HSV intensity ratios between [-1, 1]; and blurring of the input image with a random-sized kernel in the range [5, 7]. We apply these augmentations sequentially, selecting them randomly in each mini-batch using the RandAugment technique (Cubuk et al., 2020). In our experiments, N_Aug denotes the number of augmentations applied sequentially in each mini-batch, and M_g is the magnitude, sampled within a pre-defined range [1, 10], that controls the severity of distortion in each mini-batch.

In this experiment, we train our approach to automatically quantify tumor cellularity (TC) scores in digitized slides of breast cancer images for tumor burden assessment. The TC score is defined as the percentage of the total area occupied by malignant tumor cells in a given image patch (Peikari et al., 2017). For pretraining the SSL approach, we use the 69 WSIs of the training set, from which we randomly extract patches of size 256 × 256 at 20×, 10×, and 5× magnification for RSP, while for VAE and MoCo, patches are extracted at 20× magnification. We perform fine-tuning by resizing the image patches to 256 × 256. Performance is measured by the intra-class correlation coefficient (ICC) between the proposed methods and the two pathologists A and B.

Table 3 presents the ICC values for the different methodologies, and the corresponding TC scores produced by each method on sample WSIs of the BreastPathQ test set are shown in Fig. A.3 (in the Appendix).
The consistency training (CR) improved the results of the self-supervised pretrained models (VAE, MoCo, and RSP) by a 3% increase in ICC values. Further, all SSL and CR methods exhibit strong performance, close to or even outperforming the supervised baseline (random) on all training subsets. Among all methods, RSP + CR achieves the best score of greater than 0.90, which even surpasses the intra-rater agreement score of 0.89 (Akbar et al., 2019). Moreover, our TC score of 0.90 on the BreastPathQ test set is superior to state-of-the-art (SOTA) methods (Akbar et al., 2019; Rakhlin et al., 2019), whose maximum score is 0.883. Specifically, our RSP + CR approach achieves at least a 4% higher ICC value than VAE + CR and MoCo + CR, and at least a 17% improvement in ICC value over the supervised baseline, when trained on the 10% labeled set (≈206 image patches). In contrast, on the complete training set, all CR methods exhibit competitive and similar performance. This indicates that consistency training improves upon self-supervised pretraining predominantly in the low-data regime.

Table 3: Results on the BreastPathQ dataset. Predicting the percentage of tumor cellularity (TC) at the patch level (intra-class correlation (ICC) coefficients with respect to the two pathologists A and B). The 95% confidence intervals (CI) are shown in square brackets. We bold the best results. Training subsets: 10% (206 labels), 25% (516 labels), 50% (1031 labels), 100% (2063 labels).

Self-supervised pretraining + Supervised fine-tuning:
- Random | 10%: A 0.697 [0.67, 0.73], B 0.637 [0.60, 0.67] | 25%: A 0.786 [0.76, 0.81], B 0.727 [0.70, 0.75] | 50%: A 0.812 [0.79, 0.83], B 0.797 [0.77, 0.82] | 100%: A 0.863 [0.85, 0.88], B 0.843 [0.83, 0.86]
- VAE
- MoCo
- RSP (ours)

Consistency training (CR):
- Random + CR | 10%: A 0.658 [0.62, 0.69], B 0.630 [0.59, 0.66] | 25%: A 0.818 [0.80, 0.84], B 0.802 [0.78, 0.82] | 50%: A 0.847 [0.83, 0.86], B 0.839 [0.82, 0.86] | 100%: A 0.891 [0.88, 0.90], B 0.891 [0.88, 0.90]
- VAE + CR | 10%: A 0.771 [0.75, 0.79], B 0.727 [0.70, 0.75] | 25%: A 0.842 [0.82, 0.86], B 0.826 [0.81, 0.84] | 50%: A 0.866 [0.85, 0.88], B 0.857 [0.84, 0.87] | 100%: A 0.884 [0.87, 0.90], B 0.864 [0.85, 0.88]
- MoCo + CR | 10%: A 0.808 [0.79, 0.83], B 0.803 [0.78, 0.82] | 25%: A 0.872 [0.86, 0.89]
- RSP + CR (ours) | 10%: A 0.876 [0.86, 0.89], B 0.846 [0.83, 0.86] | 25%: A 0.873 [0.86, 0.89]

This experiment is a slide-level binary classification task: identifying the presence of lymph node metastasis in WSIs using only slide-level labels. To experiment with limited annotations, we first perform self-supervised pretraining on 60 WSIs (35 normal and 25 tumor) that are set aside from the original training set. For pretraining, we randomly extract patches of size 256 × 256 at 40×, 20×, and 10× magnification for RSP, while for VAE and MoCo, patches are extracted at 40× magnification. Further, the downstream fine-tuning is performed on randomly extracted patches of size 256 × 256, with 306.3K patches for training and 40K patches (20K tumor + 20K normal) for validation. We finally evaluate the methods on the 129 WSIs of the test set (as shown in Table 1). We divide the fine-tuning set containing 306.3K patches into four incremental subsets of [10%, 25%, 50%, 100%], containing [30.6K, 76.5K, 153.1K, 306.3K] image patches, respectively.

We follow the same post-processing steps as Wang et al. (2016) to obtain slide-level predictions. We first train our proposed models to discriminate tumor vs. normal patches at the patch level. We then aggregate these patch-level predictions to create a heat map of tumor probability over the slide. Next, we extract several features, similar to Wang et al. (2016), from the heat map and train a slide-level support vector machine (SVM) classifier to make the slide-level prediction. We compare and evaluate all three SSL pretrained and CR methods against the corresponding supervised baseline. Performance is evaluated in terms of the area under the receiver operating characteristic curve (AUC) on the test set containing 129 WSIs. In addition, we evaluate the binary classification performance (accuracy (Acc)) on the patch-level data containing 40K patches (20K tumor + 20K normal) of the validation set. Further, we perform statistical significance tests by comparing pairs of AUCs between the consistency training and SSL methods using the two-tailed DeLong's test (Sun and Xu, 2014). All differences in AUC with a p-value < 0.05 were considered significant.

Table 4 presents the AUC scores for predicting slide-level tumor metastasis using the different methodologies. In the 10% label regime, the RSP and MoCo methods outperform the supervised baseline, whereas the performance of VAE decreases significantly compared to the other methods. Further, the RSP + CR approach significantly outperforms RSP by a margin of 2% on the 10% and 25% labeled sets. The proposed RSP + CR achieves a best score of 0.917 using the 25% labeled set (≈76K patches), compared to the winning method of the Camelyon16 challenge (Wang et al., 2016), which obtained an AUC of 0.925 using a fully supervised model trained on millions of image patches. Compared with the unsupervised representation learning methods proposed by Tellez et al. (2019b), our RSP + CR approach trained on 10% of the labels (≈30K patches) outperforms their top-performing BiGAN method, trained on 50K labeled samples, by a 13% higher AUC. Additionally, we evaluated the methods' performance on the validation set containing 40K patches (20K tumor + 20K normal), where MoCo (+ CR) outperformed the RSP and RSP + CR methods by a slight margin of 0.5% Acc on all percentages of the training subsets.

Table 4: Results on the Camelyon16 dataset. Predicting the presence of tumor metastasis at the WSI level (AUC) and patch-level classification performance (accuracy (Acc)). The DeLong method (Sun and Xu, 2014) was used to construct the 95% CIs, shown in square brackets. The best scores are shown in bold. Note: the patch-level accuracy is reported on the 40K patches of the validation set. Label counts: 10% (30,630 / 4000 labels), 25% (76,576 / 10,000), 50% (153,151 / 20,000), 100% (306,303 / 40,000).

Self-supervised pretraining + Supervised fine-tuning:
- Random | 10%: AUC 0.804 [0.72, 0.89]
- VAE | 10%: AUC 0.737 [0.64, 0.83], Acc 0.827 | 25%: AUC 0.814 [0.73, 0.89], Acc 0.864 | 50%: AUC 0.830 [0.75, 0.91], Acc 0.906 | 100%: AUC 0.818 [0.73, 0.90], Acc 0.907
- MoCo
- RSP (ours)

Consistency training (CR):
- Random + CR | 10%: AUC 0.659 [0.54, 0.77]
- VAE + CR | 10%: AUC 0.633 [0.55, 0.72], Acc 0.828 | 25%: AUC 0.719 [0.63, 0.81], Acc 0.863 | 50%: AUC 0.741 [0.64, 0.84], Acc 0.918 | 100%: AUC 0.779 [0.69, 0.87], Acc 0.928
- MoCo + CR | 10%: AUC 0.728 [0.63, 0.82], Acc 0.835 | 25%: AUC 0.742 [0.64, 0.84], Acc 0.902 | 50%: AUC 0.766 [0.67, 0.86], Acc 0.929 | 100%: AUC 0.825 [0.75, 0.90], Acc 0.946
- RSP + CR (ours) | 10%: AUC 0.855 [0.78, 0.92]

Most importantly, our experiments on the Camelyon16 dataset yield several insights into the generality of our approach in low- and high-label training scenarios. In the low-label regime, i.e., the patch-wise classification task on the validation set, with training labels ranging from 4K to 40K, we observe that adding consistency training improved the SSL model performance by up to a 2% increase in Acc. The AUCs of the consistency-trained models are statistically higher than those of the SSL pretrained models, with p-value < 0.02, across the 10% and 25% labeled sets. As we increase the number of labeled samples (50% to 100%), adding consistency training to the Random, VAE, and MoCo SSL pretrained models results in a noticeable drop in AUC values. The results for the RSP model still improved after consistency training in the high-label regime, but these differences were not statistically significant. Thus, in general, our approach works well in a limited-annotation setting, which is highly beneficial in the histopathology domain.

Further, we observe that the benefit of pretraining slightly diminishes as the amount of labeled data increases (from 10% (30K) to 100% (306K) labels), which deteriorates the value of the pretrained features and is consistent with the recent study by Zoph et al. (2020). Overall, our consistency training approach continues to improve task-specific performance only when trained with low-label data, and it is additive to pretraining.

Fig. A.4 (in the Appendix) highlights the tumor probability heat maps produced by the different methodologies. Visually, all self-supervised pretrained methods (VAE, MoCo, and RSP) focus on tumor areas with high probability, while the supervised baseline exhibits slightly lower probability values for the same tumor regions. We observe that most methods successfully identify the macro-metastases (Rows 1-3), with tumor diameters larger than 2 mm, in excellent agreement with the ground-truth annotations. However, the same methods struggle to precisely identify the micro-metastases (Row 4), with tumor diameters smaller than 2 mm, which is generally challenging even for fully supervised models.

Due to the unavailability of WSIs for this dataset, we could not perform self-supervised pretraining on it.
Instead, we used the SSL model pretrained on Camelyon16 to fine-tune and evaluate the patch-level performance, testing feature transferability between datasets with different tissue types/organs and resolution protocols. In our experiments, the downstream fine-tuning is performed on the 100K image patches of the training set, and testing on the 7180 images of the test set, with the patches resized to 256 × 256.

Table 5 presents the accuracy (Acc) and weighted F1 score for the classification of the nine colorectal tissue classes using the different methodologies. On this dataset, the MoCo + CR approach obtains a new state-of-the-art result, with an Acc of 0.990, a weighted F1 score of 0.953, and a macro AUC of 0.997, compared to the previous method (Kather et al., 2019), which obtained an Acc of 0.943. This underscores that our pretrained approaches generalize to unseen domains with different organs, tissue types, staining, and resolution protocols. All consistency-trained methods marginally outperform the SSL pretrained models on all labeled subsets. Further, the CR methods (RSP + CR, MoCo + CR, VAE + CR) outperform the supervised baseline by a 3% and 17% increase in Acc and F1 score, respectively. Compared to previous representation learning methods (Pati et al., 2020), our approach obtains a 3% improvement in Acc when trained on just 10% of the labels, relative to the previous method (Acc of 0.951) trained using 100% of the labels. Thus, in general, our approach can be applied to other domain-adaptation problems in histopathology, where target annotations are often limited or sometimes unavailable.
5. Ablation studies
In this section, we perform ablation experiments to study the importance of three components of our method: (i) the ratio of unlabeled data; (ii) the impact of strong augmentations on the student network; and (iii) the convergence behavior of consistency training. Due to time constraints, we perform these ablation studies with 10% labeled data on the BreastPathQ and Camelyon16 datasets. We exclude the Kather Multiclass dataset, as it was used to evaluate feature transferability between datasets, making it less suitable for this study.
The success of consistency training is mainly attributed to the amount of unlabeled data. From Table 6, we observe marginal to noticeable improvements in performance as we increase the ratio of unlabeled to labeled batch size (µ). This is consistent with the recent studies of Xie et al. (2019) and Sohn et al. (2020). For each fold increase in the ratio between unlabeled and labeled samples, the performance improves noticeably on Camelyon16, whereas the gain on BreastPathQ is rather small, since its number of training samples (2063 patches) is substantially lower than that of Camelyon16 (306K patches). On the other hand, increasing the ratio of unlabeled data while fine-tuning the pretrained model tends to converge faster than training the model from scratch. In essence, a large amount of unlabeled data is beneficial for obtaining better performance during consistency training.

The success of teacher-student consistency training depends crucially on the strong augmentation policies applied to the student network. Table 7 analyzes the impact of these augmentation policies on the final performance. In our experiments, we apply the augmentations sequentially, selecting them randomly in each mini-batch using the RandAugment (Cubuk et al., 2020) technique. We vary the total number of augmentations (N_Aug) from 1 to 7 and examine the effect of the strong augmentation policies (applied to the student network) during consistency training. From Table 7, we observe that as we gradually increase the severity of the augmentation policies in the student model, there are marginal to noticeable performance gains. This improvement is mainly visible when training on large amounts of unlabeled data (such as Camelyon16), where there is at least a 3% improvement in AUC as we increase the augmentation strength.
This suggests that adding strong augmentations to the student network is essential to prevent the model from merely replicating the teacher's knowledge and to gain further improvements in task-specific performance.
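The two ablated components above can be sketched in a few lines: each mini-batch contains N labeled and µN unlabeled samples (Table 6), and the student's unlabeled inputs are perturbed by N_Aug randomly chosen strong augmentations (Table 7). The snippet below is a minimal illustration of this sampling logic only; the augmentation names and helper functions are our own placeholders, not the exact RandAugment operations or training code used in the paper.

```python
import random

# Illustrative pool of strong augmentation operations. The paper draws from
# the RandAugment search space (Cubuk et al., 2020); these names are placeholders.
AUGMENTATION_POOL = [
    "autocontrast", "equalize", "rotate", "solarize",
    "color", "contrast", "brightness", "sharpness",
]

def sample_strong_policy(n_aug, rng=random):
    """Pick n_aug transformations uniformly at random (with replacement);
    they are then applied sequentially to the student's unlabeled input."""
    return [rng.choice(AUGMENTATION_POOL) for _ in range(n_aug)]

def minibatch_sizes(n_labeled, mu):
    """Mini-batch composition for consistency training: N labeled samples
    plus mu * N unlabeled samples (the ratio ablated in Table 6)."""
    return n_labeled, mu * n_labeled
```

For example, with 16 labeled samples per batch and µ = 5, each training step would see 16 labeled and 80 unlabeled patches; increasing µ or n_aug reproduces the trends discussed above.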
Table 5: Results on the Kather Multiclass dataset. Classification of nine tissue types at patch level (accuracy (Acc), weighted F1 score (F1)). This experiment is performed to assess the generalizability of pretrained features between different tissue types and resolutions. Pretraining is performed on Camelyon16 (Breast) and tested on Kather Multiclass (Colon). Best results in bold.

  % Training Data:       10% (8000 labels)   25% (20000 labels)   50% (40000 labels)   100% (80000 labels)
  Methods                Acc     F1          Acc     F1           Acc     F1           Acc     F1
  Self-supervised pretraining + Supervised fine-tuning
  Random                 0.972   0.873       0.974   0.885        0.979   0.905        0.983   0.920
  VAE                    0.963   0.835       0.972   0.885        0.980   0.908
  MoCo
  RSP (ours) + CR        0.938   0.670       0.943   0.735        0.941   0.723        0.939   0.707
  VAE + CR               0.972   0.876       0.979   0.906        0.978   0.903        0.982   0.915
  MoCo + CR
  RSP + CR (ours)

Table 6: Impact of the ratio of unlabeled data (µ). These experiments are performed with N = 10% labeled and µN unlabeled samples in each mini-batch. Note: the intra-class correlation (ICC) coefficient is evaluated against two pathologists, A and B, for BreastPathQ.

  Ratio of unlabeled     BreastPathQ              Camelyon16
  data (µ)               ICC (P_A)   ICC (P_B)    AUC     Acc
  1                      0.871       0.851        0.738   0.904
  2                      0.871       0.851        0.785   0.903
  3                      0.876       0.846        0.797   0.907
  4                      0.876       0.856        0.803   0.911
  5                      0.880       0.861        0.810   0.914
  6

Table 7: Impact of the strong augmentation policies applied to the student network. The transformations are applied sequentially by randomly selecting them in each mini-batch.

  No. of possible             BreastPathQ              Camelyon16
  transformations (N_Aug)     ICC (P_A)   ICC (P_B)    AUC     Acc
  1                           0.883       0.863        0.569   0.895
  2

6. Discussions

With the advancements in deep learning techniques, current histopathology image analysis methods have shown excellent human-level performance on various tasks such as tumor detection (Campanella et al., 2019), cancer grading (Bulten et al., 2020), and survival prediction (Wulczyn et al., 2020). However, to achieve these satisfactory results, these methods require a large amount of labeled data for training. Acquiring such massive annotations is laborious and tedious in practice. Thus, there is great potential to explore self-/semi-supervised approaches that can alleviate the annotation burden by effectively exploiting unlabeled data. In this spirit, we propose a self-supervised driven consistency training method for histology image analysis that leverages unlabeled data in both a task-agnostic and a task-specific manner. We first formulate self-supervised pretraining as a resolution sequence prediction task that learns meaningful visual representations across multiple resolutions in a WSI. Next, teacher-student consistency training is employed to improve task-specific performance based on prediction consistency with the unlabeled data. Our method is validated on three public histology datasets, i.e., BreastPathQ, Camelyon16, and Kather Multiclass, on which it consistently outperforms other self-supervised methods as well as the supervised baseline under a limited-label regime. Our method has also shown its efficacy in transferring pretrained features across datasets with different tissue types/organs and resolution protocols.

Despite the excellent performance of our method, there is one main limitation: if the pseudo labels produced by the teacher network are inaccurate, the student network is forced to learn from incorrect labels, leading to confirmation bias (Arazo et al., 2020). As a result, the student may not become better than the teacher during consistency training. We addressed this issue with RandAugment (Cubuk et al., 2020), a strong data augmentation technique, which we combine with label smoothing (soft pseudo labels). This is consistent with the recent study (Arazo et al., 2020) showing that soft pseudo labels outperform hard pseudo labels when dealing with label noise. However, the bias issue still persists with soft pseudo labels in our application. This is prominently visible in our method, where, compared to self-supervised pretraining (see Fig. A.4, columns (c)-(f); Fig. A.3, columns (b)-(e)), the consistency-trained approaches (see Fig. A.4, columns (g)-(j); Fig. A.3, columns (f)-(i)) exhibit some low-probability (< 0.5) spurious pixels outside the malignant cell boundaries. This happens because of the naive pseudo labeling produced by the teacher network, which sometimes overfits to incorrect pseudo labels. The issue is further reinforced when the student network is trained on unlabeled samples with incorrect pseudo labels. One way to mitigate this is to make the teacher network constantly adapt to feedback from the student model, instead of keeping the teacher fixed. This has been shown to work well in the recent meta pseudo labels technique (Pham et al., 2020), where teacher and student are trained in parallel and the teacher learns from a reward signal based on the student's performance on a labeled set. Exploring this idea is beyond the scope of this work, and we leave it to future exploration.

In general, our proposed self-supervised driven consistency training framework has great potential for both classification and regression tasks in computational histopathology, where annotation scarcity is a significant issue. Further, our pretrained representations are generic and can be easily extended to other downstream tasks, such as segmentation and survival prediction. It is worth investigating further to develop a universal feature encoder in histopathology that can solve many tasks without the need for excessive labeled annotations.
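To make the soft-pseudo-label mechanism discussed above concrete, the sketch below shows one plausible reading of the consistency objective: the teacher's softmax output is smoothed toward a uniform distribution, and the student is trained with cross-entropy against this soft target on a strongly augmented view of the same unlabeled patch. The smoothing coefficient and the exact loss form are illustrative assumptions, not the paper's verbatim formulation.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def smooth(probs, eps=0.1):
    """Label smoothing: mix the teacher's soft pseudo label with a uniform
    distribution, which softens over-confident (possibly wrong) targets."""
    k = len(probs)
    return [(1.0 - eps) * p + eps / k for p in probs]

def consistency_loss(teacher_logits, student_logits, eps=0.1):
    """Cross-entropy between the smoothed teacher pseudo label and the
    student's prediction on a strongly augmented view of the same patch."""
    target = smooth(softmax(teacher_logits), eps)
    pred = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target, pred))
```

When the teacher is confidently wrong on an unlabeled sample, smoothing bounds how hard the student is pushed toward the wrong class; this is the mechanism invoked above, which mitigates but, as noted, does not eliminate confirmation bias.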
7. Conclusion
In this paper, we present an annotation-efficient framework by introducing a novel self-supervised driven consistency training paradigm for histopathology image analysis. The proposed framework utilizes unlabeled data in both a task-agnostic and a task-specific manner to significantly advance the accuracy and robustness of state-of-the-art self-supervised learning (SSL) methods. To this end, we first propose a novel task-agnostic self-supervised pretext task that efficiently harnesses the multi-resolution contextual cues present in histology whole-slide images. We further develop a task-specific teacher-student semi-supervised consistency method to effectively distill the SSL pretrained representations to downstream tasks. This synergistic use of unlabeled data has been shown to improve SSL pretrained performance, over the supervised baseline, under a limited-label regime. Extensive experiments on three public benchmark datasets, across two classification and one regression based histopathology tasks, i.e., tumor metastasis detection, tissue type classification, and tumor cellularity quantification, demonstrate the effectiveness of our proposed approach. Our experiments also show that our method significantly outperforms, or is at least comparable to, the supervised baseline when trained under limited annotation settings. Furthermore, our approach is generic and has been shown to generate universal pretrained representations that can be easily adapted to other histopathology tasks, and to other domains, without any modifications.

Conflict of interest
ALM is co-founder and CSO of Pathcore. CS, SK, and FC have no financial or non-financial conflicts of interest.
Acknowledgment
This research is funded by: Canadian Cancer Society (grant number 705772); National Cancer Institute of the National Institutes of Health (grant number U24CA199374-01); Canadian Institutes of Health Research.
Appendix A. Supplementary material

• Fig. A.3: Tumor cellularity scores produced on WSIs of the BreastPathQ test set for 10% labeled data.

• Fig. A.4: Tumor probability heat-maps overlaid on original WSIs from the Camelyon16 test set, predicted from 10% labeled data.
References
Akbar, S., Peikari, M., Salama, S., Panah, A.Y., Nofech-Mozes, S., Martel, A.L., 2019. Automated and manual quantification of tumour cellularity in digital slides for tumour burden assessment. Scientific Reports 9, 1–9.

Figure A.3: TC scores produced on WSIs of the BreastPathQ test set for 10% labeled data. (a) Original WSI overlaid with ground truth mask (annotation labels with pink square boxes denote 0% cellularity and green square boxes indicate 100% cellularity); (b)–(e) correspond to the TC score produced by the random (supervised), VAE, MoCo, and RSP approaches, respectively; (f)–(i) correspond to the TC score produced by the random + CR (supervised), VAE + CR, MoCo + CR, and RSP + CR methods, respectively. Blue denotes healthy (0% TC) and red denotes malignant (100% TC).

Figure A.4: Tumor probability heat-maps overlaid on original WSIs from the Camelyon16 test set, predicted from 10% labeled data. (a) Original WSI; (b) ground truth annotation mask; (c)–(f) correspond to the tumor probability produced by the random (supervised), VAE, MoCo, and RSP approaches, respectively; (g)–(j) correspond to the tumor probability produced by the random + CR (supervised), VAE + CR, MoCo + CR, and RSP + CR methods, respectively. The first three rows correspond to examples of macro-metastases (tumor cell cluster diameter ≥ 2 mm), while the last row corresponds to micro-metastases (tumor cell cluster diameter from > 0.2 mm to < 2 mm). Blue denotes healthy regions and red denotes tumor regions.

Arazo, E., Ortego, D., Albert, P., O'Connor, N.E., McGuinness, K., 2020. Pseudo-labeling and confirmation bias in deep semi-supervised learning, in: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.

Aviles-Rivero, A.I., Papadakis, N., Li, R., Sellars, P., Fan, Q., Tan, R.T., Schönlieb, C.B., 2019. GraphX-net: Chest X-ray classification under extreme minimal supervision. arXiv preprint arXiv:1907.10085.

Bai, W., Chen, C., Tarroni, G., Duan, J., Guitton, F., Petersen, S.E., Guo, Y., Matthews, P.M., Rueckert, D., 2019. Self-supervised learning for cardiac MR image segmentation by anatomical position prediction, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 541–549.

Bejnordi, B.E., Veta, M., Van Diest, P.J., Van Ginneken, B., Karssemeijer, N., Litjens, G., Van Der Laak, J.A., Hermsen, M., Manson, Q.F., Balkenhol, M., et al., 2017. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318, 2199–2210.

Bera, K., Schalper, K.A., Rimm, D.L., Velcheti, V., Madabhushi, A., 2019. Artificial intelligence in digital pathology—new tools for diagnosis and precision oncology. Nature Reviews Clinical Oncology 16, 703–715.

Blendowski, M., Nickisch, H., Heinrich, M.P., 2019. How to learn from unlabeled volume data: self-supervised 3D context feature learning, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 649–657.

Brock, A., Donahue, J., Simonyan, K., 2018. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.

Bulten, W., Pinckaers, H., van Boven, H., Vink, R., de Bel, T., van Ginneken, B., van der Laak, J., Hulsbergen-van de Kaa, C., Litjens, G., 2020. Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study. The Lancet Oncology 21, 233–241.

Campanella, G., Hanna, M.G., Geneslaw, L., Miraflor, A., Silva, V.W.K., Busam, K.J., Brogi, E., Reuter, V.E., Klimstra, D.S., Fuchs, T.J., 2019. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine 25, 1301–1309.

Chaitanya, K., Erdil, E., Karani, N., Konukoglu, E., 2020. Contrastive learning of global and local features for medical image segmentation with limited annotations. arXiv preprint arXiv:2006.10511.

Chapelle, O., Schölkopf, B., Zien, A., 2010. Semi-Supervised Learning. 1st ed., The MIT Press.

Chen, L., Bentley, P., Mori, K., Misawa, K., Fujiwara, M., Rueckert, D., 2019. Self-supervised learning for medical image analysis using image context restoration. Medical Image Analysis 58, 101539.

Chen, T., Kornblith, S., Norouzi, M., Hinton, G., 2020. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.

Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V., 2020. RandAugment: Practical automated data augmentation with a reduced search space, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703.

Diaz-Pinto, A., Colomer, A., Naranjo, V., Morales, S., Xu, Y., Frangi, A.F., 2019. Retinal image synthesis and semi-supervised learning for glaucoma assessment. IEEE Transactions on Medical Imaging 38, 2211–2218.

Donahue, J., Krähenbühl, P., Darrell, T., 2016. Adversarial feature learning. arXiv preprint arXiv:1605.09782.

Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., Courville, A., 2016. Adversarially learned inference. arXiv preprint arXiv:1606.00704.

French, G., Mackiewicz, M., Fisher, M., 2017. Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208.

Goyal, P., Mahajan, D., Gupta, A., Misra, I., 2019. Scaling and benchmarking self-supervised visual representation learning, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 6391–6400.

He, K., Fan, H., Wu, Y., Xie, S., Girshick, R., 2020. Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9729–9738.

Hinton, G., Vinyals, O., Dean, J., 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q., 2016. Deep networks with stochastic depth, in: European Conference on Computer Vision, pp. 646–661.

Javed, S., Mahmood, A., Fraz, M.M., Koohbanani, N.A., Benes, K., Tsang, Y.W., Hewitt, K., Epstein, D., Snead, D., Rajpoot, N., 2020. Cellular community detection for tissue phenotyping in colorectal cancer histology images. Medical Image Analysis, 101696.

Jing, L., Tian, Y., 2020. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.

Kather, J.N., Krisam, J., Charoentong, P., Luedde, T., Herpel, E., Weis, C.A., Gaiser, T., Marx, A., Valous, N.A., Ferber, D., et al., 2019. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS Medicine 16, e1002730.

Kim, J., Hur, Y., Park, S., Yang, E., Hwang, S.J., Shin, J., 2020. Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning. arXiv preprint arXiv:2007.08844.

Kingma, D.P., Welling, M., 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Koch, G., Zemel, R., Salakhutdinov, R., 2015. Siamese neural networks for one-shot image recognition, in: ICML Deep Learning Workshop, pp. 1–8.

Laine, S., Aila, T., 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.

Lee, D.H., 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, in: Workshop on Challenges in Representation Learning, ICML, pp. 1–6.

Li, K., Wang, S., Yu, L., Heng, P.A., 2020a. Dual-teacher: Integrating intra-domain and inter-domain teachers for annotation-efficient cardiac segmentation. arXiv preprint arXiv:2007.06279.

Li, X., Jia, M., Islam, M.T., Yu, L., Xing, L., 2020b. Self-supervised feature learning via exploiting multi-modal data for retinal disease diagnosis. IEEE Transactions on Medical Imaging.

Li, X., Yu, L., Chen, H., Fu, C.W., Xing, L., Heng, P.A., 2020c. Transformation-consistent self-ensembling model for semi-supervised medical image segmentation. IEEE Transactions on Neural Networks and Learning Systems, 1–12.

Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88.

Liu, Q., Yu, L., Luo, L., Dou, Q., Heng, P.A., 2020. Semi-supervised medical image classification with relation-driven self-ensembling model. IEEE Transactions on Medical Imaging.

Lu, M.Y., Chen, R.J., Wang, J., Dillon, D., Mahmood, F., 2019. Semi-supervised histology classification using deep multiple instance learning and contrastive predictive coding. arXiv preprint arXiv:1910.10825.

Madabhushi, A., Lee, G., 2016. Image analysis and machine learning in digital pathology: Challenges and opportunities. Medical Image Analysis 33, 170–175.

Martel, A.L., Nofech-Mozes, S., Salama, S., Akbar, S., Peikari, M., 2019. Assessment of residual breast cancer cellularity after neoadjuvant chemotherapy using digital pathology [data set]. https://doi.org//TCIA.2019.4YIBTJNO.

Miyato, T., Maeda, S.i., Koyama, M., Ishii, S., 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 1979–1993.

Oord, A.v.d., Li, Y., Vinyals, O., 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Pati, P., Foncubierta-Rodríguez, A., Goksel, O., Gabrani, M., 2020. Reducing annotation effort in digital pathology: A co-representation learning framework for classification tasks. Medical Image Analysis 67, 101859.

Peikari, M., Salama, S., Nofech-Mozes, S., Martel, A.L., 2017. Automatic cellularity assessment from post-treated breast surgical specimens. Cytometry Part A 91, 1078–1087.

Pham, H., Xie, Q., Dai, Z., Le, Q.V., 2020. Meta pseudo labels. arXiv preprint arXiv:2003.10580.

Quiros, A.C., Murray-Smith, R., Yuan, K., 2019. PathologyGAN: learning deep representations of cancer tissue. arXiv preprint arXiv:1907.02644.

Raghu, M., Zhang, C., Kleinberg, J., Bengio, S., 2019. Transfusion: Understanding transfer learning for medical imaging, in: Advances in Neural Information Processing Systems, pp. 3347–3357.

Rakhlin, A., Tiulpin, A., Shvets, A.A., Kalinin, A.A., Iglovikov, V.I., Nikolenko, S., 2019. Breast tumor cellularity assessment using deep neural networks, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0.

Rebuffi, S.A., Ehrhardt, S., Han, K., Vedaldi, A., Zisserman, A., 2020. Semi-supervised learning with scarce annotations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 762–763.

Sajjadi, M., Javanmardi, M., Tasdizen, T., 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning, in: Advances in Neural Information Processing Systems, pp. 1163–1171.

Shi, X., Su, H., Xing, F., Liang, Y., Qu, G., Yang, L., 2020. Graph temporal ensembling based semi-supervised convolutional neural network with noisy labels for histopathology image analysis. Medical Image Analysis 60, 101624.

Sohn, K., Berthelot, D., Li, C.L., Zhang, Z., Carlini, N., Cubuk, E.D., Kurakin, A., Zhang, H., Raffel, C., 2020. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.

Spitzer, H., Kiwitz, K., Amunts, K., Harmeling, S., Dickscheid, T., 2018. Improving cytoarchitectonic segmentation of human brain areas with self-supervised Siamese networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 663–671.

Srinidhi, C.L., Ciga, O., Martel, A.L., 2021. Deep neural network models for computational histopathology: A survey. Medical Image Analysis 67, 101813.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1929–1958.

Su, H., Shi, X., Cai, J., Yang, L., 2019. Local and global consistency regularized mean teacher for semi-supervised nuclei classification, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 559–567.

Sun, X., Xu, W., 2014. Fast implementation of DeLong's algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Processing Letters 21, 1389–1393.

Tarvainen, A., Valpola, H., 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, in: Advances in Neural Information Processing Systems, pp. 1195–1204.

Tellez, D., Litjens, G., Bándi, P., Bulten, W., Bokhorst, J.M., Ciompi, F., van der Laak, J., 2019a. Quantifying the eff