Self-supervised driven consistency training for annotation efficient histopathology image analysis
Chetan L. Srinidhi a,b,∗, Seung Wook Kim c, Fu-Der Chen d, Anne L. Martel a,b

a Physical Sciences, Sunnybrook Research Institute, Toronto, Canada
b Department of Medical Biophysics, University of Toronto, Canada
c Department of Computer Science, University of Toronto, Canada
d Department of Electrical & Computer Engineering, University of Toronto, Canada
Abstract
Training a neural network with a large labeled dataset is still a dominant paradigm in computational histopathology. However, obtaining such exhaustive manual annotations is often expensive, laborious, and prone to inter- and intra-observer variability. While recent self-supervised and semi-supervised methods can alleviate this need by learning unsupervised feature representations, they still struggle to generalize well to downstream tasks when the number of labeled instances is small. In this work, we overcome this challenge by leveraging both task-agnostic and task-specific unlabeled data based on two novel strategies: i) a self-supervised pretext task that harnesses the underlying multi-resolution contextual cues in histology whole-slide images to learn a powerful supervisory signal for unsupervised representation learning; ii) a new teacher-student semi-supervised consistency paradigm that learns to effectively transfer the pretrained representations to downstream tasks based on prediction consistency with the task-specific unlabeled data.

We carry out extensive validation experiments on three histopathology benchmark datasets across two classification and one regression based tasks, i.e., tumor metastasis detection, tissue type classification, and tumor cellularity quantification. Under limited-label data, the proposed method yields tangible improvements, coming close to or even outperforming other state-of-the-art self-supervised and supervised baselines. Furthermore, we empirically show that the idea of bootstrapping the self-supervised pretrained features is an effective way to improve the task-specific semi-supervised learning on standard benchmarks. Code and pretrained models will be made available at: https://github.com/srinidhiPY/SSL_CR_Histo

Keywords: Self-supervised learning, Consistency training, Semi-supervised learning, Limited annotations, Histology image analysis, Digital pathology.
1. Introduction
∗ Corresponding author. E-mail address: [email protected] (Chetan L. Srinidhi)

Deep neural network models have achieved tremendous success in obtaining state-of-the-art performance on various histology image analysis tasks ranging from disease grading and cancer classification to outcome prediction (Srinidhi et al., 2021; Bera et al., 2019; Litjens et al., 2017). The main success of these methods is attributed to the availability of large-scale open datasets with clean manual annotations. However, collecting such a large corpus of labeled data is often expensive, laborious and requires skillful domain expertise, notably in the histopathology domain (Madabhushi and Lee, 2016). Recently, self-supervised and semi-supervised approaches have become increasingly popular to alleviate the annotation burden by leveraging the readily available unlabeled data that can be trained with limited supervision. These methods have recently demonstrated promising results on various computer vision (Jing and Tian, 2020; Laine and Aila, 2016; Sohn et al., 2020) and medical image analysis tasks (Chen et al., 2019; Tellez et al., 2019b; Li et al., 2020c). In this paper, we focus on the self-supervised driven semi-supervised learning paradigm for histology image analysis by efficiently exploiting the underlying information present in unlabeled data, both in task-agnostic and task-specific ways.

The existing plethora of self-supervised learning (SSL) methods can be viewed as defining a surrogate task, i.e., a pretext task, which is formulated using only unlabeled examples and requires a high-level semantic understanding of the image to solve (Jing and Tian, 2020). The neural network model trained to solve this pretext task often learns useful visual representations that can be transferred to any downstream task to solve the task-specific problem. On the other hand, another important stream of work is based on semi-supervised learning (SmSL), which seeks to learn from both labeled and unlabeled examples, with limited manual annotations (Chapelle et al., 2010).
Among SmSL methods, the most recent and popular stream of approaches is based on consistency regularization (Laine and Aila, 2016; Sajjadi et al., 2016) and pseudo-labeling (hyun Lee, 2013; Sohn et al., 2020). The consistency enforcing strategy aims to constrain network predictions to be invariant to input or model weight perturbations, such as adding noise to the input data through different image augmentations (Xie et al., 2019), network dropout (Srivastava et al., 2014) and stochastic depth (Huang et al., 2016). The main idea is that the model should predict similar labels for both the input image and a perturbed (augmented) version of the same image. Approaches of this kind include temporal ensembling (Laine and Aila, 2016), mean teacher (Tarvainen and Valpola, 2017) and virtual adversarial training (Miyato et al., 2018). Alternatively, pseudo-labeling imputes artificial (pseudo) labels for unlabeled data, obtained from the class predictions of a model trained using labeled data alone (Sohn et al., 2020). The success of these SmSL approaches is attributed to the fact that the models implicitly learn to fit decision boundaries by grouping similar images to share similar labels, forming high-density clusters in the input feature space.

Despite significant advancements among SSL and SmSL approaches, they still suffer from some major limitations. Several SSL methods assume that optimizing the pretext task objective will invariably yield suitable downstream representations for the target task. However, many recent studies (Zoph et al., 2020; Yan et al., 2020; Goyal et al., 2019) have shown that SSL methods overfit to the pretraining objective and may not generalize well to the downstream task. On the other hand, methods based on SmSL approaches generally struggle to learn effectively when the number of labeled instances is scarce and the labels are noisy (Rebuffi et al., 2020).
This is a typical scenario in histopathology, where the number of manually labeled annotations is small and labels are often noisy (Shi et al., 2020). Furthermore, when the ratio of labeled to unlabeled samples is highly imbalanced, models trained solely on the consistency strategy have very low accuracy and higher entropy, which prevents them from achieving high-confidence scores (i.e., pseudo labels) on unlabeled data (Kim et al., 2020).

To address these shortcomings, several recent studies explored the feasibility of integrating the merits of both SSL and SmSL approaches to efficiently exploit the limited available labeled target data with abundant unlabeled data, to enhance the performance on downstream tasks (Zhai et al., 2019; Rebuffi et al., 2020). These approaches first aim to initialize a good latent representation of the data by formulating a pretext objective in a task-agnostic way, without using any labels. Later, these pretrained representations are effectively transferred to the downstream tasks by reinitializing these features via an SmSL approach in a task-specific way. The idea of bootstrapping features trained via an SSL algorithm has been shown to improve on an SmSL approach by preventing overfitting on the target domain (Zhai et al., 2019).

In this paper, we take inspiration from the above observations and propose a novel self-supervised driven semi-supervised learning framework for histopathology image analysis, which harnesses the unlabeled data in both a task-agnostic and a task-specific manner. To this end, we first present a simple yet effective self-supervised pretext task, namely Resolution Sequence Prediction (RSP), which leverages the multi-resolution contextual information present in the pyramidal nature of histology whole-slide images (WSIs). Our design choice is inspired by the way a pathologist searches for cancerous regions in a WSI.
Typically, a pathologist zooms in and out of each region, where the tissue is examined at high to low resolution to obtain the details of individual cells and their surroundings. In this work, we show that exploiting such meaningful multi-resolution contextual information provides a powerful surrogate supervisory signal for unsupervised representation learning. Second, we further develop a 'teacher-student' semi-supervised consistency paradigm by efficiently transferring the self-supervised pretrained representations to downstream tasks. Our approach can be viewed as a knowledge distillation method (Hinton et al., 2015), where the self-supervised teacher model learns to generate pseudo labels for the task-specific unlabeled data, which forces the student model to make predictions consistent with the teacher model. We experimentally show that initializing the student model with the SSL pretrained teacher model achieves robustness against noisy input data (i.e., noise is injected through various kinds of domain-specific augmentations) and helps learn faster than the teacher in practice. Our whole framework is trained in an end-to-end manner to seamlessly integrate the information present in labeled and unlabeled data in both task-specific and task-agnostic ways.

The major contributions of this paper are:

• We propose a novel self-supervised pretext task for generating unsupervised visual representations via predicting the resolution sequence ordering in the pyramidal structure of histology WSIs.

• We compare against state-of-the-art self-supervised pretraining methods based on generative and contrastive learning techniques: Variational Autoencoder (VAE) (Kingma and Welling, 2013) and Momentum Contrast (MoCo) (He et al., 2020), respectively.
• We present a new 'teacher-student' semi-supervised consistency paradigm by efficiently transferring the self-supervised pretrained representations to downstream tasks based on prediction consistency with the task-specific unlabeled data.

• We extensively validate our method on three benchmark datasets across two classification and one regression based histopathology tasks, i.e., tumor metastasis detection, tissue type classification, and tumor cellularity quantification. The proposed self-supervised method, along with consistency training, is shown to improve the performance on all three datasets, especially in the less annotated data regime.

The paper is organized as follows: we first briefly introduce the related work in Section 2. In Section 3, we present the details of our proposed methodology. Datasets and experimental results are described in Section 4. Finally, we discuss the key findings and limitations of our work in Section 6, followed by the conclusion in Section 7.
2. Related works
In this section, for brevity, we review only the recent developments in the self-supervised and semi-supervised representation learning literature that are closely relevant to our work.
Self-supervised (SSL) representation learning has recently gained momentum in many medical image analysis tasks for reducing the manual annotation burden. These approaches aim to construct different types of auxiliary pretext tasks, where the supervisory signals are generated from the data itself. Such pretraining of a convolutional neural network (CNN) designed to solve these pretext tasks results in useful feature representations that can be used to initialize a subsequent CNN on data with limited labels. The design of a pretext task is often based on domain-specific knowledge, like image context restoration (Chen et al., 2019), anatomical position prediction (Bai et al., 2019), 3D distance prediction (Spitzer et al., 2018), Rubik's cube recovery (Zhuang et al., 2019) and image intrinsic spatial offset prediction (Blendowski et al., 2019). For instance, Chen et al. (2019) proposed an image context restoration task for 2D fetal ultrasound image classification, CT abdominal multi-organ localization, and tumor segmentation in brain MR images. Blendowski et al. (2019) extend the context prediction task to 3D by designing image-intrinsic spatial offset relations to learn pretrained features. Similarly, Zhuang et al. (2019) extend the self-supervised approach to 3D volumetric medical images by solving a Rubik's cube recovery task for brain hemorrhage classification and tumor segmentation. Bai et al. (2019) proposed to learn anatomical position prediction as a supervisory signal for cardiac MR image segmentation. Spitzer et al. (2018) designed a pretext task based on 3D distance prediction between two sampled patches from the same subject for segmenting brain areas as the target task.
Many such pretext tasks are designed based on ad-hoc heuristics, limiting the generalizability of the learned representations.

An alternative stream of approaches is based on generative modeling (such as the VAE (Kingma and Welling, 2013), GAN-based models (Dumoulin et al., 2016; Donahue et al., 2016) and other variants), which implicitly learn representations by minimizing the reconstruction loss in the pixel space. Compared with discriminative ones, generative approaches are overly focused on pixel-level details, thus limiting their ability to model complex structures present in an image. Recently, a new family of discriminative methods has been proposed based on contrastive learning, which learns to enforce similarities in the latent space between similar/dissimilar pairs (He et al., 2020; Oord et al., 2018). In such methods, similarity is defined through maximising mutual information (Oord et al., 2018) or via different data transformations (Chen et al., 2020). For example, Lu et al. (2019) combined attention based multiple instance learning with contrastive predictive coding for weakly supervised histology classification. Chaitanya et al. (2020) extended the contrastive learning approach to segmentation of volumetric medical images by utilizing domain and problem-specific cues for efficient segmentation in three MRI datasets. Finally, Li et al. (2020b) proposed a patient feature-based softmax embedding to learn multi-modal SSL representations for diagnosing retinal disease.

The existing semi-supervised learning (SmSL) techniques can be broadly categorized into three groups: i) adversarial training-based (Zhang et al., 2017; Diaz-Pinto et al., 2019; Quiros et al., 2019); ii) graph-based (Shi et al., 2020; Javed et al., 2020; Aviles-Rivero et al., 2019); and iii) consistency-based (Li et al., 2020a; Zhou et al., 2020; Li et al., 2020c; Su et al., 2019; Liu et al., 2020) approaches.
Adversarial training based SmSL approaches learn a generative and a discriminative model simultaneously by forcing the discriminator to output class labels, instead of estimating the input probability distribution as in a normal generative adversarial network (GAN). For example, Zhang et al. (2017) proposed a segmentation and evaluation network, where the segmentation network is encouraged to obtain segmentation masks for unlabeled images, while the evaluation network is forced to distinguish the segmentation results from an annotated mask by assigning different scores. Diaz-Pinto et al. (2019) proposed a GAN framework for retinal image synthesis by utilizing both labeled and unlabeled data for training a glaucoma classifier. Meanwhile, Quiros et al. (2019) generated pathologically meaningful representations to synthesize high fidelity H&E breast cancer tissue images, which resemble real tissue. On the other hand, graph based methods construct a graph that establishes a semantic relationship between its neighbors and utilize the transduction of the graph to assign labels to unlabeled data via label propagation. As a typical example, Aviles-Rivero et al. (2019) proposed a graph-based SSL model for chest X-ray classification, where the pseudo labels for unlabeled data are generated using label propagation. In histology, Javed et al. (2020) introduced a graph-based community detection algorithm for identifying seven tissue phenotypes in WSIs. In more recent work, Shi et al. (2020) utilized a graph-based self-ensembling approach to create an ensemble target for each label prediction using an exponential moving average (EMA), and minimized the distance between the label prediction and its ensemble target via a consistency cost.
Such self-ensembling based approaches are shown to be robust to noisy labels compared to other graph-based methods.

The most recent line of work in SmSL is based on consistency regularization, which enforces the consistency of predictions under random perturbations such as data augmentations (French et al., 2017), stochastic regularization (Laine and Aila, 2016; Sajjadi et al., 2016), and adversarial perturbation (Miyato et al., 2018). More recently, Tarvainen and Valpola (2017) proposed the mean teacher (MT) framework that averages the model weights instead of the EMA of the label predictions to enhance the quality of consistency targets. These strategies were recently extended to several medical image analysis tasks. For instance, Li et al. (2020c) introduced a transformation consistent self-ensembling model for segmenting three medical datasets. Several extensions to MT have also been explored by enforcing prediction consistency in a region-based (Zhou et al., 2020), relation-based (Liu et al., 2020; Su et al., 2019) or cross-domain based (Li et al., 2020a) manner, subject to various domain-specific perturbations.
3. Methods
An overview of the proposed self-supervised driven consistency training approach is illustrated in Fig. 1. Our framework consists of three main stages: i) we pretrain a self-supervised model F_pre on an unlabeled set D_pre to obtain task-agnostic feature representations; ii) we fine-tune the SSL model on a limited amount of labeled data D_fl to obtain the task-specific features; iii) we further aim to improve the downstream performance on the target task by using both labeled D_fl and unlabeled D_fu data in a task-specific semi-supervised manner. Both teacher F_t and student F_s networks are initialized with the fine-tuned model F_ft for consistency training on the target task. The main objective is to optimize the student network, which learns to minimize the supervised loss on the labeled set (D_fl) and the consistency loss on the unlabeled set (D_fu). During consistency training, the teacher network is trained to predict the pseudo-label on a weakly augmented unlabeled image. A student network then tries to match this pseudo-label by making its prediction on a strongly augmented version of the same unlabeled image. We update only the student network weights during training while keeping the teacher network weights frozen, and we make the student the new teacher after every epoch and iterate until convergence.

We first start by describing the self-supervised representation learning framework based on three different pretext categories (Jing and Tian, 2020), namely: context-based, generative-based and contrastive-based methods. The pretext tasks are designed to solve the task-agnostic problem in a self-supervised manner, where the class labels to train the network are generated automatically from the data itself. These pretrained representations can be transferred to multiple downstream tasks by fine-tuning a network on the limited labeled training examples in a task-specific way.
In our work, we first start pretraining a convolutional neural network (convNet) F_pre on an unlabeled pretraining set D_pre to obtain generalized feature representations in a task-agnostic manner.

Let us denote the pretraining set as D_pre = {x_i}_{i=1}^M, consisting of M unlabeled training samples. In histopathology, the input x_i ∈ R^{H×W×3} denotes an RGB image patch sampled from a gigapixel WSI, with height (H) and width (W); and y_i ∈ C is the class label for x_i, with C = {0, 1} for classification or R for regression. Our goal is to learn a feature embedding F_θ(.) in an unsupervised manner that maps an unlabeled sample x_i to a low-dimensional embedding F_θ(x_i): R^{H×W×3} → R^d, with d being the feature dimension and F(.) denoting the neural network parameterized by θ.

Given a set of M training samples D_pre = {x_i}_{i=1}^M, the self-supervised training aims to optimize the following objective:

L_pre = min_θ (1/M) Σ_{i=1}^M loss(x_i, p_i),   (1)

where p_i are the pseudo labels generated automatically from the self-supervised pretext tasks. In this paper, we investigate several popular self-supervised pretraining paradigms for histopathology, including the generative-based Variational Autoencoder (VAE), the contrastive-based Momentum Contrast (MoCo), and finally, the proposed context-based
Resolution Sequence Prediction (RSP) framework. The details are presented next.
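The objective in Eq. (1) is simply an average pretext loss over automatically pseudo-labeled samples. As a minimal, framework-free sketch (the names `pretext_objective` and `pseudo_label_fn` are illustrative, not from the paper):

```python
def pretext_objective(samples, pseudo_label_fn, model, loss):
    """Empirical form of Eq. (1): average loss between the model's prediction
    on x_i and the pseudo-label p_i generated from the data itself."""
    total = 0.0
    for x in samples:
        p = pseudo_label_fn(x)  # e.g. a resolution-sequence class; no manual annotation
        total += loss(model(x), p)
    return total / len(samples)
```

Minimizing this quantity over the model parameters θ yields the task-agnostic representation; the three pretext paradigms considered here differ only in how the pseudo-labels and the loss are defined.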
Figure 1: Our self-supervised driven consistency training approach for histopathology image analysis. Our approach consists of three main stages: i) we pretrain a self-supervised (SSL) model F_pre on the unlabeled set D_pre to obtain task-agnostic feature representations; ii) we fine-tune the SSL model on a limited amount of labeled data D_fl to obtain the task-specific downstream features; iii) we further improve the downstream performance on the target task by using both the labeled (D_fl) and unlabeled (D_fu) sets in a task-specific semi-supervised manner. Both teacher F_t and student F_s networks are initialized with the fine-tuned model F_ft for consistency training on the target task. The main objective is to optimize the student network, which learns to minimize the supervised loss L_s on the labeled set D_fl and the consistency loss L_c on the unlabeled set D_fu. The consistency loss is measured between the pseudo labels produced by the teacher network on weakly augmented unlabeled input images and the labels predicted by the student network on strongly augmented unlabeled input images. Note: all the above networks (F_pre, F_ft, F_t, F_s) share the same backbone ResNet-18 architecture.

Our self-supervised design choice for the "Resolution Sequence Prediction (RSP)" task is inspired by how a pathologist examines a WSI during diagnosis for potential abnormalities. Typically, a pathologist switches multiple times between lower magnification levels for context and higher magnification levels for detail. Such multi-resolution, multi-field-of-view (FOV) analysis is possible due to the WSI's pyramidal nature, where multiple downsampled versions of the original image are stored in a pyramidal structure.

In this work, we exploit this multi-resolution nature of WSIs by proposing a novel self-supervised pretext task, which learns image representations by training convNets to predict the order of all possible sequences of resolution that can be created from the input multi-resolution patches. We argue that solving this resolution prediction task will allow a CNN to learn useful visual representations that inherently capture both contextual information (at lower magnification) and fine-grained details (at higher magnification levels).

Specifically, we create 3-tuples of randomly shuffled multi-resolution patches sampled from the input WSI, and formulate our resolution sequence prediction task as a multi-class classification problem. Formally, we construct a tuple of three concentric multi-resolution RGB image patches (S1, S2, S3) ∈ R^{P×P×3} extracted at three different magnification levels, such that the spatial resolution of S1 << S2 << S3 (measured in µm/px). We extract multiple concentric same-size patches (P × P × 3) such that the highest magnification patch (S1) lies inside the central square region of the other two (S2, S3) lower magnification patches. A sample set of multi-resolution concentric patches is shown in Fig. 2. These sets of patches form an input tuple to our self-supervised RSP framework. For brevity, we only consider a tuple of three input patches from a given WSI, for which there are 3! = 6 possible resolution sequence orderings, as illustrated in Fig. 2.

To achieve our goal, given an input multi-resolution sequence x ∈ (S1, S2, S3)_P ∈ R^{P×P×3} among the P possible permutations, we aim to train a siamese convNet model (Koch et al., 2015) F_pre to predict the label y ∈ R^P (i.e., the order of resolution sequences over the P ∈ {1, 2, ..., 6} possible classes), which is given by:

F_pre(x | θ) = {F^y_pre(x | θ)}_{y=1}^P,   (2)

where F^y_pre(x | θ) is the predicted class probability for the input sequence x with label y, and θ is the learnable parameter of the model F(.). Therefore, given a set of M training samples from the unlabeled set D_pre = {x_i}_{i=1}^M, the convNet model learns to solve the objective function defined in Eq. 1 by minimizing the categorical cross-entropy (CE) loss defined by:

loss(x_i, y_i) = −log(F^{y_i}_pre(x_i | θ)).   (3)

Figure 2: The resolution sequence prediction (RSP) pretext task that we propose for self-supervised representation learning. Given a tuple of three input multi-resolution patches, sampled from the 3! = 6 possible resolution sequences, the model F(.) is trained to predict the label y ∈ R^P corresponding to the order of the resolution sequence, where P ∈ {1, 2, ..., 6}. The framework consists of three stages: i) feature extraction; ii) pair-wise feature extraction; and iii) resolution sequence prediction.

The proposed RSP framework has three main stages: i) feature extraction; ii) pair-wise feature extraction; and iii) resolution sequence prediction. In the first stage, we adopt a siamese based architecture to obtain features for each input multi-resolution patch, where all three network branches share the same parameters. In our work, we adopt the commonly used ResNet-18 model to obtain the features h_i = F(x_i) after the global average pooling layer, where h_i ∈ R^d is a latent vector of dimension 512.

An additional crucial part of self-supervised pretraining is preparing the training data. To prevent the model from picking up on low-level cues and learning only trivial features, we make the sequence prediction task more difficult by applying various geometric transformations on the input data. The details of these geometric transformations are discussed thoroughly in Section 4.2.
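The construction of the 3! = 6 permutation classes for the pretext task can be sketched in a few stdlib lines; `make_rsp_example` is a hypothetical helper name, and the patch placeholders stand in for the actual image tensors:

```python
import itertools
import random

# All 3! = 6 orderings of the three magnification levels; the index of an
# ordering in this list serves as the class label for the pretext task.
RESOLUTION_ORDERS = list(itertools.permutations([0, 1, 2]))

def make_rsp_example(patches, rng=random):
    """Given the tuple of concentric multi-resolution patches [S1, S2, S3],
    return (shuffled_patches, class_label in 0..5)."""
    order = rng.choice(RESOLUTION_ORDERS)           # pick one of 6 sequences
    shuffled = [patches[i] for i in order]          # shuffle the tuple
    return shuffled, RESOLUTION_ORDERS.index(order) # label = permutation index
```

The siamese network then receives the shuffled tuple and is trained with cross-entropy (Eq. 3) against the permutation index, so no manual annotation is ever required.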
In the second stage, we perform pair-wise feature extraction on the extracted feature vectors h_i to capture the intrinsic relationship between the multi-resolution frames. Specifically, we concatenate the features of each pair of input patches (i.e., Concat(h1, h2), Concat(h2, h3), Concat(h1, h3)) to obtain a feature vector h_ij ∈ R^{d=1024}. Next, we use a multi-layer perceptron (MLP) with one hidden layer to obtain z_i = g(h_ij) = W_2 σ(W_1 h_ij), where σ denotes a ReLU and the bias is ignored for simplicity. Finally, in the third stage, the pair-wise features (z_i's) are concatenated and passed to a final linear layer that predicts over the P resolution sequence classes.

Momentum Contrast (MoCo) (He et al., 2020) is one of the most popular self-supervised models and even outperforms supervised baseline models. Given a data point x in a dataset, MoCo samples a positive pair k+ and N negative pairs k−_1, ..., k−_N. MoCo is trained with the infoNCE loss (Oord et al., 2018), defined as

L_infoNCE = − E_{x∼p(x)} [ log ( exp(F_q(x) · F_k(k+)/τ) / (exp(F_q(x) · F_k(k+)/τ) + Σ_{i=1}^N exp(F_q(x) · F_k(k−_i)/τ)) ) ],   (4)

where F_q and F_k are neural networks, and τ is a temperature hyperparameter. This is the log loss of a softmax classifier which minimizes the difference between the representation F_q(x) and its positive pair F_k(k+) while maximizing the differences between F_q(x) and the negative pairs F_k(k−_1, ..., k−_N). Note that minimizing L_infoNCE maximizes a lower bound on the mutual information between x and k+ (Oord et al., 2018). However, the bound is not tight for a small number of N; therefore, in practice, we need to use a large number of negative samples in each iteration. As this is not practical for computational efficiency, MoCo maintains a large queue of encoded data. At each training iteration, the entire mini-batch, consisting of a positive sample and negative samples, is inserted into the queue.
Therefore, we use the entire queue (except the positive sample) as the set of negatives for the infoNCE loss. One of the key observations made by MoCo is that this can be problematic if the encoder F changes too quickly, as this would cause a large discrepancy between the distribution of the samples in the queue and the new samples, and the classifier could easily decrease the loss. To solve this problem, MoCo uses two networks: the encoder F_q with parameters θ_q and the momentum encoder F_k with parameters θ_k. F_k is not trained with the infoNCE loss but is updated with momentum parameter m:

θ_k = m θ_k + (1 − m) θ_q,   (5)

after each training iteration. We use a queue size of 8192 and m of 0.999, and adopt multiple augmentation schemes. In each training iteration, for each data point x, we randomly i) jitter the brightness, contrast, saturation, and hue (jitter factor 0.6), ii) rotate by up to 360 degrees, iii) flip vertically & horizontally, and iv) crop with an area in the range 0.7 ∼ 1.0 of the original image.

The Variational autoencoder (VAE) (Kingma and Welling, 2013) is an unsupervised machine learning model that is often used for dimensionality reduction and image generation. The model contains an encoder and a decoder, with a latent space that has a dimension smaller than the input data. The reduction in dimension in the latent space helps extract the prominent information in the original data. Unlike vanilla autoencoders, the VAE assumes that the input data come from some latent distribution z ∼ N(0, I). The encoder estimates the mean (µ) and variance (σ) of the data in the latent space, and the decoder samples a point from the distribution for data reconstruction. The assumption of z following a normal distribution and the stochastic property of the latent vector force the model to create a continuous latent space, with similar data closer in the space. This resolves the model overfitting due to irregularities in the latent space often observed in conventional autoencoders. The learning rule of the VAE is to maximize the evidence lower bound (ELBO):

ELBO = E_{z∼q(z|x)} [log p(x|z)] − KL(q(z|x) || p(z)) ≤ log p(x),   (6)

where q is the approximate posterior distribution for p(z|x). The first term describes the reconstruction loss of the autoencoder model. The second term can be seen as a regularizer that forces the approximate latent distribution to be close to N(0, I). Standard stochastic gradient descent methods cannot be applied directly to the model because of the stochastic property of the latent vector. The solution, called the reparameterization trick, is to introduce a new random variable ε ∼ N(0, I) as a model input and set the latent vector to z = µ(x) + σ^{1/2}(x) · ε. This makes all model parameters deterministic.

For our VAE model, we use a ResNet-18 model to encode an input image of size 256 × 256 to a latent vector of size 512. Then, we use the generator from the BigGAN model (Brock et al., 2018) to reconstruct the latent vector back to the original image.
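A minimal sketch of the reparameterization trick and the closed-form KL regularizer of Eq. (6) for a diagonal Gaussian posterior (function names are illustrative; a real implementation would operate on framework tensors rather than Python lists):

```python
import math
import random

def reparameterize(mu, var, rng=random):
    """Reparameterization trick: z = mu(x) + var(x)^{1/2} * eps, eps ~ N(0, I),
    which makes the sampling step differentiable w.r.t. mu and var."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]

def kl_to_standard_normal(mu, var):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior --
    the regularizer term in the ELBO of Eq. (6)."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))
```

A full VAE would add the reconstruction term E_q[log p(x|z)] computed from the decoder output to complete the ELBO.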
The unsupervised learned representations are now transferred to the downstream task using limited labeled data D_fl in a task-specific way. It is common practice to fine-tune the entire pretrained network when the downstream data is large and similar to the original pretraining data. Hence, we choose to fine-tune all layers in the pretrained network F_pre by initializing with the pretrained weights to obtain task-specific representations: F_ft(x_i) = W_ft F_pre(x_i), where W_ft is the weight of the task-specific linear layer. Specifically, we fine-tune the entire network (all layers) with limited labels, along with a linear classifier or a regressor trained on top of the learned representations to obtain task-specific predictions.

The goal of consistency training (CR) is to obtain similar model predictions for differently augmented versions of the same input image (Laine and Aila, 2016; Sajjadi et al., 2016). We leverage this idea to further improve the task-specific (downstream) performance by using a second set of unlabeled data D_fu in a task-specific semi-supervised manner. In general, most existing SSL approaches utilize the entire task-specific training set D_f to fine-tune the pretrained model on the downstream tasks. The main objective of SSL is to develop universal feature representations that can solve a wide variety of tasks on many datasets. Although many recent pretraining approaches (Chen et al., 2020; He et al., 2020; Chen et al., 2019; Chaitanya et al., 2020) have shown tremendous success in both computer vision and medical imaging, they still fail to adapt to new target tasks. A recent study by Zoph et al. (2020) reveals that the value of pretraining diminishes with stronger data augmentation and with the use of more task-specific labeled data.
Further, the authors have shown that self-supervised pretraining is beneficial only when fine-tuned with a limited amount of labeled data, whereas model performance deteriorates with the use of a more extensive label set. This raises an important question: to what degree does SSL work, and how much labeled data do we need to fine-tune the pretrained SSL model?

In our work, we focus on answering the above question by performing a set of control experiments, varying the amount of labeled data in both low-data and high-data regimes on three different histology datasets. To this end, we provide an elegant solution based on teacher-student consistency training to improve the downstream performance by exploiting the unlabeled data in a task-specific semi-supervised manner.

Our teacher-student consistency training (shown in Fig. 1) has three main steps:

i) We initialize the fine-tuned model F_ft as both the teacher F_t and the student F_s network. The teacher model weights are frozen across all layers (the entire network) except the last linear layer (classifier/regressor), while the student model weights are frozen only up to the output of the global average pooling layer, with an MLP with one hidden layer and a linear classifier/regressor trained on top of the learned task-specific feature representations.

ii) We use the teacher network F_t to generate pseudo labels on the deliberately noised unlabeled data D_u^f. Next, a student network F_s is trained via both a standard supervised loss (on labeled data) and a consistency loss (on unlabeled data); i.e., the supervised loss is evaluated by comparing against the ground-truth labels (cross-entropy (CE) for the classification task / mean squared error (MSE) for the regression task), while the consistency loss (CE for classification / MSE for regression) is obtained by comparing against the pseudo labels (i.e., logits for regression / one-hot labels for classification) of the teacher model.
iii) We update the weights of only the student model and iterate these steps by treating the student as a new teacher after every epoch, to relabel the unlabeled data and train a new student. In this way, our teacher-student consistency approach propagates the label information to the unlabeled data by constraining the model predictions to be consistent on the unlabeled data under different data augmentations.

We start by describing our method in the context of the semi-supervised learning (SmSL) paradigm for the downstream task. Let us consider the training data D^f (fine-tuning set), in which N samples are labeled inputs, D_l^f = {x_i, y_i}_{i=1}^{N}, and µN are unlabeled inputs, D_u^f = {x_i}_{i=1}^{µN}, where µ is a hyperparameter that determines the relative ratio of D_l^f and D_u^f. In practice, we include all labeled instances of D_l^f as part of the unlabeled set, without using their labels, when constructing D_u^f. Further note that we use different batch sizes for the labeled and unlabeled data, such that µN >> N. Formally, we aim to minimize the following objective (total loss):

min_θ Σ_{i=1}^{N} L_s(F_s(x_i; θ_s), y_i) + λ L_c({x_i}_{i=1}^{µN}; F(·), θ_t, η_w, θ_s, η_s),   (7)

where L_s is the supervised loss measured against the labeled inputs and L_c is the consistency loss evaluated between the same unlabeled inputs under different data augmentations. The term λ is the weighting factor, empirically set to 1, that controls the trade-off between the supervised and consistency losses. F(·) denotes the ConvNet model parameterized by θ, where θ_t and θ_s are the weights of the teacher and student networks, respectively, while η_w and η_s represent the weak and strong data augmentations applied to the teacher and student models, respectively.

Earlier works on consistency training (Sohn et al., 2020; Xie et al., 2019; Tarvainen and Valpola, 2017; Liu et al., 2020; Li et al., 2020) mainly focused on improving the quality of the consistency targets (pseudo labels) using either of two strategies: i) careful selection of domain-specific data augmentations; or ii) selection of a better teacher model rather than a simple replication of the student network. However, these approaches have some limitations. First, the predicted pseudo labels for the unlabeled data may be incorrect, since the model itself is used to generate them. If a high weight is assigned to these pseudo labels, the quality of learning may be hampered by misclassification, and the model may suffer from confirmation bias (Arazo et al., 2020). Second, instead of using a converged model (such as a pretrained one) to generate pseudo labels with high confidence scores, the models are trained from scratch, leading to lower accuracy and high entropy.

In this work, we aim to overcome these limitations by providing a solution that leverages the advantages of both strategies in a simple, efficient manner.
The main difference between our approach and other existing consistency training methods is two-fold: i) we make use of the task-specific fine-tuned model to generate high-confidence (i.e., low-entropy) consistency targets, instead of relying on the model being trained; ii) we experimentally show that, by aggressively injecting noise through various domain-specific data augmentations, the student model is forced to work harder to maintain consistency with the pseudo-labels produced by the teacher model. This ensures that the student network does not merely replicate the teacher's knowledge.

More formally, we define the consistency loss L_c for the regression task as the distance between the prediction of the teacher network F_t (with weights θ_t and noise η_w) and the prediction of the student model F_s (with weights θ_s and noise η_s):

L_c^regression = Σ_{i=1}^{µN} E_{η_w, η_s} || F_t(x_i, θ_t, η_w) − F_s(x_i, θ_s, η_s) ||²,   (8)

where x_i denotes each unlabeled training sample. In contrast, for the classification task, the consistency loss is calculated via the standard cross-entropy (CE) loss, defined by:

L_c^classification = Σ_{i=1}^{µN} max(q_i) · H(arg max(q_i), q̂_i),   (9)

where q_i = p_t(y_i | η_w(x_i)) is the class probability predicted by the teacher network F_t for input x_i under weak augmentation (η_w), and q̂_i = p_s(y_i | η_s(x_i)) is the class probability predicted by the student network F_s for input x_i under strong augmentation (η_s). H(·) denotes the CE between two probability distributions, and arg max(q_i) is the pseudo-label produced by the teacher network on the weakly augmented unlabeled input image η_w(x_i). In this work, we leverage two kinds of augmentations: weak and strong. The weak augmentation includes a simple horizontal flip and cropping, while for the strong augmentation we use the RandAugment technique (Cubuk et al., 2020).
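A minimal, framework-free sketch of the two consistency losses, assuming the teacher and student outputs are already available as plain Python lists (in practice they come from F_t and F_s under weak and strong augmentation, respectively):

```python
import math

def consistency_loss_classification(teacher_probs, student_probs):
    """Eq. (9): confidence-weighted cross-entropy between the teacher's hard
    pseudo-label arg max(q_i) and the student's prediction q_hat_i."""
    loss = 0.0
    for q, q_hat in zip(teacher_probs, student_probs):
        confidence = max(q)               # max(q_i), the teacher's confidence
        pseudo = q.index(confidence)      # arg max(q_i), the hard pseudo-label
        loss += confidence * -math.log(q_hat[pseudo])
    return loss

def consistency_loss_regression(teacher_preds, student_preds):
    """Eq. (8): squared distance (MSE-style) between teacher and student
    outputs on the same unlabeled input under different augmentations."""
    return sum((t - s) ** 2 for t, s in zip(teacher_preds, student_preds))
```

Note that the classification loss is minimized when the student assigns high probability to the teacher's pseudo-label, and confident teacher predictions contribute more to the gradient.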
The complete list of data augmentations and their parameter settings is given in Section 4.2.

During training, we only update the weights of the student network while keeping the teacher network weights frozen. The weights are updated by learning an MLP (with one hidden layer) and a task-specific linear classifier/regressor on the output of the global average pooling layer, with the rest of the layer weights frozen for the student network. Fine-tuning only the last layers (i.e., the one-hidden-layer MLP and a linear classifier/regressor) of the student model improves the task-specific performance by using both labeled and unlabeled data in a task-specific way. This is because the effect of pretraining and most feature re-use happens in the lowest layers of the network, while fine-tuning the higher layers changes the representations so that they are well adapted to the downstream task. This observation was also shown to be consistent in a recent study by Raghu et al. (2019). After every epoch, we make the student the new teacher, F_t ← F_s, and iterate this process until the model converges. The pseudocode for our proposed consistency training is illustrated in Algorithm 1.
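The student's partial freezing can be sketched as a small helper that decides which parameter groups stay trainable; the layer-name prefixes below are hypothetical stand-ins for illustration, not the paper's identifiers:

```python
def student_freeze_plan(param_names, head_prefixes=("mlp.", "classifier.")):
    """Return which student parameters stay trainable: only the one-hidden-layer
    MLP and the linear classifier/regressor on top of the global-average-pooled
    features. Everything below (the backbone) is frozen."""
    return {name: any(name.startswith(p) for p in head_prefixes)
            for name in param_names}
```

In a deep learning framework, the same plan would typically be applied by setting `requires_grad` (or the equivalent flag) per parameter before building the optimizer.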
4. Experiments
We evaluate the efficacy of our method on one regression and two classification tasks on histopathology benchmark datasets, including BreastPathQ (Martel et al., 2019), Camelyon16 (Bejnordi et al., 2017), and Kather multiclass (Kather et al., 2019). Further, we also present extensive ablation experiments and compare against state-of-the-art SSL methods while varying the percentage of labeled data.

As baselines, we compare our SSL approach (i.e., RSP) with two other popular SSL methods, VAE (Kingma and Welling, 2013) and MoCo (He et al., 2020), as well as with a randomly initialized supervised baseline. To further evaluate our approach with task-specific consistency training, we fine-tune the same pretrained models a second time using different percentages of task-specific labeled data. In our experiments, we first initialize the teacher-student model with the fine-tuned SSL model trained on different percentages of labeled data: 10%, 25%, 50%, and 100% (depicted as "self-supervised pretraining and supervised fine-tuning" in Tables 3, 4, 5). Next, we train each of these fine-tuned models a second time using labeled and unlabeled samples, again varying the percentage of labeled data.

Algorithm 1: Consistency training pseudocode
Inputs: D_l^f = {x_i, y_i}_{i=1}^{N}; D_u^f = {x_i}_{i=1}^{µN}
  µ = ratio of unlabeled data
  λ = weighting factor for the consistency loss
  F_ft = fine-tuned model
  F_t = teacher model with parameters θ_t
  F_s = student model with parameters θ_s
  η_w = set of weak augmentations
  η_s = set of strong augmentations
  η = set of augmentations applied to labeled data

Initialize:
  F_t ← F_ft, with weights frozen across the entire network
  F_s ← F_ft, with weights frozen up to the output of the global average pooling layer; an MLP with one hidden layer + a linear classifier/regressor is trained on top

for t in [1, num_epochs] do
  for each minibatch B do
    z_i^S ← F_s(η(x_i)), for x_i ∈ B
    z_i^U ← F_t(η_w(x_i)), for x_i ∈ µB
    ẑ_i^U ← F_s(η_s(x_i)), for x_i ∈ µB
    q_i = p_t(y_i | η_w(x_i); θ_t)    // prediction computed by F_t
    q̂_i = p_s(y_i | η_s(x_i); θ_s)    // prediction computed by F_s
    L_s^classification = −(1/|B|) Σ_{i ∈ B} log z_i^S[y_i]    // supervised loss for classification
    L_s^regression = (1/|B|) Σ_{i ∈ B} || z_i^S − y_i ||²    // supervised loss for regression
    L_c^classification = (1/|µB|) Σ_{i ∈ µB} max(q_i) · H(arg max(q_i), q̂_i)    // consistency loss for classification
    L_c^regression = (1/|µB|) Σ_{i ∈ µB} || z_i^U − ẑ_i^U ||²    // consistency loss for regression
    loss ← L_s + λ L_c
    update θ_s using the optimizer
  end
  F_t ← F_s    // make the student the new teacher and go back to step 2
end
return θ_s

We report the final results of this second training stage (depicted as "consistency training (CR)" in Tables 3, 4, 5). Note: this experimental setting is kept standard across all three datasets for a fair evaluation.
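The outer loop of Algorithm 1 (pseudo-label, fit a student, promote it to teacher) can be sketched with the per-minibatch update abstracted away; `fit_student` below is a placeholder for the supervised-plus-consistency optimization step, not the paper's implementation:

```python
def teacher_student_rounds(init_model, fit_student, unlabeled, num_epochs):
    """Outer loop of Algorithm 1: the teacher pseudo-labels the unlabeled pool,
    a student is fit against those targets (plus the supervised loss, hidden
    inside fit_student), and the student becomes the new teacher each epoch."""
    teacher = init_model
    for _ in range(num_epochs):
        pseudo_labels = [teacher(x) for x in unlabeled]  # F_t on (weakly augmented) x
        student = fit_student(pseudo_labels)             # minimizes L_s + lambda * L_c
        teacher = student                                # F_t <- F_s
    return teacher
```

Because the teacher starts from the converged, task-specific fine-tuned model, the very first round of pseudo-labels is already low-entropy, which is the first of the two differences from prior consistency-training work noted above.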
The distribution of the number of WSIs and their corresponding patches across the three datasets used in our experiments is shown in Table 1. In this section, we briefly describe the three publicly available datasets, whereas the data-specific implementation details, such as the pretraining, fine-tuning, and test splits adopted in our experiments, are explained in their respective subsections.
BreastPathQ dataset:
This is a publicly available dataset consisting of 96 hematoxylin and eosin (H&E) stained WSIs of post-NAT-BRCA specimens (Martel et al., 2019; Peikari et al., 2017), scanned at 20× magnification (0.5 µm/px). A set of 2579 patches, each of dimension 512 × 512, was extracted from these WSIs.

Camelyon16 dataset:
We performed classification of breast cancer metastases at the image level on the dataset from the Camelyon16 challenge (Bejnordi et al., 2017). This dataset contains 399 H&E stained WSIs of breast lymph nodes, split into 270 for training and 129 for testing. The images were acquired from two different centers and scanned at 40× and 20× magnification.

Table 1: The total number of WSIs and patches used in each dataset to perform the experiments.
Dataset                         Pretrain              Fine-tune             Test
                                Train    Validation   Train    Validation
BreastPathQ        WSIs         69                    69                    25
                   Patches      10000    3000         2063     516          1121
Camelyon16         WSIs         60                    210                   129
                   Patches      62156    10000        306303   40000        –
Kather Multiclass  WSIs         –                     –                     –
                   Patches      –        –            80K      20K          7180
Kather multiclass dataset:
This dataset contains two subsets of patches covering nine tissue classes: adipose, background, debris, lymphocytes, mucus, smooth muscle, normal colon mucosa, cancer-associated stroma, and colorectal cancer epithelium (Kather et al., 2019). Of the two subsets, the training set consists of 100K image patches of H&E stained colorectal cancer images of size 224 × 224 at 0.5 µm/pixel spatial resolution. The second subset, used as the test set, contains 7180 image patches. In this dataset, only patches are made available, without access to the WSIs.

We perform all our experiments with ResNet-18 as the base feature embedding network, using the methods outlined in Section 3, on all three datasets. All experiments are performed on 4 NVIDIA Tesla V100 GPUs, and the entire framework is implemented in PyTorch. We first specify the implementation details common to all datasets; data-specific implementations are provided in Table 2.

For self-supervised pretraining: The model is trained for 250 epochs with a batch size of 64. We employ an (SGD with Nesterov momentum + Lookahead) optimizer (Zhang et al., 2019), with a momentum of 0.9, weight decay, and a constant learning rate of 0.01. For Lookahead, we set k = 5 and α = 0.5. The best pretrained model is chosen based on the lowest validation loss across the BreastPathQ, Camelyon16, and Kather datasets.

We use the domain-specific data augmentations recommended by Tellez et al. (2019a), including rotations, horizontal flips, scaling, additive Gaussian noise, brightness and contrast perturbations, shifts of the hue and saturation values in HSV color space, and perturbations in H&E color space. We also add random resized crops, blur, and affine transformations to this list. Specifically, we use a bounded rotation factor, a scaling factor between [0.8, 1.2], zero-mean additive Gaussian noise, an affine transformation with small translation, scale, and rotation limits, hue and saturation intensity ratios between [-0.1, 0.1] and [-1, 1], respectively, brightness and contrast intensity ratios between [-0.2, 0.2], blurring of the input image with a random-sized kernel in the range [3, 7], and random resizing and cropping of the image patch back to its original size. Finally, we perturb the intensity of the hematoxylin and eosin (HED) color channels with a factor of [-0.035, 0.035]. We apply these transformations in sequence, selecting them randomly in each mini-batch, to obtain a diverse set of training images.

For supervised fine-tuning: We fine-tune the entire pretrained SSL model (all layers), with a linear classifier or regressor trained on top of the learned representations, using limited labels (10%, 25%, 50%, and 100% of the labeled examples), to directly evaluate the performance of the RSP, VAE, and MoCo models. In particular, for RSP we entirely discard the last MLP with one hidden layer after pretraining and fine-tune a linear layer on top of the d = (256 × 3) = 768-dimensional embedding, followed by a softmax, to obtain task-specific predictions. For VAE and MoCo, we fine-tune a linear layer on the 512-d feature vector.

For fine-tuning, we use different sets of hyperparameters for the three datasets, provided in Table 2. Further, we include a simple set of augmentations (η, as depicted in Algorithm 1): rotation, scaling, and random resized crops. For scaling, we use a factor of [0.8, 1.2], and we randomly resize and crop the image patch back to its original size.

Table 2: List of hyperparameters used in our experiments across all three datasets.

Supervised fine-tuning          BreastPathQ             Camelyon16              Kather Multiclass
  Epochs                        90                      90                      90
  Batch size                    4                       16                      64
  Learning rate (lr)            0.0001                  0.0005                  0.00001
  Optimizer                     Adam                    SGD with Nesterov       Adam
                                                        momentum of 0.9
  Scheduler                     MultiStep, lr decayed at [30, 60] epochs by 0.1 (all datasets)
  Best-model selection          Lowest validation loss  Highest val. accuracy   Highest val. accuracy

Consistency training
  Epochs                        90                      90                      90
  Batch size                    4                       8                       8
  Ratio of unlabeled data (µ)   7                       7                       7
  Learning rate (lr)            0.0001                  0.0005                  0.00001
  Optimizer                     Adam                    SGD with Nesterov       Adam
                                                        momentum of 0.9
  Scheduler                     MultiStep, lr decayed at [30, 60] epochs by 0.1 (all datasets)
  Best-model selection          Lowest validation loss  Highest val. accuracy   Highest val. accuracy

For consistency training: We use a semi-supervised approach for consistency training, exploiting labeled and unlabeled examples in a task-specific manner. We adopt the same task-specific fine-tuned model as both the teacher and the student network, with the teacher network weights frozen (all layers), while training the student network's one-hidden-layer MLP and task-specific linear layer (classifier/regressor) on the output of the global average pooling layer (with the rest of the layer weights frozen). All hyperparameters related to consistency training are shown in Table 2. In our experiments, we initialize the teacher-student model with the fine-tuned SSL model trained on different percentages of labeled data (10%, 25%, 50%, and 100%). Next, we train each of these fine-tuned models a second time using labeled and unlabeled samples, again varying the percentage of labeled data, and report the final results.

In our work, we use two kinds of augmentations for consistency training: "weak" augmentations for the teacher network and "strong" augmentations for the student network. For the teacher network, we employ simple transformations, such as a horizontal flip and random cropping back to the original image size, as weak augmentations.
For the student network, we adopt a set of transformations similar to the pretraining stage, but with different hyperparameters that strengthen the augmentation severity; these are referred to as strong augmentations. The following augmentations differ in their parameters from the pretraining stage: an affine transformation with a translation limit of [0.01, 0.1], a scale limit of [0.51, 0.60], and a rotation of 90°; HSV intensity ratios between [-1, 1]; and blurring of the input image with a random-sized kernel in the range [5, 7]. We apply these augmentations sequentially, selecting them randomly in each mini-batch using the RandAugment technique (Cubuk et al., 2020). In our experiments, N_Aug denotes the number of augmentations applied sequentially in each mini-batch, and M_g is the magnitude, sampled within a pre-defined range [1, 10], that controls the severity of distortion in each mini-batch.

In this experiment, we train our approach to automatically quantify tumor cellularity (TC) scores in digitized slides of breast cancer images for tumor burden assessment. The TC score is defined as the percentage of the total area occupied by malignant tumor cells in a given image patch (Peikari et al., 2017). For pretraining the SSL approach, we use the 69 WSIs of the training set, from which we randomly extract patches of size 256 × 256 at 20×, 10×, and 5× magnification for RSP, while for VAE and MoCo, patches are extracted at 20× magnification. We perform fine-tuning by resizing the image patches to 256 × 256. Performance is measured by the intra-class correlation coefficient (ICC) between the proposed methods and the two pathologists A and B.

Table 3 presents the ICC values for the different methodologies, and the corresponding TC scores produced by each method on sample WSIs of the BreastPathQ test set are shown in Fig. A.3 (in the Appendix).
The consistency training (CR) improved the results of the self-supervised pretrained models (VAE, MoCo, and RSP) by a 3% increase in ICC values. Further, all SSL and CR methods exhibit strong performance, close to or even outperforming the supervised baseline (random) on all training subsets. Among all methods, RSP + CR achieves the best score of greater than 0.90, which even surpasses the intra-rater agreement score of 0.89 (Akbar et al., 2019). Moreover, our TC score of 0.90 on the BreastPathQ test set is superior to state-of-the-art (SOTA) methods (Akbar et al., 2019; Rakhlin et al., 2019), whose maximum score is 0.883. Specifically, our RSP + CR approach achieves at least a 4% higher ICC value than VAE + CR and MoCo + CR, and at least a 17% improvement in ICC value over the supervised baseline, when trained on the 10% labeled set (≈206 image patches). In contrast, on the complete training set, all CR methods exhibit competitive and similar performance. This indicates that consistency training improves upon self-supervised pretraining predominantly in the low-data regime.

Table 3: Results on the BreastPathQ dataset. Predicting the percentage of tumor cellularity (TC) at the patch level (intra-class correlation (ICC) coefficients with respect to the two pathologists A and B). The 95% confidence intervals (CI) are shown in square brackets. We bold the best results. Training subsets: 10% (206 labels), 25% (516 labels), 50% (1031 labels), 100% (2063 labels).

Self-supervised pretraining + Supervised fine-tuning:
- Random | 10%: A 0.697 [0.67, 0.73], B 0.637 [0.60, 0.67] | 25%: A 0.786 [0.76, 0.81], B 0.727 [0.70, 0.75] | 50%: A 0.812 [0.79, 0.83], B 0.797 [0.77, 0.82] | 100%: A 0.863 [0.85, 0.88], B 0.843 [0.83, 0.86]
- VAE
- MoCo
- RSP (ours)

Consistency training (CR):
- Random + CR | 10%: A 0.658 [0.62, 0.69], B 0.630 [0.59, 0.66] | 25%: A 0.818 [0.80, 0.84], B 0.802 [0.78, 0.82] | 50%: A 0.847 [0.83, 0.86], B 0.839 [0.82, 0.86] | 100%: A 0.891 [0.88, 0.90], B 0.891 [0.88, 0.90]
- VAE + CR | 10%: A 0.771 [0.75, 0.79], B 0.727 [0.70, 0.75] | 25%: A 0.842 [0.82, 0.86], B 0.826 [0.81, 0.84] | 50%: A 0.866 [0.85, 0.88], B 0.857 [0.84, 0.87] | 100%: A 0.884 [0.87, 0.90], B 0.864 [0.85, 0.88]
- MoCo + CR | 10%: A 0.808 [0.79, 0.83], B 0.803 [0.78, 0.82] | 25%: A 0.872 [0.86, 0.89]
- RSP + CR (ours) | 10%: A 0.876 [0.86, 0.89], B 0.846 [0.83, 0.86] | 25%: A 0.873 [0.86, 0.89]

This experiment is a slide-level binary classification task: identifying the presence of lymph node metastasis in WSIs using only slide-level labels. To experiment with limited annotations, we first perform self-supervised pretraining on 60 WSIs (35 normal and 25 tumor) that are set aside from the original training set. For pretraining, we randomly extract patches of size 256 × 256 at 40×, 20×, and 10× magnification for RSP, while for VAE and MoCo, patches are extracted at 40× magnification. Further, the downstream fine-tuning is performed on randomly extracted patches of size 256 × 256, with 306.3K patches for training and 40K patches (20K tumor + 20K normal) for validation. We finally evaluate the methods on the 129 WSIs of the test set (as shown in Table 1). We divide the fine-tuning set containing 306.3K patches into four incremental subsets of [10%, 25%, 50%, 100%], containing [30.6K, 76.5K, 153.1K, 306.3K] image patches, respectively.

We follow the same post-processing steps as Wang et al. (2016) to obtain slide-level predictions. We first train our proposed models to discriminate tumor vs. normal patches at the patch level. We then aggregate these patch-level predictions to create a heat map of tumor probability over the slide. Next, we extract several features, similar to Wang et al. (2016), from the heat map and train a slide-level support vector machine (SVM) classifier to make the slide-level prediction. We compare and evaluate all three SSL pretrained and CR methods against the corresponding supervised baseline. Performance is evaluated in terms of the area under the receiver operating characteristic curve (AUC) on the test set containing 129 WSIs. In addition, we evaluate the binary classification performance (accuracy (Acc)) on the patch-level data containing 40K patches (20K tumor + 20K normal) of the validation set. Further, we perform statistical significance tests by comparing pairs of AUCs between the consistency training and SSL methods using the two-tailed DeLong's test (Sun and Xu, 2014). All differences in AUC with a p-value < 0.05 were considered significant.

Table 4 presents the AUC scores for predicting slide-level tumor metastasis using the different methodologies. In the 10% label regime, the RSP and MoCo methods outperform the supervised baseline, whereas the performance of VAE decreases significantly compared to the other methods. Further, the RSP + CR approach significantly outperforms RSP by a margin of 2% on the 10% and 25% labeled sets. The proposed RSP + CR achieves a best score of 0.917 using the 25% labeled set (≈76K patches), compared to the winning method of the Camelyon16 challenge (Wang et al., 2016), which obtained an AUC of 0.925 using a fully supervised model trained on millions of image patches. Compared with the unsupervised representation learning methods proposed by Tellez et al. (2019b), our RSP + CR approach trained on 10% of the labels (≈30K patches) outperforms their top-performing BiGAN method, trained on 50K labeled samples, by a 13% higher AUC. Additionally, we evaluated the methods' performance on the validation set containing 40K patches (20K tumor + 20K normal), where MoCo (+ CR) outperformed the RSP and RSP + CR methods by a slight margin of 0.5% Acc on all percentages of the training subsets.

Table 4: Results on the Camelyon16 dataset. Predicting the presence of tumor metastasis at the WSI level (AUC) and patch-level classification performance (accuracy (Acc)). The DeLong method (Sun and Xu, 2014) was used to construct the 95% CIs, shown in square brackets. The best scores are shown in bold. Note: the patch-level accuracy is reported on the 40K patches of the validation set. Label counts: 10% (30,630 / 4000 labels), 25% (76,576 / 10,000), 50% (153,151 / 20,000), 100% (306,303 / 40,000).

Self-supervised pretraining + Supervised fine-tuning:
- Random | 10%: AUC 0.804 [0.72, 0.89]
- VAE | 10%: AUC 0.737 [0.64, 0.83], Acc 0.827 | 25%: AUC 0.814 [0.73, 0.89], Acc 0.864 | 50%: AUC 0.830 [0.75, 0.91], Acc 0.906 | 100%: AUC 0.818 [0.73, 0.90], Acc 0.907
- MoCo
- RSP (ours)

Consistency training (CR):
- Random + CR | 10%: AUC 0.659 [0.54, 0.77]
- VAE + CR | 10%: AUC 0.633 [0.55, 0.72], Acc 0.828 | 25%: AUC 0.719 [0.63, 0.81], Acc 0.863 | 50%: AUC 0.741 [0.64, 0.84], Acc 0.918 | 100%: AUC 0.779 [0.69, 0.87], Acc 0.928
- MoCo + CR | 10%: AUC 0.728 [0.63, 0.82], Acc 0.835 | 25%: AUC 0.742 [0.64, 0.84], Acc 0.902 | 50%: AUC 0.766 [0.67, 0.86], Acc 0.929 | 100%: AUC 0.825 [0.75, 0.90], Acc 0.946
- RSP + CR (ours) | 10%: AUC 0.855 [0.78, 0.92]

Most importantly, our experiments on the Camelyon16 dataset yield several insights into the generality of our approach in low- and high-label training scenarios. In the low-label regime, i.e., the patch-wise classification task on the validation set, with training labels ranging from 4K to 40K, we observe that adding consistency training improved the SSL model performance by up to a 2% increase in Acc. The AUCs of the consistency-trained models are statistically higher than those of the SSL pretrained models, with p-value < 0.02, across the 10% and 25% labeled sets. As we increase the number of labeled samples (50% to 100%), adding consistency training to the Random, VAE, and MoCo SSL pretrained models results in a noticeable drop in AUC values. The results for the RSP model still improved after consistency training in the high-label regime, but these differences were not statistically significant. Thus, in general, our approach works well in a limited-annotation setting, which is highly beneficial in the histopathology domain.

Further, we observe that the benefit of pretraining slightly diminishes as the amount of labeled data increases (from 10% (30K) to 100% (306K) labels), which deteriorates the value of the pretrained features and is consistent with the recent study by Zoph et al. (2020). Overall, our consistency training approach continues to improve task-specific performance only when trained with low-label data, and it is additive to pretraining.

Fig. A.4 (in the Appendix) highlights the tumor probability heat maps produced by the different methodologies. Visually, all self-supervised pretrained methods (VAE, MoCo, and RSP) focus on tumor areas with high probability, while the supervised baseline exhibits slightly lower probability values for the same tumor regions. We observe that most methods successfully identify the macro-metastases (Rows 1-3), with tumor diameters larger than 2 mm, in excellent agreement with the ground-truth annotations. However, the same methods struggle to precisely identify the micro-metastases (Row 4), with tumor diameters smaller than 2 mm, which is generally challenging even for fully supervised models.

Due to the unavailability of WSIs for this dataset, we could not perform self-supervised pretraining on it.
Instead, we used the SSL model pretrained on Camelyon16 to fine-tune and evaluate the patch-level performance, testing feature transferability between datasets with different tissue types/organs and resolution protocols. In our experiments, the downstream fine-tuning is performed on the 100K image patches of the training set, and testing on the 7180 images of the test set, with the patches resized to 256 × 256.

Table 5 presents the accuracy (Acc) and weighted F1 score for the classification of the nine colorectal tissue classes using the different methodologies. On this dataset, the MoCo + CR approach obtains a new state-of-the-art result, with an Acc of 0.990, a weighted F1 score of 0.953, and a macro AUC of 0.997, compared to the previous method (Kather et al., 2019), which obtained an Acc of 0.943. This underscores that our pretrained approaches generalize to unseen domains with different organs, tissue types, staining, and resolution protocols. All consistency-trained methods marginally outperform the SSL pretrained models on all labeled subsets. Further, the CR methods (RSP + CR, MoCo + CR, VAE + CR) outperform the supervised baseline by a 3% and 17% increase in Acc and F1 score, respectively. Compared to previous representation learning methods (Pati et al., 2020), our approach obtains a 3% improvement in Acc when trained on just 10% of the labels, relative to the previous method (Acc of 0.951) trained using 100% of the labels. Thus, in general, our approach can be applied to other domain-adaptation problems in histopathology, where target annotations are often limited or sometimes unavailable.
5. Ablation studies
In this section, we perform ablation experiments to study the importance of three components of our method: (i) the ratio of unlabeled data; (ii) the impact of strong augmentations on the student network; and (iii) the convergence behavior of consistency training. Due to time constraints, we perform these ablation studies with 10% labeled data on the BreastPathQ and Camelyon16 datasets. We exclude the Kather Multiclass dataset, as it was used to evaluate feature transferability between datasets, making it less suitable for this study.
The success of consistency training is mainly attributed to the amount of unlabeled data. From Table 6, we observe marginal to noticeable improvements in performance as we increase the ratio of unlabeled to labeled batch size (µ). This is consistent with the recent studies of Xie et al. (2019) and Sohn et al. (2020). For each fold increase in the ratio between unlabeled and labeled samples, the performance improves noticeably on Camelyon16, whereas the gain on BreastPathQ is rather small, since its number of training samples (2063 patches) is substantially lower than that of Camelyon16 (306K patches). On the other hand, increasing the ratio of unlabeled data while fine-tuning the pretrained model tends to converge faster than training the model from scratch. In essence, a large amount of unlabeled data is beneficial for obtaining better performance during consistency training.

The success of teacher-student consistency training depends crucially on the strong augmentation policies applied to the student network. Table 7 analyzes the impact of these augmentation policies on the final performance. In our experiments, we apply the augmentations sequentially, selecting them randomly in each mini-batch using the RandAugment (Cubuk et al., 2020) technique. We vary the total number of augmentations (N_Aug) from 1 to 7 and examine the effect of the strong augmentation policies (applied to the student network) during consistency training. From Table 7, we observe that as we gradually increase the severity of the augmentation policies in the student model, there are marginal to noticeable performance gains. This improvement is mainly visible when training on large amounts of unlabeled data (such as Camelyon16), where there is at least a 3% improvement in AUC as we increase the augmentation strength.
This suggests that adding strong augmentations to the student network is essential to prevent the model from merely replicating the teacher's knowledge and to gain further improvements in task-specific performance.
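The two ablated components above can be sketched in a few lines: each mini-batch contains N labeled and µN unlabeled samples (Table 6), and the student's unlabeled inputs are perturbed by N_Aug randomly chosen strong augmentations (Table 7). The snippet below is a minimal illustration of this sampling logic only; the augmentation names and helper functions are our own placeholders, not the exact RandAugment operations or training code used in the paper.

```python
import random

# Illustrative pool of strong augmentation operations. The paper draws from
# the RandAugment search space (Cubuk et al., 2020); these names are placeholders.
AUGMENTATION_POOL = [
    "autocontrast", "equalize", "rotate", "solarize",
    "color", "contrast", "brightness", "sharpness",
]

def sample_strong_policy(n_aug, rng=random):
    """Pick n_aug transformations uniformly at random (with replacement);
    they are then applied sequentially to the student's unlabeled input."""
    return [rng.choice(AUGMENTATION_POOL) for _ in range(n_aug)]

def minibatch_sizes(n_labeled, mu):
    """Mini-batch composition for consistency training: N labeled samples
    plus mu * N unlabeled samples (the ratio ablated in Table 6)."""
    return n_labeled, mu * n_labeled
```

For example, with 16 labeled samples per batch and µ = 5, each training step would see 16 labeled and 80 unlabeled patches; increasing µ or n_aug reproduces the trends discussed above.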
Table 5: Results on the Kather Multiclass dataset. Classification of nine tissue types at patch level (accuracy (Acc), weighted F1 score (F1)). This experiment is performed to assess the generalizability of pretrained features between different tissue types and resolutions. Pretraining is performed on Camelyon16 (Breast) and tested on Kather Multiclass (Colon). Best results in bold.

  % Training Data:       10% (8000 labels)   25% (20000 labels)   50% (40000 labels)   100% (80000 labels)
  Methods                Acc     F1          Acc     F1           Acc     F1           Acc     F1
  Self-supervised pretraining + Supervised fine-tuning
  Random                 0.972   0.873       0.974   0.885        0.979   0.905        0.983   0.920
  VAE                    0.963   0.835       0.972   0.885        0.980   0.908
  MoCo
  RSP (ours) + CR        0.938   0.670       0.943   0.735        0.941   0.723        0.939   0.707
  VAE + CR               0.972   0.876       0.979   0.906        0.978   0.903        0.982   0.915
  MoCo + CR
  RSP + CR (ours)

Table 6: Impact of the ratio of unlabeled data (µ). These experiments are performed with N = 10% labeled and µN unlabeled samples in each mini-batch. Note: the intra-class correlation (ICC) coefficient is evaluated against two pathologists, A and B, for BreastPathQ.

  Ratio of unlabeled     BreastPathQ              Camelyon16
  data (µ)               ICC (P_A)   ICC (P_B)    AUC     Acc
  1                      0.871       0.851        0.738   0.904
  2                      0.871       0.851        0.785   0.903
  3                      0.876       0.846        0.797   0.907
  4                      0.876       0.856        0.803   0.911
  5                      0.880       0.861        0.810   0.914
  6

Table 7: Impact of the strong augmentation policies applied to the student network. The transformations are applied sequentially by randomly selecting them in each mini-batch.

  No. of possible             BreastPathQ              Camelyon16
  transformations (N_Aug)     ICC (P_A)   ICC (P_B)    AUC     Acc
  1                           0.883       0.863        0.569   0.895
  2

6. Discussions

With the advancements in deep learning techniques, current histopathology image analysis methods have shown excellent human-level performance on various tasks such as tumor detection (Campanella et al., 2019), cancer grading (Bulten et al., 2020), and survival prediction (Wulczyn et al., 2020). However, to achieve these satisfactory results, these methods require a large amount of labeled data for training. Acquiring such massive annotations is laborious and tedious in practice. Thus, there is great potential to explore self-/semi-supervised approaches that can alleviate the annotation burden by effectively exploiting unlabeled data. In this spirit, we propose a self-supervised driven consistency training method for histology image analysis that leverages unlabeled data in both a task-agnostic and a task-specific manner. We first formulate self-supervised pretraining as a resolution sequence prediction task that learns meaningful visual representations across multiple resolutions in a WSI. Next, teacher-student consistency training is employed to improve task-specific performance based on prediction consistency with the unlabeled data. Our method is validated on three public histology datasets, i.e., BreastPathQ, Camelyon16, and Kather Multiclass, on which it consistently outperforms other self-supervised methods as well as the supervised baseline under a limited-label regime. Our method has also shown its efficacy in transferring pretrained features across datasets with different tissue types/organs and resolution protocols.

Despite the excellent performance of our method, there is one main limitation: if the pseudo labels produced by the teacher network are inaccurate, the student network is forced to learn from incorrect labels, leading to confirmation bias (Arazo et al., 2020). As a result, the student may not become better than the teacher during consistency training. We addressed this issue with RandAugment (Cubuk et al., 2020), a strong data augmentation technique, which we combine with label smoothing (soft pseudo labels). This is consistent with the recent study (Arazo et al., 2020) showing that soft pseudo labels outperform hard pseudo labels when dealing with label noise. However, the bias issue still persists with soft pseudo labels in our application. This is prominently visible in our method, where, compared to self-supervised pretraining (see Fig. A.4, columns (c)-(f); Fig. A.3, columns (b)-(e)), the consistency-trained approaches (see Fig. A.4, columns (g)-(j); Fig. A.3, columns (f)-(i)) exhibit some low-probability (< 0.5) spurious pixels outside the malignant cell boundaries. This happens because of the naive pseudo labeling produced by the teacher network, which sometimes overfits to incorrect pseudo labels. The issue is further reinforced when the student network is trained on unlabeled samples with incorrect pseudo labels. One way to mitigate this is to make the teacher network constantly adapt to feedback from the student model, instead of keeping the teacher fixed. This has been shown to work well in the recent meta pseudo labels technique (Pham et al., 2020), where teacher and student are trained in parallel and the teacher learns from a reward signal based on the student's performance on a labeled set. Exploring this idea is beyond the scope of this work, and we leave it to future exploration.

In general, our proposed self-supervised driven consistency training framework has great potential for both classification and regression tasks in computational histopathology, where annotation scarcity is a significant issue. Further, our pretrained representations are generic and can be easily extended to other downstream tasks, such as segmentation and survival prediction. It is worth investigating further to develop a universal feature encoder in histopathology that can solve many tasks without the need for excessive labeled annotations.
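To make the soft-pseudo-label mechanism discussed above concrete, the sketch below shows one plausible reading of the consistency objective: the teacher's softmax output is smoothed toward a uniform distribution, and the student is trained with cross-entropy against this soft target on a strongly augmented view of the same unlabeled patch. The smoothing coefficient and the exact loss form are illustrative assumptions, not the paper's verbatim formulation.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def smooth(probs, eps=0.1):
    """Label smoothing: mix the teacher's soft pseudo label with a uniform
    distribution, which softens over-confident (possibly wrong) targets."""
    k = len(probs)
    return [(1.0 - eps) * p + eps / k for p in probs]

def consistency_loss(teacher_logits, student_logits, eps=0.1):
    """Cross-entropy between the smoothed teacher pseudo label and the
    student's prediction on a strongly augmented view of the same patch."""
    target = smooth(softmax(teacher_logits), eps)
    pred = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target, pred))
```

When the teacher is confidently wrong on an unlabeled sample, smoothing bounds how hard the student is pushed toward the wrong class; this is the mechanism invoked above, which mitigates but, as noted, does not eliminate confirmation bias.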
7. Conclusion
In this paper, we present an annotation-efficient framework by introducing a novel self-supervised driven consistency training paradigm for histopathology image analysis. The proposed framework utilizes unlabeled data in both a task-agnostic and a task-specific manner to significantly advance the accuracy and robustness of state-of-the-art self-supervised learning (SSL) methods. To this end, we first propose a novel task-agnostic self-supervised pretext task that efficiently harnesses the multi-resolution contextual cues present in histology whole-slide images. We further develop a task-specific teacher-student semi-supervised consistency method to effectively distill the SSL pretrained representations to downstream tasks. This synergistic use of unlabeled data has been shown to improve SSL pretrained performance, over the supervised baseline, under a limited-label regime. Extensive experiments on three public benchmark datasets, across two classification and one regression based histopathology tasks, i.e., tumor metastasis detection, tissue type classification, and tumor cellularity quantification, demonstrate the effectiveness of our proposed approach. Our experiments also show that our method significantly outperforms, or is at least comparable to, the supervised baseline when trained under limited annotation settings. Furthermore, our approach is generic and has been shown to generate universal pretrained representations that can be easily adapted to other histopathology tasks, and to other domains, without any modifications.

Conflict of interest
ALM is co-founder and CSO of Pathcore. CS, SK, and FC have no financial or non-financial conflicts of interest.
Acknowledgment
This research is funded by: Canadian Cancer Society (grant number 705772); National Cancer Institute of the National Institutes of Health (grant number U24CA199374-01); Canadian Institutes of Health Research.
Appendix A. Supplementary material

• Fig. A.3: Tumor cellularity scores produced on WSIs of the BreastPathQ test set for 10% labeled data.

• Fig. A.4: Tumor probability heat-maps overlaid on original WSIs from the Camelyon16 test set, predicted from 10% labeled data.
References
Akbar, S., Peikari, M., Salama, S., Panah, A.Y., Nofech-Mozes, S., Martel, A.L., 2019. Automated and manual quantification of tumour cellularity in digital slides for tumour burden assessment. Scientific Reports 9, 1–9.

Figure A.3: TC scores produced on WSIs of the BreastPathQ test set for 10% labeled data. (a) Original WSI overlaid with ground truth mask (annotation labels with pink square boxes denote 0% cellularity and green square boxes indicate 100% cellularity); (b)–(e) correspond to the TC score produced by the random (supervised), VAE, MoCo, and RSP approaches, respectively; (f)–(i) correspond to the TC score produced by the random + CR (supervised), VAE + CR, MoCo + CR, and RSP + CR methods, respectively. Blue denotes healthy (0% TC) and red denotes malignant (100% TC).

Figure A.4: Tumor probability heat-maps overlaid on original WSIs from the Camelyon16 test set, predicted from 10% labeled data. (a) Original WSI; (b) ground truth annotation mask; (c)–(f) correspond to the tumor probability produced by the random (supervised), VAE, MoCo, and RSP approaches, respectively; (g)–(j) correspond to the tumor probability produced by the random + CR (supervised), VAE + CR, MoCo + CR, and RSP + CR methods, respectively. The first three rows correspond to examples of macro-metastases (tumor cell cluster diameter ≥ 2 mm), while the last row corresponds to micro-metastases (tumor cell cluster diameter from > 0.2 mm to < 2 mm). Blue denotes healthy regions and red denotes tumor regions.

Arazo, E., Ortego, D., Albert, P., O'Connor, N.E., McGuinness, K., 2020. Pseudo-labeling and confirmation bias in deep semi-supervised learning, in: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.

Aviles-Rivero, A.I., Papadakis, N., Li, R., Sellars, P., Fan, Q., Tan, R.T., Schönlieb, C.B., 2019. GraphX-net: Chest X-ray classification under extreme minimal supervision. arXiv preprint arXiv:1907.10085.

Bai, W., Chen, C., Tarroni, G., Duan, J., Guitton, F., Petersen, S.E., Guo, Y., Matthews, P.M., Rueckert, D., 2019. Self-supervised learning for cardiac MR image segmentation by anatomical position prediction, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 541–549.

Bejnordi, B.E., Veta, M., Van Diest, P.J., Van Ginneken, B., Karssemeijer, N., Litjens, G., Van Der Laak, J.A., Hermsen, M., Manson, Q.F., Balkenhol, M., et al., 2017. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318, 2199–2210.

Bera, K., Schalper, K.A., Rimm, D.L., Velcheti, V., Madabhushi, A., 2019. Artificial intelligence in digital pathology—new tools for diagnosis and precision oncology. Nature Reviews Clinical Oncology 16, 703–715.

Blendowski, M., Nickisch, H., Heinrich, M.P., 2019. How to learn from unlabeled volume data: self-supervised 3D context feature learning, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 649–657.

Brock, A., Donahue, J., Simonyan, K., 2018. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.

Bulten, W., Pinckaers, H., van Boven, H., Vink, R., de Bel, T., van Ginneken, B., van der Laak, J., Hulsbergen-van de Kaa, C., Litjens, G., 2020. Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study. The Lancet Oncology 21, 233–241.

Campanella, G., Hanna, M.G., Geneslaw, L., Miraflor, A., Silva, V.W.K., Busam, K.J., Brogi, E., Reuter, V.E., Klimstra, D.S., Fuchs, T.J., 2019. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine 25, 1301–1309.

Chaitanya, K., Erdil, E., Karani, N., Konukoglu, E., 2020. Contrastive learning of global and local features for medical image segmentation with limited annotations. arXiv preprint arXiv:2006.10511.

Chapelle, O., Schölkopf, B., Zien, A., 2010. Semi-Supervised Learning. 1st ed., The MIT Press.

Chen, L., Bentley, P., Mori, K., Misawa, K., Fujiwara, M., Rueckert, D., 2019. Self-supervised learning for medical image analysis using image context restoration. Medical Image Analysis 58, 101539.

Chen, T., Kornblith, S., Norouzi, M., Hinton, G., 2020. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.

Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V., 2020. RandAugment: Practical automated data augmentation with a reduced search space, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703.

Diaz-Pinto, A., Colomer, A., Naranjo, V., Morales, S., Xu, Y., Frangi, A.F., 2019. Retinal image synthesis and semi-supervised learning for glaucoma assessment. IEEE Transactions on Medical Imaging 38, 2211–2218.

Donahue, J., Krähenbühl, P., Darrell, T., 2016. Adversarial feature learning. arXiv preprint arXiv:1605.09782.

Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., Courville, A., 2016. Adversarially learned inference. arXiv preprint arXiv:1606.00704.

French, G., Mackiewicz, M., Fisher, M., 2017. Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208.

Goyal, P., Mahajan, D., Gupta, A., Misra, I., 2019. Scaling and benchmarking self-supervised visual representation learning, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 6391–6400.

He, K., Fan, H., Wu, Y., Xie, S., Girshick, R., 2020. Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9729–9738.

Hinton, G., Vinyals, O., Dean, J., 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q., 2016. Deep networks with stochastic depth, in: European Conference on Computer Vision, pp. 646–661.

Javed, S., Mahmood, A., Fraz, M.M., Koohbanani, N.A., Benes, K., Tsang, Y.W., Hewitt, K., Epstein, D., Snead, D., Rajpoot, N., 2020. Cellular community detection for tissue phenotyping in colorectal cancer histology images. Medical Image Analysis, 101696.

Jing, L., Tian, Y., 2020. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.

Kather, J.N., Krisam, J., Charoentong, P., Luedde, T., Herpel, E., Weis, C.A., Gaiser, T., Marx, A., Valous, N.A., Ferber, D., et al., 2019. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS Medicine 16, e1002730.

Kim, J., Hur, Y., Park, S., Yang, E., Hwang, S.J., Shin, J., 2020. Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning. arXiv preprint arXiv:2007.08844.

Kingma, D.P., Welling, M., 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Koch, G., Zemel, R., Salakhutdinov, R., 2015. Siamese neural networks for one-shot image recognition, in: ICML Deep Learning Workshop, pp. 1–8.

Laine, S., Aila, T., 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.

Lee, D.H., 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, in: Workshop on Challenges in Representation Learning, ICML, pp. 1–6.

Li, K., Wang, S., Yu, L., Heng, P.A., 2020a. Dual-teacher: Integrating intra-domain and inter-domain teachers for annotation-efficient cardiac segmentation. arXiv preprint arXiv:2007.06279.

Li, X., Jia, M., Islam, M.T., Yu, L., Xing, L., 2020b. Self-supervised feature learning via exploiting multi-modal data for retinal disease diagnosis. IEEE Transactions on Medical Imaging.

Li, X., Yu, L., Chen, H., Fu, C.W., Xing, L., Heng, P.A., 2020c. Transformation-consistent self-ensembling model for semi-supervised medical image segmentation. IEEE Transactions on Neural Networks and Learning Systems, 1–12.

Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88.

Liu, Q., Yu, L., Luo, L., Dou, Q., Heng, P.A., 2020. Semi-supervised medical image classification with relation-driven self-ensembling model. IEEE Transactions on Medical Imaging.

Lu, M.Y., Chen, R.J., Wang, J., Dillon, D., Mahmood, F., 2019. Semi-supervised histology classification using deep multiple instance learning and contrastive predictive coding. arXiv preprint arXiv:1910.10825.

Madabhushi, A., Lee, G., 2016. Image analysis and machine learning in digital pathology: Challenges and opportunities. Medical Image Analysis 33, 170–175.

Martel, A.L., Nofech-Mozes, S., Salama, S., Akbar, S., Peikari, M., 2019. Assessment of residual breast cancer cellularity after neoadjuvant chemotherapy using digital pathology [data set]. https://doi.org//TCIA.2019.4YIBTJNO.

Miyato, T., Maeda, S.i., Koyama, M., Ishii, S., 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 1979–1993.

Oord, A.v.d., Li, Y., Vinyals, O., 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Pati, P., Foncubierta-Rodríguez, A., Goksel, O., Gabrani, M., 2020. Reducing annotation effort in digital pathology: A co-representation learning framework for classification tasks. Medical Image Analysis 67, 101859.

Peikari, M., Salama, S., Nofech-Mozes, S., Martel, A.L., 2017. Automatic cellularity assessment from post-treated breast surgical specimens. Cytometry Part A 91, 1078–1087.

Pham, H., Xie, Q., Dai, Z., Le, Q.V., 2020. Meta pseudo labels. arXiv preprint arXiv:2003.10580.

Quiros, A.C., Murray-Smith, R., Yuan, K., 2019. PathologyGAN: learning deep representations of cancer tissue. arXiv preprint arXiv:1907.02644.

Raghu, M., Zhang, C., Kleinberg, J., Bengio, S., 2019. Transfusion: Understanding transfer learning for medical imaging, in: Advances in Neural Information Processing Systems, pp. 3347–3357.

Rakhlin, A., Tiulpin, A., Shvets, A.A., Kalinin, A.A., Iglovikov, V.I., Nikolenko, S., 2019. Breast tumor cellularity assessment using deep neural networks, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0.

Rebuffi, S.A., Ehrhardt, S., Han, K., Vedaldi, A., Zisserman, A., 2020. Semi-supervised learning with scarce annotations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 762–763.

Sajjadi, M., Javanmardi, M., Tasdizen, T., 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning, in: Advances in Neural Information Processing Systems, pp. 1163–1171.

Shi, X., Su, H., Xing, F., Liang, Y., Qu, G., Yang, L., 2020. Graph temporal ensembling based semi-supervised convolutional neural network with noisy labels for histopathology image analysis. Medical Image Analysis 60, 101624.

Sohn, K., Berthelot, D., Li, C.L., Zhang, Z., Carlini, N., Cubuk, E.D., Kurakin, A., Zhang, H., Raffel, C., 2020. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.

Spitzer, H., Kiwitz, K., Amunts, K., Harmeling, S., Dickscheid, T., 2018. Improving cytoarchitectonic segmentation of human brain areas with self-supervised Siamese networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 663–671.

Srinidhi, C.L., Ciga, O., Martel, A.L., 2021. Deep neural network models for computational histopathology: A survey. Medical Image Analysis 67, 101813.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1929–1958.

Su, H., Shi, X., Cai, J., Yang, L., 2019. Local and global consistency regularized mean teacher for semi-supervised nuclei classification, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 559–567.

Sun, X., Xu, W., 2014. Fast implementation of DeLong's algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Processing Letters 21, 1389–1393.

Tarvainen, A., Valpola, H., 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, in: Advances in Neural Information Processing Systems, pp. 1195–1204.

Tellez, D., Litjens, G., Bándi, P., Bulten, W., Bokhorst, J.M., Ciompi, F., van der Laak, J., 2019a. Quantifying the eff