Learning domain-agnostic visual representation for computational pathology using medically-irrelevant style transfer augmentation
Rikiya Yamashita, Jin Long, Snikitha Banda, Jeanne Shen, Daniel L. Rubin
Stanford University
{rikiya, jinlong, jeannes, rubin}@stanford.edu

Abstract
Suboptimal generalization of machine learning models on unseen data is a key challenge which hampers the clinical applicability of such models to medical imaging. Although various methods such as domain adaptation and domain generalization have evolved to combat this challenge, learning robust and generalizable representations is core to medical image understanding, and continues to be a problem. Here, we propose
STRAP (Style TRansfer Augmentation for histoPathology), a form of data augmentation based on random style transfer from artistic paintings, for learning domain-agnostic visual representations in computational pathology. Style transfer replaces the low-level texture content of images with the uninformative style of randomly selected artistic paintings, while preserving high-level semantic content. This improves robustness to domain shift and can be used as a simple yet powerful tool for learning domain-agnostic representations. We demonstrate that STRAP leads to state-of-the-art performance, particularly in the presence of domain shifts, on a particular classification task of predicting microsatellite status in colorectal cancer using digitized histopathology images.

1 Introduction

While deep learning has demonstrated remarkable performance on medical imaging tasks over the past few years, the performance drop usually observed when generalizing from internal to external test data remains a key challenge in the medical application of machine learning models. Supervised learning assumes that training and testing data are sampled from the same distribution, i.e., in-distribution, while in practice, the training and testing data typically originate from related domains which follow different distributions, i.e., out-of-distribution. This phenomenon, known as domain shift [1], hampers the clinical applicability of such models, especially when the annotated datasets are limited in size or the target domain is highly heterogeneous.

One approach to tackling this domain shift problem is domain adaptation, which learns to align the feature distribution of the source domain with that of the target domain in a domain-invariant feature space. However, domain adaptation typically requires access to at least a few data samples from the target domain, i.e., testing data, during training, which is not always available for medical applications. Another approach is domain generalization, which aims to adapt from multiple labeled source domains to an unseen target domain without needing to access data samples from the target domain. One limitation of these approaches is that they assume the target data are homogeneously sampled from the same distribution, an unrealistic scenario in most real-world medical applications, where models must deal with mixed-domain data (e.g., different scanners, protocols, and medical sites) without domain labels. In the present study, we address the challenging yet practical problem of knowledge transfer from one labeled source domain to multiple target domains, a task referred to as domain-agnostic learning by Peng et al. [2].
Correspondence to [email protected]
Code: https://github.com/rikiyay/style-transfer-for-digital-pathology
Figure 1: Overview of STRAP.

Learning domain-agnostic representation is essential for the development of clinically applicable machine learning models, and a solution to domain-agnostic learning should learn domain-invariant and class-specific visual representations, as humans do.

Geirhos et al. [3] showed that 1) convolutional neural networks (CNNs) trained on the ImageNet dataset are biased towards texture, whereas humans are more reliant on global shape for distinguishing classes; 2) CNNs tend not to cope well with domain shifts, i.e., changes in image statistics from those on which the networks were trained to those which the networks have never seen before; and 3) increasing shape bias by training on a stylized version of ImageNet generated using style transfer improves accuracy, robustness, and generalizability.

Neural style transfer [4] refers to a CNN-based image transformation algorithm that manipulates the low-level texture representation of an image, i.e., its style, while preserving its semantic content. This method uses Gram matrices of the activations from different layers of a CNN to represent the style of an image. It then uses an iterative optimization method to generate a new image from white noise by matching its activations with those of the content image and its Gram matrices with those of the style image. Jackson et al. [5] demonstrated that, in computer vision tasks on natural images, data augmentation via style transfer with randomly selected artistic paintings as a style source improves robustness to domain shift, and can be used as a simple, domain-agnostic alternative to domain adaptation.

In medical imaging, machine learning models often suffer from domain shift in test data caused by heterogeneity from various sources, such as scanners, protocols, and medical sites. Human experts, such as radiologists and pathologists, are able to learn domain-agnostic visual representations and thus generalize well across domains, particularly in the presence of domain shifts. We postulate that 1) human experts in medical imaging are also biased towards shape rather than texture, as Geirhos et al. demonstrated [3], and 2) the low-level texture content of an image tends to be domain-specific, leading to suboptimal performance of deep learning models on domain-shifted unseen data, whereas high-level semantic content is more domain-invariant, and ubiquitous class-specific visual representations can be learned from it.
Here, we propose STRAP (Style TRansfer Augmentation for histoPathology), a form of data augmentation based on random style transfer with artistic paintings, as a solution for learning domain-agnostic visual representations, particularly in computational pathology (Figure 1). In this study, the term "domain" refers to scanners, stain and scan protocols, and, more broadly, medical sites. To investigate the potential of STRAP for learning domain-agnostic representations, we focused on the particular task of classifying colorectal cancer into two distinct genetic subtypes based on microsatellite status, microsatellite stable (MSS) and unstable (MSI), using image tiles generated from hematoxylin and eosin (H&E)-stained, formalin-fixed, paraffin-embedded (FFPE) whole-slide images (WSIs) of surgically resected colorectal cancers. We compare STRAP against two standard baseline methods widely used in computational pathology, i.e., stain normalization [6] and stain augmentation [7], which apply medically-relevant transformations to the source images, whereas STRAP performs a medically-irrelevant transformation (Figure 2).

Figure 2: Style transfer with artistic paintings as a style source (stylization coefficient of 1.0) applied to a histopathology image (content on the left). Overall geometry is preserved, but the style, including texture, color, and contrast, is replaced with an uninformative style of a randomly selected artistic painting.

Models were trained on a homogeneous single-source dataset and tested on a heterogeneous mixed-domain dataset. To gain insights into the differences in learning dynamics among the three methods, we performed two assessments: 1) we evaluated differential responses to the low-frequency components of the test data, and 2) we visualized saliency maps on the low-frequency components using integrated gradients [8]. These experiments were inspired by Wang et al. [9], who showed that 1) CNNs can exploit high-frequency image components which humans do not consciously perceive, and 2) models which exploit low-frequency components generalize better than those which exploit the high-frequency spectrum. We also assessed the effects of stylization coefficients and different style sources on STRAP model performance.

Our contributions are summarized as follows: 1) we present STRAP, a form of medically-irrelevant data augmentation based on random style transfer for computational pathology; 2) we utilize STRAP to improve downstream model performance and out-of-distribution generalizability on a heterogeneous mixed-domain test dataset for a classification task in computational pathology; and 3) our experiments suggest that STRAP helps models exploit low-frequency components of the data, on which humans tend to rely in recognizing objects [10].

2 Methods

2.1 Style transfer augmentation

Inspired by Geirhos et al. [3] and Jackson et al. [5], we propose STRAP, a form of medically-irrelevant data augmentation based on random style transfer for computational pathology, which replaces the style of a histopathology image (including texture, color, and contrast) with the uninformative style of a randomly selected non-medical image, while predominantly preserving global object shapes. We hypothesize that the style of histopathology images is domain-specific and class-irrelevant, whereas global object shape is domain-irrelevant and class-specific; therefore, STRAP facilitates learning domain-agnostic representations. We constructed a stylized version of the training dataset by applying AdaIN style transfer [11] following the method proposed in [3].
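For reference, the two operations can be written compactly; these are the standard formulations from [4] and [11], with \(\alpha\) denoting the stylization coefficient. Gatys et al. [4] represent style via layer-wise Gram matrices and optimize a generated image \(\hat{x}\) to match the content activations of the content image \(c\) and the Gram matrices of the style image \(s\):

\[
G^{l}_{ij}(x) = \sum_{k} F^{l}_{ik}(x)\,F^{l}_{jk}(x),
\qquad
\mathcal{L}_{\mathrm{style}}(\hat{x}, s) = \sum_{l} w_{l}\,\bigl\lVert G^{l}(\hat{x}) - G^{l}(s) \bigr\rVert_{F}^{2},
\]

where \(F^{l}(x)\) contains the vectorized feature maps of image \(x\) at layer \(l\). AdaIN [11] replaces this iterative optimization with a single feed-forward pass through an encoder \(f\) and a decoder \(g\):

\[
\operatorname{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y),
\qquad
T(c, s, \alpha) = g\bigl((1 - \alpha)\,f(c) + \alpha\,\operatorname{AdaIN}(f(c), f(s))\bigr),
\]

where \(\mu\) and \(\sigma\) are the channel-wise mean and standard deviation computed over spatial locations; \(\alpha = 1\) fully replaces the channel statistics (i.e., the style) of the content features.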
We prepared the stylized datasets in advance, since on-the-fly random style transfer data augmentation is computationally expensive (a minimal sketch of this offline pipeline appears after the dataset descriptions below). Each histopathology image was stylized with the style of a randomly selected artistic painting through AdaIN with a stylization coefficient (alpha) of 1.0. We used Kaggle's Painter by Numbers dataset of artistic paintings as a style source. We resized the content histopathology images and the style painting images to fixed sizes chosen to maintain geometric features during stylization. Examples of the stylized histopathology images are shown in Figure 2. We trained the STRAP model solely on the stylized version of the training dataset.

2.2 Datasets

We used three datasets for our analysis: Stanford-CRC [12], CRC-DX-TRAIN, and CRC-DX-TEST [13] (CRC stands for colorectal cancer). These datasets consist of image patches (tiles) of 224 × 224 pixels at a resolution of 0.5 µm/pixel, generated from the WSIs and subsequently stain normalized with Macenko's method [6].

The Stanford-CRC dataset originates from a single institution (i.e., it is a homogeneous single-source dataset) and contains image tiles from MSS and MSI H&E-stained FFPE WSIs from a cohort of unique patients. The WSIs were originally scanned at 40× base magnification (0.25 µm/pixel). This single-institutional dataset has an equal class distribution of MSS and MSI patients, and was used for model development in the out-of-distribution experiment described in section 2.3. To train the STRAP model on Stanford-CRC, we constructed a stylized version, termed Stylized-Stanford-CRC, by applying the style transfer method described in section 2.1, using artistic paintings as a style source.

The CRC-DX-TRAIN dataset stems from the TCGA-COAD and TCGA-READ diagnostic slide collections of The Cancer Genome Atlas (TCGA) [14], consisting of data from multiple institutions with various scanners and protocols (i.e., a heterogeneous mixed-domain dataset), and contains image tiles from MSS and MSI H&E-stained FFPE WSIs from a cohort of unique patients. The WSIs were scanned at either 20× or 40× base magnification (0.5 or 0.25 µm/pixel). This multi-institutional dataset was also balanced in class distribution, and was used for model development in the in-distribution analysis described in section 2.8. To train the STRAP model on CRC-DX-TRAIN, we constructed a stylized version of CRC-DX-TRAIN by applying the same method used to generate Stylized-Stanford-CRC, using artistic paintings as a style source.

The CRC-DX-TEST dataset stems from the same diagnostic slide collections of TCGA as the CRC-DX-TRAIN dataset (i.e., data from multiple institutions with various scanners and protocols) and contains image tiles from MSS and MSI H&E-stained FFPE WSIs from a cohort of unique patients. The WSIs were scanned at either 20× or 40× base magnification (0.5 or 0.25 µm/pixel). This multi-institutional dataset maintains a class imbalance reflecting the real-world prevalence of MSI in colorectal cancer, and was used solely for assessing model performance and generalizability.
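The offline stylization described in section 2.1 might look like the following minimal sketch. The directory paths and the `stylize` helper (standing in for a pretrained AdaIN encoder-decoder as in [11]) are hypothetical, not the authors' released code; image resizing is folded into the helper for brevity.

```python
import random
from pathlib import Path
from PIL import Image

from adain import stylize  # hypothetical AdaIN wrapper (pretrained encoder-decoder)

ALPHA = 1.0  # stylization coefficient used for the main STRAP model

content_tiles = sorted(Path("stanford_crc/tiles").glob("*.png"))  # hypothetical path
paintings = sorted(Path("painter_by_numbers").glob("*.jpg"))      # hypothetical path
out_dir = Path("stylized_stanford_crc")
out_dir.mkdir(exist_ok=True)

for tile_path in content_tiles:
    content = Image.open(tile_path).convert("RGB")
    # Pair each tile with the style of a randomly selected artistic painting.
    style = Image.open(random.choice(paintings)).convert("RGB")
    stylized = stylize(content, style, alpha=ALPHA)  # AdaIN feature blend + decode
    stylized.save(out_dir / tile_path.name)
```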
2.3 Out-of-distribution experiment

As our main analysis to assess STRAP's generalizability, we performed an out-of-distribution experiment, in which models were trained on Stanford-CRC and tested on the out-of-distribution CRC-DX-TEST dataset. We compared STRAP against two standard baseline approaches: stain augmentation (SA) and stain normalization (SN). The STRAP model was trained on Stylized-Stanford-CRC alone, whereas the SA model was trained on the non-stylized Stanford-CRC with on-the-fly stain augmentation following the method described by Tellez et al. [7] (Figure 3), and the SN model was trained on the non-stylized Stanford-CRC after stain normalization with Macenko's method [6]. Note that all image tiles used for the STRAP and SA methods were also stain normalized in advance using the same method as SN; the STRAP and SA approaches therefore build on the SN approach. Stain normalization is widely used in computational pathology to account for variations in H&E staining [15, 16, 12, 17]. On the other hand, Tellez et al. [7] demonstrated that stain augmentation improved classification performance compared to stain normalization, by increasing the CNN's ability to generalize to unseen stain variations.

Figure 3: Stain augmentation applied to a histopathology image (original on the left).
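Stain augmentation in the manner of Tellez et al. [7] is commonly implemented by jittering the haematoxylin, eosin, and DAB channels after color deconvolution. A minimal sketch using scikit-image follows; the jitter ranges are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def hed_stain_augment(rgb, sigma=0.05, bias=0.05, rng=None):
    """Randomly perturb each stain channel with a multiplicative
    factor and an additive bias (illustrative ranges)."""
    rng = np.random.default_rng() if rng is None else rng
    hed = rgb2hed(rgb)                       # deconvolve RGB into H, E, DAB stains
    alpha = rng.uniform(1 - sigma, 1 + sigma, size=3)
    beta = rng.uniform(-bias, bias, size=3)
    hed = hed * alpha + beta                 # channel-wise jitter
    return np.clip(hed2rgb(hed), 0.0, 1.0)   # recompose and clamp to valid RGB
```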
2.4 Model development and evaluation

All models were trained using the same method. We trained a MobileNetV2 [18] model pretrained on ImageNet [19] via transfer learning, using stochastic gradient descent with momentum [20] with a fixed learning rate, for a fixed number of epochs with early stopping (patience of five). We used a binary cross-entropy loss. All input images were resized to 224 × 224 pixels before being fed into the network. Random horizontal and vertical flipping and random resized cropping were applied as common data augmentation methods. Tile-wise model outputs were aggregated into a patient-wise score by taking their average. We applied 4-fold cross-validation to account for the selection bias introduced by randomness in data splitting, given the relatively limited sample size of Stanford-CRC. The metric of interest was the area under the receiver-operating-characteristic curve (AUROC). In each fold of the cross-validation, model performance was evaluated on the corresponding test subset of Stanford-CRC and on the entire CRC-DX-TEST dataset. The average AUROC and its standard deviation were computed subsequently. Of note, the performance of the STRAP model was assessed on the original (i.e., non-stylized) Stanford-CRC images for a fair comparison.

2.5 Evaluation on low-frequency components

To gain insights into which frequency components the three models (i.e., STRAP, SA, and SN) exploit for learning representations, we tested model performance on the low-frequency components of the CRC-DX-TEST dataset. We constructed the LF-CRC-DX-TEST dataset by following the method described in [9], in which all image tiles in the CRC-DX-TEST dataset were decomposed into low- and high-frequency components by applying the fast Fourier transform (FFT). Low-frequency components were obtained from the centralized frequency spectrum by applying circular low-pass filters with various radii; all frequencies outside the circle were set to zero, and the inverse FFT was applied subsequently (Figure 4). We also visualized saliency maps on the LF-CRC-DX-TEST using integrated gradients attributions [8] to highlight which pixels of an input image contribute most to model inference.
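For the saliency analysis, the sketch below uses Captum's implementation of integrated gradients; the MobileNetV2 weights and the random input tile are stand-ins, as the paper does not specify its attribution code.

```python
import torch
from torchvision import models
from captum.attr import IntegratedGradients

# Stand-ins: an ImageNet-pretrained MobileNetV2 and a random 224x224 "tile".
model = models.mobilenet_v2(weights="IMAGENET1K_V1").eval()
x = torch.rand(1, 3, 224, 224)

# Integrated gradients along the path from a black-image baseline to the input.
ig = IntegratedGradients(model)
attributions = ig.attribute(x, baselines=torch.zeros_like(x), target=0)

# Collapse channels to a per-pixel saliency map for visualization.
saliency = attributions.abs().sum(dim=1).squeeze(0)
```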
Figure 4: A schema for generating the low-frequency components of an image. Image tiles (224 × 224 pixels) are decomposed into low- and high-frequency components by applying the fast Fourier transform (FFT). Low-frequency components are extracted from the centralized frequency spectrum by applying circular low-pass filters with various radii (multiples of 14, from 14 to 154 pixels). All frequencies outside the circle are set to zero, and the inverse FFT is applied subsequently. Of note, the high-frequency components were not used in this study.
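A minimal NumPy sketch of the low-pass filtering step shown in Figure 4, assuming a float RGB image; the radii follow the figure:

```python
import numpy as np

def low_frequency_component(img, radius):
    """Zero out all frequencies farther than `radius` from the center of the
    shifted spectrum, then invert back to image space (per channel)."""
    h, w = img.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    out = np.empty_like(img, dtype=np.float64)
    for c in range(img.shape[2]):
        spec = np.fft.fftshift(np.fft.fft2(img[..., c]))  # centered spectrum
        out[..., c] = np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))
    return out

# Low-pass filter radii used to build LF-CRC-DX-TEST (per Figure 4): 14, 28, ..., 154.
radii = [i * 14 for i in range(1, 12)]
```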
2.6 Effect of different stylization coefficients

To test the effect of different stylization coefficients, we generated Stylized-Stanford-CRC with three different stylization coefficients and compared the corresponding STRAP model performance on the CRC-DX-TEST dataset.

2.7 Effect of different style sources

To evaluate the effect of different style sources, we created Stylized-Stanford-CRC with three distinct datasets as the style source: 1) the artistic paintings described in section 2.1, hereafter referred to as the Artistic Paintings style source; 2) the miniImageNet dataset proposed by Vinyals et al. [21] for few-shot learning, consisting of 60,000 color images from ImageNet spanning 100 classes with 600 examples each, hereafter referred to as the Natural Imaging style source; and 3) the original Stanford-CRC dataset without stain normalization (to preserve the original variability in staining), hereafter referred to as the Histopathologic Imaging style source. The former two apply medically-irrelevant transformations (Figures 2, 5), whereas the latter applies a medically-relevant transformation (Figure 6).

Figure 5: Style transfer with the Natural Imaging style source applied to a histopathology image (content on the left). Overall geometry is preserved, but the style, including texture, color, and contrast, is replaced with the uninformative style of a randomly selected natural image. The outputs are medically irrelevant and resemble the outputs using the Artistic Paintings style source.

Figure 6: Style transfer with randomly selected histopathologic images from the non-stain-normalized version of the Stanford-CRC dataset, applied to a histopathology image (content on the left). The outputs are medically relevant and resemble the outputs obtained with stain augmentation (Figure 3).

2.8 Comparison with state-of-the-art models

We further compared the STRAP model against two state-of-the-art models, Kather et al. [16] and Yamashita et al. [12], in the out-of-distribution scenario described in section 2.3, where models were trained on Stanford-CRC and tested on CRC-DX-TEST. Both models were trained on the non-stylized version of the dataset and used the same stain normalization technique as our SN approach. Thus, they can be considered variations of the SN approach, although there are some differences in model architecture, training protocol, and data augmentation configuration. For example, Kather et al. used a ResNet18 architecture and applied horizontal and vertical flips and random translation along the x and y axes for data augmentation. Similarly, Yamashita et al. used a MobileNetV2 architecture and applied data augmentation with random horizontal flips, random rotations, and random color jitter. Model performance for Kather et al. [16] and Yamashita et al. [12] was either computed using the code available at https://github.com/jnkather/MSIfromHE and https://github.com/rikiyay/MSINet, respectively, or obtained from the literature. In addition to the experiment involving the out-of-distribution scenario, we also conducted an in-distribution experiment, a replication of Kather et al. [16], in which models were trained on CRC-DX-TRAIN and tested on CRC-DX-TEST, both of which were sampled from the same mixed-domain data distribution. To train the STRAP model on CRC-DX-TRAIN, we constructed a stylized version of CRC-DX-TRAIN by applying the same method used to generate Stylized-Stanford-CRC with the Artistic Paintings style source. All of the other models, i.e., the SA and SN models, as well as Kather et al. and Yamashita et al., were trained on the non-stylized version of CRC-DX-TRAIN.
Of note, the CRC-DX-TRAIN and CRC-DX-TEST datasets were pre-split from the same diagnostic slide collections of TCGA by Kather [13]; therefore, cross-validation was not applied to this in-distribution experiment, and the same splits were used for comparison.

2.9 Statistical analysis

We assessed model performance on microsatellite status prediction using the AUROC, with confidence intervals (CI) calculated using bootstrapping with the percentile method. Statistical comparisons were performed using a paired t-test for average AUROCs derived from cross-validation, and using a bootstrapping test for individual AUROCs. There were four pairwise comparisons for the experiment described in section 2.8, where p-values were adjusted using the Benjamini-Hochberg method [22] to account for multiple comparisons by controlling the false discovery rate at 0.05. Otherwise, a two-tailed alpha criterion of 0.05 was used for statistical significance.

3 Results

The STRAP model achieved the highest average AUROC on the out-of-distribution, mixed-domain CRC-DX-TEST dataset upon cross-validation, outperforming the SA and SN models (Table 1). STRAP also demonstrated a minimal, even negative, performance drop from in-distribution to out-of-distribution testing (see column Delta in Table 1). These results suggest that the STRAP model can learn more discriminative and generalizable visual representations, compared to the other two models. Between SA and SN, SA showed higher model performance and a smaller performance drop.

| Method | Stanford-CRC → Stanford-CRC (ID) AUROC† | p-value (vs STRAP) | Stanford-CRC → CRC-DX-TEST (OOD) AUROC† | p-value (vs STRAP) | Delta‡ (ID − OOD) |
| --- | --- | --- | --- | --- | --- |
| STRAP |  | – |  | – | − |
| SA | 0.826 (0.139) | 0.439 | 0.814 (0.020) | 0.001* | 0.012 |
| SN | 0.810 (0.153) | 0.406 | 0.765 (0.031) | 0.002* | 0.045 |
Table 1: Comparison of style transfer augmentation (STRAP) with stain augmentation (SA) and stain normalization (SN).
Arrows indicate train data → test data; i.e., Stanford-CRC → CRC-DX-TEST means training on Stanford-CRC and testing on CRC-DX-TEST. * indicates a significant difference. † represents the average AUROC of models obtained via cross-validation, with standard deviation in parentheses. ‡ represents the average performance drop from ID testing (Stanford-CRC → Stanford-CRC) to OOD testing (Stanford-CRC → CRC-DX-TEST). A stylization coefficient (alpha) of 1.0 was used for the STRAP model. Abbreviations: AUROC, area under the receiver-operating-characteristic curve; CV, cross-validation; ID, in-distribution; OOD, out-of-distribution; SA, stain augmentation; SN, stain normalization; STRAP, style transfer augmentation.

We evaluated the STRAP, SA, and SN models on the LF-CRC-DX-TEST dataset with a wide range of low-pass filter sizes. As shown in Figure 7, the STRAP model reached its peak performance at a smaller low-pass filter radius than the other two models. These results suggest that the STRAP model can exploit lower-frequency components for learning representations, whereas the other two models rely on relatively higher-frequency components. This may explain the superior classification performance and generalizability of the STRAP model compared to the other two.

Figure 7: Results of the experiments using the low-frequency components of the CRC-DX-TEST dataset (LF-CRC-DX-TEST). The x-axis represents the radii of the low-pass filters used to generate the LF-CRC-DX-TEST dataset, and the y-axis shows the area under the receiver-operating-characteristic curve (AUROC). Each dot marker represents the corresponding peak performance.

Figure 8: Pixel-wise integrated gradients attributions of the low-frequency components (generated with a fixed low-pass filter radius) of the CRC-DX-TEST dataset (LF-CRC-DX-TEST), visualized as saliency maps for the STRAP, SA, and SN models.

Saliency maps with integrated gradients show that the STRAP model presented high attributions at specific areas/nuclei and less diffusely distributed attributions, whereas the SA and SN models showed more broadly distributed attributions that might correspond to the low-level texture content of the images (Figure 8).

We also tested the effect of the stylization coefficient on STRAP model performance. Among the three stylization coefficients tested, the largest coefficient (an alpha of 1.0) yielded the highest model performance, on average, over models obtained via cross-validation (Table 2). This suggests that the STRAP model learns more discriminative and generalizable representations when more low-level content is removed from the image by the style transfer operation.

| Stylization coefficient | Stanford-CRC → CRC-DX-TEST AUROC† | p-value (vs SC 1.0) |
| --- | --- | --- |
| SC 1.0 |  | – |
| SC |  |  |
| SC |  |  |

Table 2: Effect of stylization coefficient on STRAP model performance.
Arrow indicates train data → test data; i.e., Stanford-CRC → CRC-DX-TEST means training on Stanford-CRC and testing on CRC-DX-TEST. * indicates a significant difference. † represents the average AUROC of models obtained via cross-validation, with standard deviation in parentheses. Abbreviations: AUROC, area under the receiver-operating-characteristic curve; CV, cross-validation; SC, stylization coefficient.
Among the Artistic Paintings, Natural Imaging, and Histopathologic Imaging style sources, the Artistic Paintings and Natural Imaging style sources achieved superior performance for microsatellite status classification, compared to style transfer using histopathologic images as the style source. The difference was statistically significant for the Artistic Paintings style source, whereas there was no statistically significant difference between the Natural Imaging and Histopathologic Imaging style sources (Table 3).

| Style source | Stanford-CRC → CRC-DX-TEST AUROC† | p-value (vs Histopathologic Imaging) |
| --- | --- | --- |
| Artistic Paintings |  |  |
| Natural Imaging |  |  |
| Histopathologic Imaging |  | – |
Table 3: Effect of different style sources on STRAP model performance.
Arrow indicates train data → test data; i.e., Stanford-CRC → CRC-DX-TEST means training on Stanford-CRC and testing on CRC-DX-TEST. * indicates a significant difference. † represents the average AUROC of models obtained via cross-validation, with standard deviation in parentheses. A stylization coefficient (alpha) of 1.0 was used for the STRAP model. Abbreviations: AUROC, area under the receiver-operating-characteristic curve; CV, cross-validation.

The STRAP model outperformed the two state-of-the-art models, as well as the SA and SN models, in both the in-distribution and out-of-distribution scenarios (Table 4), which suggests that the STRAP model learns more domain-irrelevant and class-specific representations than the other approaches. The STRAP model achieved a negative performance drop from the in-distribution to the out-of-distribution scenario, whereas all three SN-based methods (SN, Kather et al., and Yamashita et al.) showed a positive performance drop. The performance drop for the SA model was relatively close to zero. These results suggest that the STRAP model has the potential to learn more robust visual representations when trained on a well-curated homogeneous dataset than when trained on a mixed-domain heterogeneous dataset. On the other hand, the SN-based models may exploit domain-specific features to some extent, leading to the larger performance drop from the in-distribution to the out-of-distribution setting.
| Method | CRC-DX-TRAIN → CRC-DX-TEST (ID) AUROC† | p-value (vs STRAP) | Stanford-CRC → CRC-DX-TEST (OOD) AUROC‡ | p-value (vs STRAP) | Delta§ (ID − OOD) |
| --- | --- | --- | --- | --- | --- |
| STRAP |  | – |  | – | − |
| SA | 0.816 [0.709, 0.917] | 0.471 | 0.814 (0.020) | 0.002* | 0.002 |
| SN | 0.794 [0.684, 0.892] | 0.456 | 0.765 (0.031) | 0.003* | 0.029 |
| Kather et al. | 0.759 [0.632, 0.873] | 0.219 | 0.742 (0.013) | 0.001* | 0.018 |
| Yamashita et al. | 0.816 [0.712, 0.914] | 0.456 | 0.786 (0.020) | 0.010* | 0.030 |
Table 4: Comparison of style transfer augmentation (STRAP), stain augmentation (SA), and stain normalization (SN) against state-of-the-art models in in-distribution and out-of-distribution scenarios.
Arrows indicate train data → test data; e.g., CRC-DX-TRAIN → CRC-DX-TEST means training on CRC-DX-TRAIN and testing on CRC-DX-TEST. * indicates a significant difference. † represents AUROC, with the CI in square brackets. ‡ represents the average AUROC of models obtained via cross-validation, with standard deviation in parentheses. § indicates the average performance drop from the in-distribution (CRC-DX-TRAIN → CRC-DX-TEST) to the out-of-distribution (Stanford-CRC → CRC-DX-TEST) scenario. A stylization coefficient (alpha) of 1.0 was used for the STRAP model. P-values were adjusted using the Benjamini-Hochberg method [22]. Abbreviations: AUROC, area under the receiver-operating-characteristic curve; CV, cross-validation; ID, in-distribution; OOD, out-of-distribution; SA, stain augmentation; SN, stain normalization; STRAP, style transfer augmentation.
4 Discussion

We present STRAP (Style TRansfer Augmentation for histoPathology), which achieved superior performance and generalizability compared with two standard baselines (stain augmentation (SA) and stain normalization (SN)), as well as two state-of-the-art models, on the particular classification task of predicting microsatellite status in colorectal cancer using digitized histopathology images.

We speculate that STRAP helps models learn domain-agnostic and class-specific visual representations by removing the original texture and/or high-frequency components from the histopathology images, which are domain-specific and class-irrelevant, while predominantly preserving shape-biased and/or low-frequency content, which is domain-irrelevant and class-specific. In fact, more intensive style transfer with a higher stylization coefficient resulted in superior performance. Furthermore, our experiments on the low-frequency components demonstrated that the STRAP approach helps models exploit lower-frequency components, in contrast to the standard SA and SN approaches, which rely on relatively higher-frequency components. This speculation is also consistent with the hypotheses proposed by Geirhos et al. [3] and Wang et al. [9]: that shape-biased and/or low-frequency features are essential for deep learning models to learn robust and generalizable visual representations.

To the best of our knowledge, no previous study has applied medically-irrelevant image manipulation to develop deep learning models for medical imaging. Four previous studies have applied the style transfer technique to medical imaging tasks in computational pathology [23, 24] and skin lesion classification [25, 26]. However, these studies employed medically-relevant transformations in order to combat data scarcity, class imbalance, and stain variation. Our study demonstrates that a medically-irrelevant transformation, i.e., STRAP with artistic paintings or natural images, can result in superior performance and generalizability, compared with medically-relevant transformations, i.e., style transfer with histopathology images and stain augmentation. A possible explanation for this counterintuitive phenomenon is that style transfer using styles derived from the same dataset as the target content cannot completely remove domain-specific content, as the distributions of style and content overlap, whereas medically-irrelevant style transfer removes more domain-specific components by introducing completely irrelevant texture and color content into the target content images. Tobin et al. [27] showed that an object detection model that generalizes to real-world images can be trained using unrealistic simulated images with a diverse set of random textures, rather than by making the simulated images as realistic as possible. As in the human learning process, learning class-specific and domain-irrelevant patterns from data is essential for deep learning models, and the style transfer technique can be a powerful tool for controlling the representations that models learn.

Although data augmentation is widely used when training deep learning models for medical imaging tasks, its potential has not yet been fully studied and remains an active area of research. Moreover, the optimal configuration of data augmentation methods may vary among applications. As our study suggests, data augmentation can be a simple yet powerful tool for learning domain-agnostic representations.
Further research is warranted to identify optimal data augmentation techniques for a variety of medical imaging tasks, and medically-irrelevant transformations such as the proposed STRAP method should be considered alongside established methods.

STRAP achieved higher performance in the out-of-distribution setting than in the in-distribution setting, whereas the opposite was observed for the two baseline approaches and the state-of-the-art models. This is counterintuitive, considering the widely known problem that deep learning models generalize poorly to unseen out-of-distribution datasets. In the experiment in section 2.8, the training data in the in-distribution setting came from a mixed-domain heterogeneous dataset, whereas the training images used for the out-of-distribution setting derived from a single-source homogeneous dataset. Although it is often said that diverse multi-institutional datasets are needed for training models that generalize well on unseen data [28], our study suggests that well-curated homogeneous datasets may provide value in training domain-agnostic models, provided the model has sufficient capability to learn domain-invariant and class-specific representations, similar to the way in which humans learn from a set of representative examples (e.g., content presented in textbooks).

Besides supervised learning, our approach may be applicable to self-supervised learning. Contrastive learning frameworks, such as SimCLR [29] and MoCo [30], learn representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss, thus relying heavily on a stochastic data augmentation module. Chen et al. [29] showed that the composition of data augmentation operations is crucial in yielding effective representations, and that unsupervised contrastive learning benefits from strong data augmentation. In medical imaging, contrastive learning may require a tailored composition of data augmentation operations, and STRAP has the potential to serve as one of its core transformation operations.

One limitation of this study is that we only tested our approach on one particular classification task in the field of computational pathology. Further studies are warranted to investigate whether our approach proves efficacious and robust 1) for other classification tasks, 2) for non-classification tasks such as detection, segmentation, and survival prediction, and 3) in other medical imaging domains, such as radiology, ophthalmology, and dermatology.

In conclusion, we have introduced STRAP, a form of data augmentation based on random style transfer with artistic paintings, for learning domain-agnostic visual representations in computational pathology. Our experiments demonstrated that our approach yields significant improvements in test performance on a specific classification task, particularly in the presence of domain shift. Our study provides evidence that 1) CNNs are reliant on low-level texture content and are therefore vulnerable to domain shifts, and that 2) STRAP can be a practical tool for mitigating that reliance and, therefore, a possible solution for learning domain-agnostic representations in computational pathology.

Acknowledgements
This work was funded by the Stanford Departments of Biomedical Data Science and Pathology, through a Stanford Clinical Data Science Fellowship to RY. We would like to thank Blaine Burton Rister for detailed and valuable feedback on the manuscript. We would also like to thank Nandita Bhaskhar, Khaled Kamal Saab, and Jared Dunnmon for their helpful discussions.
References

[1] Joaquin Quiñonero-Candela, Masashi Sugiyama, Neil D Lawrence, and Anton Schwaighofer. Dataset Shift in Machine Learning. MIT Press, 2009.
[2] Xingchao Peng, Zijun Huang, Ximeng Sun, and Kate Saenko. Domain agnostic learning with disentangled representations. In ICML, 2019.
[3] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
[4] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, June 2016.
[5] Philip T G Jackson, Amir Atapour Abarghouei, Stephen Bonner, Toby P Breckon, and Boguslaw Obara. Style augmentation: data augmentation via style randomization. In CVPR Workshops, pages 83–92, 2019.
[6] Marc Macenko, Marc Niethammer, J S Marron, David Borland, John T Woosley, Xiaojun Guan, Charles Schmitt, and Nancy E Thomas. A method for normalizing histology slides for quantitative analysis. In IEEE International Symposium on Biomedical Imaging, pages 1107–1110, June 2009.
[7] David Tellez, Geert Litjens, Péter Bándi, Wouter Bulten, John-Melle Bokhorst, Francesco Ciompi, and Jeroen van der Laak. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Med. Image Anal., 58:101544, December 2019.
[8] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
[9] Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P Xing. High-frequency component helps explain the generalization of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8684–8694, 2020.
[10] Bhuvanesh Awasthi, Jason Friedman, and Mark A Williams. Faster, stronger, lateralized: low spatial frequency information supports face processing. Neuropsychologia, 49(13):3583–3590, November 2011.
[11] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[12] Rikiya Yamashita, Jin Long, Teri Longacre, Lan Peng, Gerald Berry, Brock Martin, John Higgins, Daniel L Rubin, and Jeanne Shen. Deep learning model for the prediction of microsatellite instability in colorectal cancer: a diagnostic study. Lancet Oncol., 22(1):132–141, January 2021.
[13] Jakob Nikolas Kather. Histological images for MSI vs. MSS classification in gastrointestinal cancer, FFPE samples [data set]. Zenodo, http://doi.org/10.5281/zenodo.2530835, 2019.
[14] The Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature, 487(7407):330–337, July 2012.
[15] Amelie Echle, Heike Irmgard Grabsch, Philip Quirke, Piet A van den Brandt, Nicholas P West, Gordon G A Hutchins, Lara R Heij, Xiuxiang Tan, Susan D Richman, Jeremias Krause, Elizabeth Alwers, Josien Jenniskens, Kelly Offermans, Richard Gray, Hermann Brenner, Jenny Chang-Claude, Christian Trautwein, Alexander T Pearson, Peter Boor, Tom Luedde, Nadine Therese Gaisa, Michael Hoffmeister, and Jakob Nikolas Kather. Clinical-grade detection of microsatellite instability in colorectal tumors by deep learning. Gastroenterology, 159(4):1406–1416.e11, October 2020.
[16] Jakob Nikolas Kather, Alexander T Pearson, Niels Halama, Dirk Jäger, Jeremias Krause, Sven H Loosen, Alexander Marx, Peter Boor, Frank Tacke, Ulf Peter Neumann, Heike I Grabsch, Takaki Yoshikawa, Hermann Brenner, Jenny Chang-Claude, Michael Hoffmeister, Christian Trautwein, and Tom Luedde. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med., 25(7):1054–1056, June 2019.
[17] Daisuke Komura and Shumpei Ishikawa. Machine learning methods for histopathological image analysis. Comput. Struct. Biotechnol. J., 16:34–42, February 2018.
[18] M Sandler, A Howard, M Zhu, A Zhmoginov, and L Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, June 2018.
[19] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis., 115(3):211–252, December 2015.
[20] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Netw., 12(1):145–151, January 1999.
[21] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In D Lee, M Sugiyama, U Luxburg, I Guyon, and R Garnett, editors, Advances in Neural Information Processing Systems, volume 29, pages 3630–3638. Curran Associates, Inc., 2016.
[22] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol., 57(1):289–300, 1995.
[23] Pietro Antonio Cicalese, Aryan Mobiny, Pengyu Yuan, Jan Becker, Chandra Mohan, and Hien Van Nguyen. StyPath: Style-transfer data augmentation for robust histology image classification. arXiv preprint arXiv:2007.05008, 2020.
[24] Seo Jeong Shin, Seng Chan You, Hokyun Jeon, Ji Won Jung, Min Ho An, Rae Woong Park, and Jin Roh. Style transfer strategy for developing a generalizable deep learning application in digital pathology. Comput. Methods Programs Biomed., 198:105815, January 2021.
[25] Agnieszka Mikołajczyk and Michał Grochowski. Style transfer-based image synthesis as an efficient regularization technique in deep learning. Pages 42–47, 2019.
[26] Tamás Nyíri and Attila Kiss. Style transfer for dermatological data augmentation. In Intelligent Systems and Applications, pages 915–923. Springer International Publishing, 2020.
[27] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. arXiv preprint arXiv:1703.06907, 2017.
[28] Christopher J Kelly, Alan Karthikesalingam, Mustafa Suleyman, Greg Corrado, and Dominic King. Key challenges for delivering clinical impact with artificial intelligence. BMC Med., 17(1):195, October 2019.
[29] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[30] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.