Learning domain-agnostic visual representation for computational pathology using medically-irrelevant style transfer augmentation
Rikiya Yamashita, Jin Long, Snikitha Banda, Jeanne Shen, Daniel L. Rubin
Stanford University
{rikiya, jinlong, jeannes, rubin}@stanford.edu

Abstract
Suboptimal generalization of machine learning models on unseen data is a key challenge which hampers the clinical applicability of such models to medical imaging. Although various methods such as domain adaptation and domain generalization have evolved to combat this challenge, learning robust and generalizable representations is core to medical image understanding, and continues to be a problem. Here, we propose
STRAP (Style TRansfer Augmentation for histoPathology), a form of data augmentation based on random style transfer from artistic paintings, for learning domain-agnostic visual representations in computational pathology. Style transfer replaces the low-level texture content of images with the uninformative style of randomly selected artistic paintings, while preserving high-level semantic content. This improves robustness to domain shift and can be used as a simple yet powerful tool for learning domain-agnostic representations. We demonstrate that STRAP leads to state-of-the-art performance, particularly in the presence of domain shifts, on a particular classification task of predicting microsatellite status in colorectal cancer using digitized histopathology images.

1 Introduction

While deep learning has demonstrated remarkable performance on medical imaging tasks over the past few years, the performance drop usually observed when generalizing from internal to external test data remains a key challenge in the medical application of machine learning models. Supervised learning assumes that training and testing data are sampled from the same distribution, i.e., in-distribution, while in practice, the training and testing data typically originate from related domains which follow different distributions, i.e., out-of-distribution. This phenomenon, known as domain shift [1], hampers the clinical applicability of such models, especially when the annotated datasets are limited in size or the target domain is highly heterogeneous.

One approach to tackling this domain shift problem is domain adaptation, which learns to align the feature distribution of the source domain with that of the target domain in a domain-invariant feature space. However, domain adaptation typically requires access to at least a few data samples from the target domain, i.e., testing data, during training, which is not always available for medical applications. Another approach is domain generalization, which aims to adapt from multiple labeled source domains to an unseen target domain without needing to access data samples from the target domain. One limitation of these approaches is that they assume the target data are homogeneously sampled from the same distribution, an unrealistic scenario in most real-world medical applications, where models must deal with mixed-domain data (e.g., different scanners, protocols, and medical sites) without domain labels. In the present study, we address the challenging yet practical problem of knowledge transfer from one labeled source domain to multiple target domains, a task referred to as domain-agnostic learning by Peng et al. [2].
Correspondence to [email protected]
Code: https://github.com/rikiyay/style-transfer-for-digital-pathology
Figure 1: Overview of STRAP.

Learning domain-agnostic representation is essential for the development of clinically applicable machine learning models, and a solution to domain-agnostic learning should learn domain-invariant and class-specific visual representations, as humans do.

Geirhos et al. [3] showed that 1) convolutional neural networks (CNNs) trained on the ImageNet dataset are biased towards texture, whereas humans are more reliant on global shape for distinguishing classes; 2) CNNs tend not to cope well with domain shifts, i.e., changes in image statistics from those on which the networks were trained to those which the networks have never seen before; and 3) increasing shape bias by training on a stylized version of ImageNet generated using style transfer improves accuracy, robustness, and generalizability.

Neural style transfer [4] refers to a CNN-based image transformation algorithm that manipulates the low-level texture representation of an image, i.e., its style, while preserving its semantic content. This method uses Gram matrices of the activations from different layers of a CNN to represent the style of an image. It then uses an iterative optimization method to generate a new image from white noise by matching its activations with those of the content image and its Gram matrices with those of the style image. Jackson et al. [5] demonstrated that, in computer vision tasks on natural images, data augmentation via style transfer with randomly selected artistic paintings as a style source improves robustness to domain shift, and can be used as a simple, domain-agnostic alternative to domain adaptation.

In medical imaging, machine learning models often suffer from domain shift in test data caused by heterogeneity from various sources, such as scanners, protocols, and medical sites. Human experts, such as radiologists and pathologists, are able to learn domain-agnostic visual representations and thus generalize well across domains, particularly in the presence of domain shifts. We postulate that 1) human experts in medical imaging are also biased towards shape rather than texture, as Geirhos et al. demonstrated [3], and 2) the low-level texture content of an image tends to be domain-specific, leading to suboptimal performance of deep learning models on domain-shifted unseen data, whereas high-level semantic content is more domain-invariant, and ubiquitous class-specific visual representations can be learned from it.
Here, we propose STRAP (Style TRansfer Augmentation for histoPathology), a form of data augmentation based on random style transfer with artistic paintings, as a solution for learning domain-agnostic visual representations, particularly in computational pathology (Figure 1). In this study, the term "domain" refers to scanners, stain and scan protocols, and, more broadly, medical sites. To investigate the potential of STRAP for learning domain-agnostic representations, we focused on the particular task of classifying colorectal cancer into two distinct genetic subtypes based on microsatellite status, microsatellite stable (MSS) and unstable (MSI), using image tiles generated from hematoxylin and eosin (H&E)-stained, formalin-fixed, paraffin-embedded (FFPE) whole-slide images (WSIs) of surgically resected colorectal cancers. We compare STRAP against two standard baseline methods widely used in computational pathology, i.e., stain normalization [6] and stain augmentation [7], which apply medically-relevant transformations to the source images, whereas STRAP performs a medically-irrelevant transformation (Figure 2).

Figure 2: Style transfer with artistic paintings as a style source (stylization coefficient of 1.0) applied to a histopathology image (content on the left). Overall geometry is preserved, but the style, including texture, color, and contrast, is replaced with an uninformative style of a randomly selected artistic painting.

Models were trained on a homogeneous single-source dataset and tested on a heterogeneous mixed-domain dataset. To gain insights into the differences in learning dynamics among the three methods, we performed two assessments: 1) we evaluated differential responses to the low-frequency components of the test data, and 2) we visualized saliency maps on the low-frequency components using integrated gradients [8]. These experiments were inspired by Wang et al. [9], who showed that 1) CNNs can exploit high-frequency image components which humans do not consciously perceive, and 2) models which exploit low-frequency components generalize better than those which exploit the high-frequency spectrum. We also assessed the effects of stylization coefficients and different style sources on STRAP model performance.

Our contributions are summarized as follows: 1) we present STRAP, a form of medically-irrelevant data augmentation based on random style transfer for computational pathology; 2) we utilize STRAP to improve downstream model performance and out-of-distribution generalizability on a heterogeneous mixed-domain test dataset for a classification task in computational pathology; and 3) our experiments suggest that STRAP helps models exploit low-frequency components of the data, on which humans tend to rely in recognizing objects [10].

2 Methods

2.1 Style transfer augmentation

Inspired by Geirhos et al. [3] and Jackson et al. [5], we propose STRAP, a form of medically-irrelevant data augmentation based on random style transfer for computational pathology, which replaces the style of a histopathology image (including texture, color, and contrast) with the uninformative style of a randomly selected non-medical image, while predominantly preserving global object shapes. We hypothesize that the style of histopathology images is domain-specific and class-irrelevant, whereas global object shape is domain-irrelevant and class-specific; therefore, STRAP facilitates learning domain-agnostic representations. We constructed a stylized version of the training dataset by applying AdaIN style transfer [11] following the method proposed in [3].
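For reference, the two operations can be written compactly; these are the standard formulations from [4] and [11], with \(\alpha\) denoting the stylization coefficient. Gatys et al. [4] represent style via layer-wise Gram matrices and optimize a generated image \(\hat{x}\) to match the content activations of the content image \(c\) and the Gram matrices of the style image \(s\):

\[
G^{l}_{ij}(x) = \sum_{k} F^{l}_{ik}(x)\,F^{l}_{jk}(x),
\qquad
\mathcal{L}_{\mathrm{style}}(\hat{x}, s) = \sum_{l} w_{l}\,\bigl\lVert G^{l}(\hat{x}) - G^{l}(s) \bigr\rVert_{F}^{2},
\]

where \(F^{l}(x)\) contains the vectorized feature maps of image \(x\) at layer \(l\). AdaIN [11] replaces this iterative optimization with a single feed-forward pass through an encoder \(f\) and a decoder \(g\):

\[
\operatorname{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y),
\qquad
T(c, s, \alpha) = g\bigl((1 - \alpha)\,f(c) + \alpha\,\operatorname{AdaIN}(f(c), f(s))\bigr),
\]

where \(\mu\) and \(\sigma\) are the channel-wise mean and standard deviation computed over spatial locations; \(\alpha = 1\) fully replaces the channel statistics (i.e., the style) of the content features.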
We prepared the stylized datasets in advance, since on-the-fly random style transfer data augmentation is computationally expensive (a minimal sketch of this offline pipeline appears after the dataset descriptions below). Each histopathology image was stylized with the style of a randomly selected artistic painting through AdaIN with a stylization coefficient (alpha) of 1.0. We used Kaggle's Painter by Numbers dataset of artistic paintings as a style source. We resized the content histopathology images and the style painting images to fixed sizes chosen to maintain geometric features during stylization. Examples of the stylized histopathology images are shown in Figure 2. We trained the STRAP model solely on the stylized version of the training dataset.

2.2 Datasets

We used three datasets for our analysis: Stanford-CRC [12], CRC-DX-TRAIN, and CRC-DX-TEST [13] (CRC stands for colorectal cancer). These datasets consist of image patches (tiles) of 224 × 224 pixels at a resolution of 0.5 µm/pixel, generated from the WSIs and subsequently stain normalized with Macenko's method [6].

The Stanford-CRC dataset originates from a single institution (i.e., it is a homogeneous single-source dataset) and contains image tiles from MSS and MSI H&E-stained FFPE WSIs from a cohort of unique patients. The WSIs were originally scanned at 40× base magnification (0.25 µm/pixel). This single-institutional dataset has an equal class distribution of MSS and MSI patients, and was used for model development in the out-of-distribution experiment described in section 2.3. To train the STRAP model on Stanford-CRC, we constructed a stylized version, termed Stylized-Stanford-CRC, by applying the style transfer method described in section 2.1, using artistic paintings as a style source.

The CRC-DX-TRAIN dataset stems from the TCGA-COAD and TCGA-READ diagnostic slide collections of The Cancer Genome Atlas (TCGA) [14], consisting of data from multiple institutions with various scanners and protocols (i.e., a heterogeneous mixed-domain dataset), and contains image tiles from MSS and MSI H&E-stained FFPE WSIs from a cohort of unique patients. The WSIs were scanned at either 20× or 40× base magnification (0.5 or 0.25 µm/pixel). This multi-institutional dataset was also balanced in class distribution, and was used for model development in the in-distribution analysis described in section 2.8. To train the STRAP model on CRC-DX-TRAIN, we constructed a stylized version of CRC-DX-TRAIN by applying the same method used to generate Stylized-Stanford-CRC, using artistic paintings as a style source.

The CRC-DX-TEST dataset stems from the same diagnostic slide collections of TCGA as the CRC-DX-TRAIN dataset (i.e., data from multiple institutions with various scanners and protocols) and contains image tiles from MSS and MSI H&E-stained FFPE WSIs from a cohort of unique patients. The WSIs were scanned at either 20× or 40× base magnification (0.5 or 0.25 µm/pixel). This multi-institutional dataset maintains a class imbalance reflecting the real-world prevalence of MSI in colorectal cancer, and was used solely for assessing model performance and generalizability.
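The offline stylization described in section 2.1 might look like the following minimal sketch. The directory paths and the `stylize` helper (standing in for a pretrained AdaIN encoder-decoder as in [11]) are hypothetical, not the authors' released code; image resizing is folded into the helper for brevity.

```python
import random
from pathlib import Path
from PIL import Image

from adain import stylize  # hypothetical AdaIN wrapper (pretrained encoder-decoder)

ALPHA = 1.0  # stylization coefficient used for the main STRAP model

content_tiles = sorted(Path("stanford_crc/tiles").glob("*.png"))  # hypothetical path
paintings = sorted(Path("painter_by_numbers").glob("*.jpg"))      # hypothetical path
out_dir = Path("stylized_stanford_crc")
out_dir.mkdir(exist_ok=True)

for tile_path in content_tiles:
    content = Image.open(tile_path).convert("RGB")
    # Pair each tile with the style of a randomly selected artistic painting.
    style = Image.open(random.choice(paintings)).convert("RGB")
    stylized = stylize(content, style, alpha=ALPHA)  # AdaIN feature blend + decode
    stylized.save(out_dir / tile_path.name)
```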
2.3 Out-of-distribution experiment

As our main analysis to assess STRAP's generalizability, we performed an out-of-distribution experiment, in which models were trained on Stanford-CRC and tested on the out-of-distribution CRC-DX-TEST dataset. We compared STRAP against two standard baseline approaches: stain augmentation (SA) and stain normalization (SN). The STRAP model was trained on Stylized-Stanford-CRC alone, whereas the SA model was trained on the non-stylized Stanford-CRC with on-the-fly stain augmentation following the method described by Tellez et al. [7] (Figure 3), and the SN model was trained on the non-stylized Stanford-CRC after stain normalization with Macenko's method [6]. Note that all image tiles used for the STRAP and SA methods were also stain normalized in advance using the same method as SN; the STRAP and SA approaches therefore build on the SN approach. Stain normalization is widely used in computational pathology to account for variations in H&E staining [15, 16, 12, 17]. On the other hand, Tellez et al. [7] demonstrated that stain augmentation improved classification performance compared to stain normalization, by increasing the CNN's ability to generalize to unseen stain variations.

Figure 3: Stain augmentation applied to a histopathology image (original on the left).
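Stain augmentation in the manner of Tellez et al. [7] is commonly implemented by jittering the haematoxylin, eosin, and DAB channels after color deconvolution. A minimal sketch using scikit-image follows; the jitter ranges are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def hed_stain_augment(rgb, sigma=0.05, bias=0.05, rng=None):
    """Randomly perturb each stain channel with a multiplicative
    factor and an additive bias (illustrative ranges)."""
    rng = np.random.default_rng() if rng is None else rng
    hed = rgb2hed(rgb)                       # deconvolve RGB into H, E, DAB stains
    alpha = rng.uniform(1 - sigma, 1 + sigma, size=3)
    beta = rng.uniform(-bias, bias, size=3)
    hed = hed * alpha + beta                 # channel-wise jitter
    return np.clip(hed2rgb(hed), 0.0, 1.0)   # recompose and clamp to valid RGB
```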
2.4 Model development and evaluation

All models were trained using the same method. We trained a MobileNetV2 [18] model pretrained on ImageNet [19] via transfer learning, using stochastic gradient descent with momentum [20] with a fixed learning rate, for a fixed number of epochs with early stopping (patience of five). We used a binary cross-entropy loss. All input images were resized to 224 × 224 pixels before being fed into the network. Random horizontal and vertical flipping and random resized cropping were applied as common data augmentation methods. Tile-wise model outputs were aggregated into a patient-wise score by taking their average. We applied 4-fold cross-validation to account for the selection bias introduced by randomness in data splitting, given the relatively limited sample size of Stanford-CRC. The metric of interest was the area under the receiver-operating-characteristic curve (AUROC). In each fold of the cross-validation, model performance was evaluated on the corresponding test subset of Stanford-CRC and on the entire CRC-DX-TEST dataset. The average AUROC and its standard deviation were computed subsequently. Of note, the performance of the STRAP model was assessed on the original (i.e., non-stylized) Stanford-CRC images for a fair comparison.

2.5 Evaluation on low-frequency components

To gain insights into which frequency components the three models (i.e., STRAP, SA, and SN) exploit for learning representations, we tested model performance on the low-frequency components of the CRC-DX-TEST dataset. We constructed the LF-CRC-DX-TEST dataset by following the method described in [9], in which all image tiles in the CRC-DX-TEST dataset were decomposed into low- and high-frequency components by applying the fast Fourier transform (FFT). Low-frequency components were obtained from the centralized frequency spectrum by applying circular low-pass filters with various radii; all frequencies outside the circle were set to zero, and the inverse FFT was applied subsequently (Figure 4). We also visualized saliency maps on the LF-CRC-DX-TEST using integrated gradients attributions [8] to highlight which pixels of an input image contribute most to model inference.
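For the saliency analysis, the sketch below uses Captum's implementation of integrated gradients; the MobileNetV2 weights and the random input tile are stand-ins, as the paper does not specify its attribution code.

```python
import torch
from torchvision import models
from captum.attr import IntegratedGradients

# Stand-ins: an ImageNet-pretrained MobileNetV2 and a random 224x224 "tile".
model = models.mobilenet_v2(weights="IMAGENET1K_V1").eval()
x = torch.rand(1, 3, 224, 224)

# Integrated gradients along the path from a black-image baseline to the input.
ig = IntegratedGradients(model)
attributions = ig.attribute(x, baselines=torch.zeros_like(x), target=0)

# Collapse channels to a per-pixel saliency map for visualization.
saliency = attributions.abs().sum(dim=1).squeeze(0)
```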
Figure 4: A schema for generating the low-frequency components of an image. Image tiles (224 × 224 pixels) are decomposed into low- and high-frequency components by applying the fast Fourier transform (FFT). Low-frequency components are extracted from the centralized frequency spectrum by applying circular low-pass filters with various radii (multiples of 14, from 14 to 154 pixels). All frequencies outside the circle are set to zero, and the inverse FFT is applied subsequently. Of note, the high-frequency components were not used in this study.
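A minimal NumPy sketch of the low-pass filtering step shown in Figure 4, assuming a float RGB image; the radii follow the figure:

```python
import numpy as np

def low_frequency_component(img, radius):
    """Zero out all frequencies farther than `radius` from the center of the
    shifted spectrum, then invert back to image space (per channel)."""
    h, w = img.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    out = np.empty_like(img, dtype=np.float64)
    for c in range(img.shape[2]):
        spec = np.fft.fftshift(np.fft.fft2(img[..., c]))  # centered spectrum
        out[..., c] = np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))
    return out

# Low-pass filter radii used to build LF-CRC-DX-TEST (per Figure 4): 14, 28, ..., 154.
radii = [i * 14 for i in range(1, 12)]
```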
2.6 Effect of different stylization coefficients

To test the effect of different stylization coefficients, we generated Stylized-Stanford-CRC with three different stylization coefficients and compared the corresponding STRAP model performance on the CRC-DX-TEST dataset.

2.7 Effect of different style sources

To evaluate the effect of different style sources, we created Stylized-Stanford-CRC with three distinct datasets as the style source: 1) the artistic paintings described in section 2.1, hereafter referred to as the Artistic Paintings style source; 2) the miniImageNet dataset proposed by Vinyals et al. [21] for few-shot learning, consisting of 60,000 color images from ImageNet spanning 100 classes with 600 examples each, hereafter referred to as the Natural Imaging style source; and 3) the original Stanford-CRC dataset without stain normalization (to preserve the original variability in staining), hereafter referred to as the Histopathologic Imaging style source. The former two apply medically-irrelevant transformations (Figures 2, 5), whereas the latter applies a medically-relevant transformation (Figure 6).

Figure 5: Style transfer with the Natural Imaging style source applied to a histopathology image (content on the left). Overall geometry is preserved, but the style, including texture, color, and contrast, is replaced with the uninformative style of a randomly selected natural image. The outputs are medically irrelevant and resemble the outputs using the Artistic Paintings style source.

Figure 6: Style transfer with randomly selected histopathologic images from the non-stain-normalized version of the Stanford-CRC dataset, applied to a histopathology image (content on the left). The outputs are medically relevant and resemble the outputs obtained with stain augmentation (Figure 3).

2.8 Comparison with state-of-the-art models

We further compared the STRAP model against two state-of-the-art models, Kather et al. [16] and Yamashita et al. [12], in the out-of-distribution scenario described in section 2.3, where models were trained on Stanford-CRC and tested on CRC-DX-TEST. Both models were trained on the non-stylized version of the dataset and used the same stain normalization technique as our SN approach. Thus, they can be considered variations of the SN approach, although there are some differences in model architecture, training protocol, and data augmentation configuration. For example, Kather et al. used a ResNet18 architecture and applied horizontal and vertical flips and random translation along the x and y axes for data augmentation. Similarly, Yamashita et al. used a MobileNetV2 architecture and applied data augmentation with random horizontal flips, random rotations, and random color jitter. Model performance for Kather et al. [16] and Yamashita et al. [12] was either computed using the code available at https://github.com/jnkather/MSIfromHE and https://github.com/rikiyay/MSINet, respectively, or obtained from the literature. In addition to the experiment involving the out-of-distribution scenario, we also conducted an in-distribution experiment, a replication of Kather et al. [16], in which models were trained on CRC-DX-TRAIN and tested on CRC-DX-TEST, both of which were sampled from the same mixed-domain data distribution. To train the STRAP model on CRC-DX-TRAIN, we constructed a stylized version of CRC-DX-TRAIN by applying the same method used to generate Stylized-Stanford-CRC with the Artistic Paintings style source. All of the other models, i.e., the SA and SN models, as well as Kather et al. and Yamashita et al., were trained on the non-stylized version of CRC-DX-TRAIN.
Of note, the CRC-DX-TRAIN and CRC-DX-TEST datasets were pre-split from the same diagnostic slide collections of TCGA by Kather [13]; therefore, cross-validation was not applied to this in-distribution experiment, and the same splits were used for comparison.

2.9 Statistical analysis

We assessed model performance on microsatellite status prediction using the AUROC, with confidence intervals (CI) calculated using bootstrapping with the percentile method. Statistical comparisons were performed using a paired t-test for average AUROCs derived from cross-validation, and using a bootstrapping test for individual AUROCs. There were four pairwise comparisons for the experiment described in section 2.8, where p-values were adjusted using the Benjamini-Hochberg method [22] to account for multiple comparisons by controlling the false discovery rate at 0.05. Otherwise, a two-tailed alpha criterion of 0.05 was used for statistical significance.

3 Results

The STRAP model achieved the highest average AUROC on the out-of-distribution, mixed-domain CRC-DX-TEST dataset upon cross-validation, outperforming the SA and SN models (Table 1). STRAP also demonstrated a minimal, even negative, performance drop from in-distribution to out-of-distribution testing (see column Delta in Table 1). These results suggest that the STRAP model can learn more discriminative and generalizable visual representations, compared to the other two models. Between SA and SN, SA showed higher model performance and a smaller performance drop.

| Method | Stanford-CRC → Stanford-CRC (ID) AUROC† | p-value (vs STRAP) | Stanford-CRC → CRC-DX-TEST (OOD) AUROC† | p-value (vs STRAP) | Delta‡ (ID − OOD) |
| --- | --- | --- | --- | --- | --- |
| STRAP |  | – |  | – | − |
| SA | 0.826 (0.139) | 0.439 | 0.814 (0.020) | 0.001* | 0.012 |
| SN | 0.810 (0.153) | 0.406 | 0.765 (0.031) | 0.002* | 0.045 |
Table 1: Comparison of style transfer augmentation (STRAP) with stain augmentation (SA) and stain normalization (SN).
Arrows indicate train data → test data; i.e., Stanford-CRC → CRC-DX-TEST means training on Stanford-CRC and testing on CRC-DX-TEST. * indicates a significant difference. † represents the average AUROC of models obtained via cross-validation, with standard deviation in parentheses. ‡ represents the average performance drop from ID testing (Stanford-CRC → Stanford-CRC) to OOD testing (Stanford-CRC → CRC-DX-TEST). A stylization coefficient (alpha) of 1.0 was used for the STRAP model. Abbreviations: AUROC, area under the receiver-operating-characteristic curve; CV, cross-validation; ID, in-distribution; OOD, out-of-distribution; SA, stain augmentation; SN, stain normalization; STRAP, style transfer augmentation.

We evaluated the STRAP, SA, and SN models on the LF-CRC-DX-TEST dataset with a wide range of low-pass filter sizes. As shown in Figure 7, the STRAP model reached its peak performance at a smaller low-pass filter radius than the other two models. These results suggest that the STRAP model can exploit lower-frequency components for learning representations, whereas the other two models rely on relatively higher-frequency components. This may explain the superior classification performance and generalizability of the STRAP model compared to the other two.

Figure 7: Results of the experiments using the low-frequency components of the CRC-DX-TEST dataset (LF-CRC-DX-TEST). The x-axis represents the radii of the low-pass filters used to generate the LF-CRC-DX-TEST dataset, and the y-axis shows the area under the receiver-operating-characteristic curve (AUROC). Each dot marker represents the corresponding peak performance.

Figure 8: Pixel-wise integrated gradients attributions of the low-frequency components (generated with a fixed low-pass filter radius) of the CRC-DX-TEST dataset (LF-CRC-DX-TEST), visualized as saliency maps for the STRAP, SA, and SN models.

Saliency maps with integrated gradients show that the STRAP model presented high attributions at specific areas/nuclei and less diffusely distributed attributions, whereas the SA and SN models showed more broadly distributed attributions that might correspond to the low-level texture content of the images (Figure 8).

We also tested the effect of the stylization coefficient on STRAP model performance. Among the three stylization coefficients tested, the largest coefficient (an alpha of 1.0) yielded the highest model performance, on average, over models obtained via cross-validation (Table 2). This suggests that the STRAP model learns more discriminative and generalizable representations when more low-level content is removed from the image by the style transfer operation.

| Stylization coefficient | Stanford-CRC → CRC-DX-TEST AUROC† | p-value (vs SC 1.0) |
| --- | --- | --- |
| SC 1.0 |  | – |
| SC |  |  |
| SC |  |  |

Table 2: Effect of stylization coefficient on STRAP model performance.
Arrow indicates train data → test data; i.e., Stanford-CRC → CRC-DX-TEST means training on Stanford-CRC and testing on CRC-DX-TEST. * indicates a significant difference. † represents the average AUROC of models obtained via cross-validation, with standard deviation in parentheses. Abbreviations: AUROC, area under the receiver-operating-characteristic curve; CV, cross-validation; SC, stylization coefficient.
Among the Artistic Paintings, Natural Imaging, and Histopathologic Imaging style sources, the Artistic Paintings and Natural Imaging style sources achieved superior performance for microsatellite status classification, compared to style transfer using histopathologic images as the style source. The difference was statistically significant for the Artistic Paintings style source, whereas there was no statistically significant difference between the Natural Imaging and Histopathologic Imaging style sources (Table 3).

| Style source | Stanford-CRC → CRC-DX-TEST AUROC† | p-value (vs Histopathologic Imaging) |
| --- | --- | --- |
| Artistic Paintings |  |  |
| Natural Imaging |  |  |
| Histopathologic Imaging |  | – |
Table 3: Effect of different style sources on STRAP model performance.
Arrow indicates train data → test data; i.e., Stanford-CRC → CRC-DX-TEST means training on Stanford-CRC and testing on CRC-DX-TEST. * indicates a significant difference. † represents the average AUROC of models obtained via cross-validation, with standard deviation in parentheses. A stylization coefficient (alpha) of 1.0 was used for the STRAP model. Abbreviations: AUROC, area under the receiver-operating-characteristic curve; CV, cross-validation.

The STRAP model outperformed the two state-of-the-art models, as well as the SA and SN models, in both the in-distribution and out-of-distribution scenarios (Table 4), which suggests that the STRAP model learns more domain-irrelevant and class-specific representations than the other approaches. The STRAP model achieved a negative performance drop from the in-distribution to the out-of-distribution scenario, whereas all three SN-based methods (SN, Kather et al., and Yamashita et al.) showed a positive performance drop. The performance drop for the SA model was relatively close to zero. These results suggest that the STRAP model has the potential to learn more robust visual representations when trained on a well-curated homogeneous dataset than when trained on a mixed-domain heterogeneous dataset. On the other hand, the SN-based models may exploit domain-specific features to some extent, leading to the larger performance drop from the in-distribution to the out-of-distribution setting.
| Method | CRC-DX-TRAIN → CRC-DX-TEST (ID) AUROC† | p-value (vs STRAP) | Stanford-CRC → CRC-DX-TEST (OOD) AUROC‡ | p-value (vs STRAP) | Delta§ (ID − OOD) |
| --- | --- | --- | --- | --- | --- |
| STRAP |  | – |  | – | − |
| SA | 0.816 [0.709, 0.917] | 0.471 | 0.814 (0.020) | 0.002* | 0.002 |
| SN | 0.794 [0.684, 0.892] | 0.456 | 0.765 (0.031) | 0.003* | 0.029 |
| Kather et al. | 0.759 [0.632, 0.873] | 0.219 | 0.742 (0.013) | 0.001* | 0.018 |
| Yamashita et al. | 0.816 [0.712, 0.914] | 0.456 | 0.786 (0.020) | 0.010* | 0.030 |
Table 4: Comparison of style transfer augmentation (STRAP), stain augmentation (SA), and stain normalization (SN) against state-of-the-art models in in-distribution and out-of-distribution scenarios.
Arrows indicate train data → test data; e.g., CRC-DX-TRAIN → CRC-DX-TEST means training on CRC-DX-TRAIN and testing on CRC-DX-TEST. * indicates a significant difference. † represents AUROC, with the CI in square brackets. ‡ represents the average AUROC of models obtained via cross-validation, with standard deviation in parentheses. § indicates the average performance drop from the in-distribution (CRC-DX-TRAIN → CRC-DX-TEST) to the out-of-distribution (Stanford-CRC → CRC-DX-TEST) scenario. A stylization coefficient (alpha) of 1.0 was used for the STRAP model. P-values were adjusted using the Benjamini-Hochberg method [22]. Abbreviations: AUROC, area under the receiver-operating-characteristic curve; CV, cross-validation; ID, in-distribution; OOD, out-of-distribution; SA, stain augmentation; SN, stain normalization; STRAP, style transfer augmentation.
4 Discussion

We present STRAP (Style TRansfer Augmentation for histoPathology), which achieved superior performance and generalizability compared with two standard baselines (stain augmentation (SA) and stain normalization (SN)), as well as two state-of-the-art models, on the particular classification task of predicting microsatellite status in colorectal cancer using digitized histopathology images.

We speculate that STRAP helps models learn domain-agnostic and class-specific visual representations by removing the original texture and/or high-frequency components from the histopathology images, which are domain-specific and class-irrelevant, while predominantly preserving shape-biased and/or low-frequency content, which is domain-irrelevant and class-specific. In fact, more intensive style transfer with a higher stylization coefficient resulted in superior performance. Furthermore, our experiments on the low-frequency components demonstrated that the STRAP approach helps models exploit lower-frequency components, in contrast to the standard SA and SN approaches, which rely on relatively higher-frequency components. This speculation is also consistent with the hypotheses proposed by Geirhos et al. [3] and Wang et al. [9]: that shape-biased and/or low-frequency features are essential for deep learning models to learn robust and generalizable visual representations.

To the best of our knowledge, no previous study has applied medically-irrelevant image manipulation to develop deep learning models for medical imaging. Four previous studies have applied the style transfer technique to medical imaging tasks in computational pathology [23, 24] and skin lesion classification [25, 26]. However, these studies employed medically-relevant transformations in order to combat data scarcity, class imbalance, and stain variation. Our study demonstrates that a medically-irrelevant transformation, i.e., STRAP with artistic paintings or natural images, can result in superior performance and generalizability, compared with medically-relevant transformations, i.e., style transfer with histopathology images and stain augmentation. A possible explanation for this counterintuitive phenomenon is that style transfer using styles derived from the same dataset as the target content cannot completely remove domain-specific content, as the distributions of style and content overlap, whereas medically-irrelevant style transfer removes more domain-specific components by introducing completely irrelevant texture and color content into the target content images. Tobin et al. [27] showed that an object detection model that generalizes to real-world images can be trained using unrealistic simulated images with a diverse set of random textures, rather than by making the simulated images as realistic as possible. As in the human learning process, learning class-specific and domain-irrelevant patterns from data is essential for deep learning models, and the style transfer technique can be a powerful tool for controlling the representations that models learn.

Although data augmentation is widely used when training deep learning models for medical imaging tasks, its potential has not yet been fully studied and remains an active area of research. Moreover, the optimal configuration of data augmentation methods may vary among applications. As our study suggests, data augmentation can be a simple yet powerful tool for learning domain-agnostic representations.
Further research is warranted to identify optimal data augmentation techniques for a variety of medical imaging tasks, and medically-irrelevant transformations such as the proposed STRAP method should be considered alongside established methods.

STRAP achieved higher performance in the out-of-distribution setting than in the in-distribution setting, whereas the opposite was observed for the two baseline approaches and the state-of-the-art models. This is counterintuitive, considering the widely known problem that deep learning models generalize poorly to unseen out-of-distribution datasets. In the experiment in section 2.8, the training data in the in-distribution setting came from a mixed-domain heterogeneous dataset, whereas the training images used for the out-of-distribution setting derived from a single-source homogeneous dataset. Although it is often said that diverse multi-institutional datasets are needed for training models that generalize well on unseen data [28], our study suggests that well-curated homogeneous datasets may provide value in training domain-agnostic models, provided the model has sufficient capability to learn domain-invariant and class-specific representations, similar to the way in which humans learn from a set of representative examples (e.g., content presented in textbooks).

Besides supervised learning, our approach may be applicable to self-supervised learning. Contrastive learning frameworks, such as SimCLR [29] and MoCo [30], learn representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss, thus relying heavily on a stochastic data augmentation module. Chen et al. [29] showed that the composition of data augmentation operations is crucial in yielding effective representations, and that unsupervised contrastive learning benefits from strong data augmentation. In medical imaging, contrastive learning may require a tailored composition of data augmentation operations, and STRAP has the potential to serve as one of its core transformation operations.

One limitation of this study is that we only tested our approach on one particular classification task in the field of computational pathology. Further studies are warranted to investigate whether our approach proves efficacious and robust 1) for other classification tasks, 2) for non-classification tasks such as detection, segmentation, and survival prediction, and 3) in other medical imaging domains, such as radiology, ophthalmology, and dermatology.

In conclusion, we have introduced STRAP, a form of data augmentation based on random style transfer with artistic paintings, for learning domain-agnostic visual representations in computational pathology. Our experiments demonstrated that our approach yields significant improvements in test performance on a specific classification task, particularly in the presence of domain shift. Our study provides evidence that 1) CNNs are reliant on low-level texture content and are therefore vulnerable to domain shifts, and that 2) STRAP can be a practical tool for mitigating that reliance and, therefore, a possible solution for learning domain-agnostic representations in computational pathology.

Acknowledgements
This work was funded by the Stanford Departments of Biomedical Data Science and Pathology, through a Stanford Clinical Data Science Fellowship to RY. We would like to thank Blaine Burton Rister for detailed and valuable feedback on the manuscript. We would also like to thank Nandita Bhaskhar, Khaled Kamal Saab, and Jared Dunnmon for their helpful discussions.
References

[1] Joaquin Quiñonero-Candela, Masashi Sugiyama, Neil D Lawrence, and Anton Schwaighofer. Dataset Shift in Machine Learning. MIT Press, 2009.
[2] Xingchao Peng, Zijun Huang, Ximeng Sun, and Kate Saenko. Domain agnostic learning with disentangled representations. In ICML, 2019.
[3] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
[4] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, June 2016.
[5] Philip T G Jackson, Amir Atapour Abarghouei, Stephen Bonner, Toby P Breckon, and Boguslaw Obara. Style augmentation: data augmentation via style randomization. In CVPR Workshops, pages 83–92, 2019.
[6] Marc Macenko, Marc Niethammer, J S Marron, David Borland, John T Woosley, Xiaojun Guan, Charles Schmitt, and Nancy E Thomas. A method for normalizing histology slides for quantitative analysis. In IEEE International Symposium on Biomedical Imaging, pages 1107–1110, June 2009.
[7] David Tellez, Geert Litjens, Péter Bándi, Wouter Bulten, John-Melle Bokhorst, Francesco Ciompi, and Jeroen van der Laak. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Med. Image Anal., 58:101544, December 2019.
[8] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
[9] Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P Xing. High-frequency component helps explain the generalization of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8684–8694, 2020.
[10] Bhuvanesh Awasthi, Jason Friedman, and Mark A Williams. Faster, stronger, lateralized: low spatial frequency information supports face processing. Neuropsychologia, 49(13):3583–3590, November 2011.
[11] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[12] Rikiya Yamashita, Jin Long, Teri Longacre, Lan Peng, Gerald Berry, Brock Martin, John Higgins, Daniel L Rubin, and Jeanne Shen. Deep learning model for the prediction of microsatellite instability in colorectal cancer: a diagnostic study. Lancet Oncol., 22(1):132–141, January 2021.
[13] Jakob Nikolas Kather. Histological images for MSI vs. MSS classification in gastrointestinal cancer, FFPE samples [data set]. Zenodo, http://doi.org/10.5281/zenodo.2530835, 2019.
[14] The Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature, 487(7407):330–337, July 2012.
[15] Amelie Echle, Heike Irmgard Grabsch, Philip Quirke, Piet A van den Brandt, Nicholas P West, Gordon G A Hutchins, Lara R Heij, Xiuxiang Tan, Susan D Richman, Jeremias Krause, Elizabeth Alwers, Josien Jenniskens, Kelly Offermans, Richard Gray, Hermann Brenner, Jenny Chang-Claude, Christian Trautwein, Alexander T Pearson, Peter Boor, Tom Luedde, Nadine Therese Gaisa, Michael Hoffmeister, and Jakob Nikolas Kather. Clinical-grade detection of microsatellite instability in colorectal tumors by deep learning. Gastroenterology, 159(4):1406–1416.e11, October 2020.
[16] Jakob Nikolas Kather, Alexander T Pearson, Niels Halama, Dirk Jäger, Jeremias Krause, Sven H Loosen, Alexander Marx, Peter Boor, Frank Tacke, Ulf Peter Neumann, Heike I Grabsch, Takaki Yoshikawa, Hermann Brenner, Jenny Chang-Claude, Michael Hoffmeister, Christian Trautwein, and Tom Luedde. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med., 25(7):1054–1056, June 2019.
[17] Daisuke Komura and Shumpei Ishikawa. Machine learning methods for histopathological image analysis. Comput. Struct. Biotechnol. J., 16:34–42, February 2018.
[18] M Sandler, A Howard, M Zhu, A Zhmoginov, and L Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, June 2018.
[19] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis., 115(3):211–252, December 2015.
[20] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Netw., 12(1):145–151, January 1999.
[21] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In D Lee, M Sugiyama, U Luxburg, I Guyon, and R Garnett, editors, Advances in Neural Information Processing Systems, volume 29, pages 3630–3638. Curran Associates, Inc., 2016.
[22] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol., 57(1):289–300, 1995.
[23] Pietro Antonio Cicalese, Aryan Mobiny, Pengyu Yuan, Jan Becker, Chandra Mohan, and Hien Van Nguyen. StyPath: Style-transfer data augmentation for robust histology image classification. arXiv preprint arXiv:2007.05008, 2020.
[24] Seo Jeong Shin, Seng Chan You, Hokyun Jeon, Ji Won Jung, Min Ho An, Rae Woong Park, and Jin Roh. Style transfer strategy for developing a generalizable deep learning application in digital pathology. Comput. Methods Programs Biomed., 198:105815, January 2021.
[25] Agnieszka Mikołajczyk and Michał Grochowski. Style transfer-based image synthesis as an efficient regularization technique in deep learning. Pages 42–47, 2019.
[26] Tamás Nyíri and Attila Kiss. Style transfer for dermatological data augmentation. In Intelligent Systems and Applications, pages 915–923. Springer International Publishing, 2020.
[27] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. arXiv preprint arXiv:1703.06907, 2017.
[28] Christopher J Kelly, Alan Karthikesalingam, Mustafa Suleyman, Greg Corrado, and Dominic King. Key challenges for delivering clinical impact with artificial intelligence. BMC Med., 17(1):195, October 2019.
[29] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[30] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.