A Petri Dish for Histopathology Image Analysis
Jerry Wei, Arief Suriawinata, Bing Ren, Xiaoying Liu, Mikhail Lisovsky, Louis Vaickus, Charles Brown, Michael Baker, Naofumi Tomita, Lorenzo Torresani, Jason Wei, Saeed Hassanpour
Dartmouth College, Dartmouth-Hitchcock Medical Center
[email protected]

Abstract
With the rise of deep learning, there has been increased interest in using neural networks for histopathology image analysis, a field that investigates the properties of biopsy or resected specimens that are traditionally manually examined under a microscope by pathologists. In histopathology image analysis, however, challenges such as limited data, costly annotation, and processing high-resolution and variable-size images create a high barrier of entry and make it difficult to quickly iterate over model designs. Throughout scientific history, many significant research directions have leveraged small-scale experimental setups as petri dishes to efficiently evaluate exploratory ideas, which are then validated in large-scale applications. For instance, the Drosophila fruit fly in genetics and MNIST in computer vision are well-known petri dishes. In this paper, we introduce a minimalist histopathology image analysis dataset (MHIST), an analogous petri dish for histopathology image analysis. MHIST is a binary classification dataset of 3,152 fixed-size images of colorectal polyps, each with a gold-standard label determined by the majority vote of seven board-certified gastrointestinal pathologists. MHIST also includes each image's annotator agreement level. As a minimalist dataset, MHIST occupies less than 400 MB of disk space, and a ResNet-18 baseline can be trained to convergence on MHIST in just 6 minutes using approximately 3.5 GB of memory on an NVIDIA RTX 3090. As example use cases, we use MHIST to study natural questions that arise in histopathology image classification, such as how dataset size, network depth, transfer learning, and high-disagreement examples affect model performance. By introducing MHIST, we hope not only to help facilitate the work of current histopathology imaging researchers, but also to make histopathology image analysis more accessible to the general computer vision community. Our dataset is available at https://bmirds.github.io/MHIST/.
Figure 1: Key features of our minimalist histopathology image analysis dataset (MHIST).
1. Introduction
Scientific research has aimed to study and build our understanding of the world, and although many problems initially seemed too ambitious, they were ultimately surmounted. In these quests, a winning approach has often been to break down large ideas into smaller components, learn about these components through experiments that can be iterated on quickly, and then validate or translate those ideas into large-scale applications. For example, in the Human Genome Project (which helped us understand much of what we know now about human genetics), many fundamental discoveries resulted from petri dish experiments (small setups that saved time, energy, and money) on simpler organisms. In particular, the Drosophila fruit fly, an organism that is inexpensive to culture, has short life cycles, produces large numbers of embryos, and can be easily genetically modified, has been used in biomedical research for over a century to study a broad range of phenomena [1].

In deep learning, we have our own set of petri dishes in the form of benchmark datasets, of which MNIST [2] is one of the most popular. Comprising the straightforward problem of classifying handwritten digits in 28 by 28 pixel images, MNIST is easily accessible, and training a strong classifier on it has become a simple task with today's tools. Because it is so easy to evaluate models on MNIST, it has served as the exploratory environment for many ideas that were then validated on large-scale datasets or implemented in end-to-end applications. For example, many well-known concepts such as convolutional neural networks, generative adversarial networks [3], and the Adam optimization algorithm [4] were initially validated on MNIST.

In the field of histopathology image analysis, however, no such classic dataset currently exists, for several potential reasons.
To start, most health institutions have neither the technology nor the capacity to scan histopathology slides at the scale needed to create a reasonably-sized dataset. Even for institutions that are able to collect data, a barrage of complex data processing and annotation decisions falls upon the aspiring researcher, as histopathology images are large and difficult to process, and data annotation requires the valuable time of trained pathologists. Finally, even after data is processed and annotated, it can be challenging to obtain institutional review board (IRB) approval for releasing such datasets, and some institutions may wish to keep such datasets private. As a result of the inaccessibility of data, histopathology image analysis has remained on the fringes of computer vision research, with many popular image datasets dealing with domains where data collection and annotation are more straightforward.

To address these challenges, which have plagued deep learning for histopathology image analysis since the beginning of the area, in this paper we introduce MHIST: a minimalist histopathology image classification dataset. MHIST is minimalist in that it comprises a straightforward binary classification task of fixed-size colorectal polyp images, a common and clinically-significant task in gastrointestinal pathology. MHIST contains 3,152 fixed-size images, each with a gold-standard label determined from the majority vote of seven board-certified gastrointestinal pathologists, that can be used to train a baseline model without additional data processing. By releasing this dataset publicly, we hope not only that current histopathology image researchers can build models faster, but also that general computer vision researchers looking to apply models to datasets other than classic benchmarks can easily explore the exciting area of histopathology image analysis.
2. Background
Deep learning for medical image analysis has recently seen increased interest in analyzing histopathology images (large, high-resolution scans of histology slides that are typically examined under a microscope by pathologists) [5]. To date, deep neural networks have already achieved pathologist-level performance on classifying diseases such as prostate cancer, breast cancer, lung cancer, and melanoma [6–12], demonstrating their large potential. Despite these successes, histopathology image analysis has not seen the same level of popularity as analysis of other medical image types (e.g., radiology images or CT scans), likely because the nature of histopathology images creates a number of hurdles that make it challenging to directly apply mainstream computer vision methods. Below, we list some factors that can potentially impede the research workflow in histopathology image analysis:

• High-resolution, variable-size images. Because the disease patterns in histology slides can occur in only certain sections of the tissue and can be detected only at certain magnifications under the microscope, histopathology images are typically scanned at high resolution so that all potentially-relevant information is preserved. This means that while each sample contains lots of data, storing these large, high-resolution images is non-trivial. For instance, the slides from a single patient in the CAMELYON17 challenge [13] range from 2 GB to 18 GB in size, which is up to one hundred times larger than the entire CIFAR-10 dataset. Moreover, the size and aspect ratios of the slides can differ based on the shape of the specimen in question; sometimes, multiple large specimens are scanned into one slide, and so some slides may be up to an order of magnitude larger than others. As deep neural networks typically require fixed-shape inputs, preprocessing decisions such as what magnification to analyze the slides at and how to deal with variable-size inputs can be difficult to make.
• Cost of annotation. Whereas data annotation in deep learning has been simplified by services such as Mechanical Turk, there is no well-established service for annotating histopathology images, a process which requires substantial time from experienced pathologists who are often busy with clinical service. Moreover, access to one or two pathologists is often inadequate because inter-annotator agreement is low to moderate for most tasks, and so annotations can easily be biased towards the personal tendencies of annotators.
• Unclear annotation guidelines. It is also unclear what type of annotation is needed for high-resolution whole-slide images, as a slide may be given a certain diagnosis based on a small portion of diseased tissue, but the overall diagnosis would not apply to the normal portions of the tissue. Researchers often opt to have pathologists draw bounding boxes and annotate areas with their respective histological characteristics, but this comes with substantial costs, both in teaching pathologists how to use annotation software and in increased annotation time and effort.
• Lack of data. Even once these challenges are addressed, it is often the case that training data is relatively limited and the test set is not sufficiently large, due to slides being discarded as a result of poor quality or to remove classes that are too rare to include in the classification task. This makes it difficult to distinguish accurately between models, and models are therefore easily prone to overfitting.

To mitigate these challenges of data collection and annotation, in this paper we introduce a minimalist histopathology dataset that will allow researchers to quickly train a histopathology classification model without dealing with an avalanche of complex processing and annotation decisions. Our dataset focuses on the binary classification of colorectal polyps, a straightforward task that is common in a gastrointestinal pathologist's workflow. Instead of using whole-slide images, which are too large for most academic researchers to train on directly, our dataset consists only of 224 ×
224 pixel image tiles of tissue; these images can be directly fed into standard computer vision models such as ResNet. Finally, for annotations, each patch in our dataset was directly classified by seven board-certified gastrointestinal pathologists and given a gold-standard label based on their majority vote.

Our dataset aims to serve as a petri dish for histopathology image analysis. That is, it represents a simple task that can be learned quickly, and it is easy to iterate over. Our dataset allows researchers to quickly test inductive biases that can later be implemented in larger-scale applications, without dealing with the confounding factors that arise from the nature of histopathology images. We hope that our dataset will allow researchers to more easily explore histopathology image analysis and that this can facilitate further research in the field as a whole.
3. MHIST Dataset
In the context of the challenges mentioned in the above section, MHIST has several notable features that we view favorably in a minimalist dataset:

1. Straightforward binary classification setup that is challenging and clinically important.
2. Adequate yet tractable number of examples: 2,175 training and 977 testing images.
3. Fixed-size images of appropriate dimension for standard models.
4. Gold-standard labels from the majority vote of seven pathologists, along with annotator agreement levels that can be used for more-specific model tuning.

The rest of this section details the colorectal polyp classification task and our data collection and annotation process.

Colorectal cancer is the second leading cause of cancer death in the United States, with an estimated 53,200 deaths in 2020 [14]. As a result, colonoscopy is one of the most common cancer screening programs in the United States [15], and classification of colorectal polyps (growths inside the colon lining that can lead to colonic cancer if left untreated) is one of the highest-volume tasks in pathology. Our task focuses on the clinically-important binary distinction between hyperplastic polyps (HPs) and sessile serrated adenomas (SSAs), a challenging problem with considerable inter-pathologist variability [16–20]. HPs are typically benign, while SSAs are precancerous lesions that can turn into cancer if left untreated and require sooner follow-up examinations [21]. Pathologically, HPs have a superficial serrated architecture and elongated crypts, whereas SSAs are characterized by broad-based crypts, often with complex structure and heavy serration [22].
For our data collection, we scanned 328 Formalin-Fixed Paraffin-Embedded (FFPE) whole-slide images of colorectal polyps, which were originally diagnosed on the whole-slide level as hyperplastic polyps (HPs) or sessile serrated adenomas (SSAs), from patients at the Dartmouth-Hitchcock Medical Center. These slides were originally scanned by an Aperio AT2 scanner at 40x resolution; to increase the field of view, we compress the slides with 8x magnification. From these 328 whole-slide images, we then extracted 3,152 image tiles (portions of size 224 × 224 pixels).

For data annotation, we worked with seven board-certified gastrointestinal pathologists at the Dartmouth-Hitchcock Medical Center. Each pathologist individually and independently classified each image in our dataset as either HP or SSA based on the World Health Organization criteria from 2019 [23]. After labels were collected for all images from all pathologists, the gold-standard label for each image was assigned based on the majority vote of the seven individual labels, a common choice in the literature [24–33]. The distribution of each class in our dataset based on the gold-standard labels of each image is shown in Table 1.

        Train   Test   Total
HP      1,545   617    2,162
SSA     630     360    990
Total   2,175   977    3,152

Table 1: Number of images in our dataset's training and testing sets for each class. HP: hyperplastic polyp (benign), SSA: sessile serrated adenoma (precancerous).

Figure 2: Distribution of annotator agreement levels for images in our dataset.

In our dataset, the average percent agreement between each pair of annotators was 72.9%, and each pathologist agreed with the majority vote an average of 83.2% of the time. The mean of the per-pathologist Cohen's κ was 0.450, in the moderate range of 0.41-0.60. Although not directly comparable with prior work, a similar evaluation found a Cohen's κ of 0.380 among four pathologists [20].
To facilitate research that might consider the annotator agreement of examples during training, for each image we also provide the agreement level among our annotators (4/7, 5/7, 6/7, or 7/7). Figure 2 shows the distribution of agreement levels (in addition to the gold-standard labels) for our dataset.
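The gold-standard label and agreement level for each image, as well as the pairwise-agreement statistic reported above, follow mechanically from the seven votes. A minimal sketch (the function names and vote encoding are our own, not part of the released dataset):

```python
from collections import Counter
from itertools import combinations

def gold_label_and_agreement(votes):
    """Gold-standard label (majority of the 7 votes) and agreement level
    (4, 5, 6, or 7 agreeing annotators). With 7 binary votes, ties are
    impossible, so a strict majority always exists."""
    assert len(votes) == 7, "each MHIST image is labeled by seven pathologists"
    (label, count), = Counter(votes).most_common(1)
    return label, count

def mean_pairwise_agreement(votes_per_image):
    """Average, over all pairs of annotators, of the fraction of images on
    which the pair agrees (the statistic reported as 72.9% for MHIST)."""
    n_annotators = len(votes_per_image[0])
    scores = []
    for i, j in combinations(range(n_annotators), 2):
        same = sum(v[i] == v[j] for v in votes_per_image)
        scores.append(same / len(votes_per_image))
    return sum(scores) / len(scores)

label, level = gold_label_and_agreement(["HP", "HP", "SSA", "HP", "HP", "SSA", "HP"])
# label == "HP"; level == 5, i.e. a 5/7-agreement image
```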
4. Example Use Cases
In this section, we demonstrate example use cases of our dataset by performing experiments to investigate several natural questions that arise in histopathology image analysis: namely, how network depth, transfer learning, and annotator agreement affect model performance.

For our experiments, we follow the DeepSlide code repository [11] and use the ResNet architecture, a common choice for classifying histopathology images. Specifically, for our default baseline, we use ResNet-18 and train our model for 100 epochs (well past convergence) using stochastic data augmentation with the Adam optimizer [4], a batch size of 32, and a learning rate decayed by a factor of 0.91 per epoch.

For more-robust evaluation, for each model we consider the five highest AUCs on the test set, which is evaluated at every epoch. We report the mean and standard deviation of these values calculated over 10 different random seeds.

Furthermore, we train our models with four different training set sizes: n = 100, n = 200, n = 400, and Full, where n is the number of training images per class and Full is the entire training set. To obtain subsets of the training set, we randomly sample n images for each class from the training set for each seed. We keep our testing set fixed to ensure that models are evaluated equally.

We first study whether adding more layers to our model improves performance on our dataset. Because deeper models take longer to train, identifying the smallest model that achieves the best performance allows for maximum accuracy with the least necessary training time. We evaluate all five ResNet models proposed in [34] (ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152) on our dataset, and all hyperparameters (e.g., number of epochs, batch size) are kept constant; we only change the model depth.

Table 2: Model performance (AUC (%) on the test set) for five different ResNet depths at each training set size (n = 100, n = 200, n = 400, and Full). Adding more layers to the model does not improve performance. n indicates the number of images per class used for training. Means and standard deviations shown are for 10 random seeds.
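The evaluation protocol described above (keep the five highest per-epoch test AUCs for each seed, then summarize) can be sketched in a few lines. The helper name and the exact aggregation across seeds are our assumptions, not code from the paper:

```python
import statistics

def summarize_runs(per_seed_epoch_aucs, k=5):
    """For each random seed, keep the k highest per-epoch test AUCs, then
    return the mean and standard deviation over all kept values. One
    reading of the paper's protocol; the cross-seed aggregation is an
    assumption on our part."""
    kept = [auc
            for epoch_aucs in per_seed_epoch_aucs
            for auc in sorted(epoch_aucs, reverse=True)[:k]]
    return statistics.mean(kept), statistics.pstdev(kept)

# Two hypothetical seeds, each with three per-epoch AUCs; keep top 2 of each.
mean_auc, std_auc = summarize_runs([[0.50, 0.90, 0.80], [0.70, 0.60, 0.95]], k=2)
```

Summarizing over the best few epochs rather than the final epoch reduces sensitivity to where training happens to stop.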
Table 3: Using weights pretrained on ImageNet significantly improves the performance of ResNet-18 on our dataset (AUC (%) on the test set by training set size, with and without pretraining). n indicates the number of images per class used for training. Means and standard deviations shown are for 10 random seeds.

As shown in Table 2, adding more layers does not significantly improve performance at any training dataset size. Furthermore, adding model depth past ResNet-34 actually decreases performance, as models that are too deep will begin to overfit, especially for smaller training set sizes. For example, when training with only 100 images per class, mean AUC decreases by 5.1% when using ResNet-152 compared to ResNet-18.

We posit that increasing network depth does not improve performance on our dataset because our dataset is relatively small, and so deeper networks are unnecessary for the amount of information in our dataset. Moreover, increasing network depth may increase overfitting on the training data due to our dataset's small size. Our results are consistent with findings presented by Benkendorf and Hawkins [35]: deeper networks only perform better than shallow networks when trained with large sample sizes.

We also examine the usefulness of transfer learning for our dataset. Transfer learning can often be easily implemented into existing models, and so it is helpful to know whether or not it can improve performance. Because deeper models do not achieve better performance on our dataset (as shown in Section 4.2), we use ResNet-18 initialized with random weights as the baseline model for this experiment. We then compare our baseline with an identical ResNet-18 model (i.e., all hyperparameters are the same) that has been initialized with weights pretrained on ImageNet. Table 3 shows the results for our ResNet-18 model with and without pretraining.
We find that ResNet-18 initialized with ImageNet-pretrained weights significantly outperforms ResNet-18 initialized with random weights. For example, our pretrained model's performance when trained with only 100 images per class is comparable to our baseline model's performance when trained with the full training dataset. When both models are trained on the full training set, the pretrained model outperforms the baseline by 8.2%, as measured by mean AUC. These results indicate that, for our dataset, using pretrained weights can be extremely helpful for improving overall performance.

Training Images Used                AUC (%) on test set
Very Easy Images Only               79.9
Easy Images Only
Very Easy and Easy Images
Very Easy, Easy, and Hard Images
All Images                          84.5

Table 4: Model performance when training on subsets of images selected by annotator agreement level.

As many datasets already contain annotator agreement data [24–30, 33, 37–41], we also study whether there are certain ways of selecting examples based on their annotator agreement level that will maximize performance. Examples with high annotator disagreement are, by definition, harder to classify, so they may not always contain features that are beneficial for training models. For this reason, we focus primarily on whether training on only examples with higher annotator agreement will improve performance.

For our dataset, which was labeled by seven annotators, we partition our images into four discrete levels of difficulty: very easy (7/7 agreement among annotators), easy (6/7 agreement), hard (5/7 agreement), and very hard (4/7 agreement), following our prior work [42]. We then train ResNet-18 models using different combinations of images selected based on difficulty: very easy images only; easy images only; very easy and easy images; and very easy, easy, and hard images.
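The training configurations just listed amount to filtering the training set by agreement level. A minimal sketch (the record layout and function name are hypothetical, not the dataset's actual file format):

```python
# Hypothetical per-image records: (image_id, gold_label, agreement), where
# agreement is how many of the seven annotators voted for the gold label.
DIFFICULTY = {7: "very easy", 6: "easy", 5: "hard", 4: "very hard"}

def select_by_difficulty(records, allowed):
    """Keep only training images whose difficulty level is in `allowed`,
    e.g. {"very easy", "easy", "hard"} to drop only very hard images."""
    return [r for r in records if DIFFICULTY[r[2]] in allowed]

records = [("img1", "HP", 7), ("img2", "SSA", 4), ("img3", "HP", 6)]
kept = select_by_difficulty(records, {"very easy", "easy"})
# img2 (4/7 agreement, i.e. very hard) is excluded
```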
For this experiment, we do not modify the dataset size as we did in Sections 4.2 and 4.3, since selecting training images based on difficulty inherently changes the training set size.

As shown in Table 4, we find that excluding images with high annotator disagreement (i.e., hard and very hard images) during training achieves comparable performance to training with all images. Using only very easy images or only easy images, however, does not match or exceed performance when training with all images. We also find that training with all images except very hard images slightly outperforms training with all images. One explanation for this is that very hard images, which have only 4/7 annotator agreement, could be too challenging to analyze accurately (even for expert humans), so their features might not be beneficial for training machine learning models either.

Dataset            Task                             Images  Image Type          Annotation Type       Annotators      Dataset Size
MITOS (2012) [43]  Mitosis Detection                50      High Power Fields   Pixel-Level           2 Pathologists  ∼
CAMELYON17 [13]    Metastasis Detection             1,000   Whole Slide Images  Contour of Locations  1 Pathologist   ∼
MHIST (Ours)       Colorectal Polyp Classification  3,152   Fixed-Size Images   Image-Wise            7 Pathologists  ∼333 MB

Table 5: Comparison of well-known histopathology datasets. Our proposed dataset, MHIST, is advantageous due to its relatively small size (making it faster to obtain results) and its robust annotations.
5. Related Work
Due to the trend towards larger, more computationally-expensive models [48], as well as recent attention on the environmental considerations of training large models [49], the deep learning community has begun to question whether model development needs to occur at scale. In machine learning, two recent papers have brought attention to this idea. Rawal et al. [50] proposed a novel surrogate model for rapid architecture development, an artificial setting that predicts the ground-truth performance of architectural motifs. Greydanus [51] proposed MNIST-1D, a low-compute alternative to MNIST that differentiates more clearly between models. Our dataset falls within this direction and is heavily inspired by this work.

In the histopathology image analysis domain specifically, several datasets are currently available. Perhaps the two best-known datasets are CAMELYON17 [13] and PCam [45]. CAMELYON17 focuses on breast cancer metastasis detection in whole-slide images (WSI) and includes a training set of 1,000 WSI with labeled locations. CAMELYON17 is well-established, but it contains WSIs, each taking up several gigabytes of disk space. PCam, in contrast, comprises fixed-size patches of 96 × 96 pixels extracted from CAMELYON17. While PCam is similar to our work in that it considers fixed-size images for binary classification, we note two key differences. First, the annotations in PCam are derived from bounding-box annotations which were drawn by a student and then checked by a pathologist. For challenging tasks with high annotator disagreement, however, using only a single annotator can cause the model to learn the specific tendencies of a single pathologist. In our dataset, on the other hand, each image is directly classified by seven expert pathologists, and the gold standard is set as the majority vote of the seven labels, mitigating the potential biases that can arise from having only a single annotator. Second, whereas PCam takes up around 7 GB of disk space, our dataset aims to be minimalist and is therefore an order of magnitude smaller, making it faster for researchers to obtain results and iterate over models.

In Table 5, we compare our dataset with other previously-proposed histopathology image analysis datasets. Our dataset is much smaller than other datasets, yet it still has enough examples to serve as a petri dish in that it can test models and return results quickly. Additionally, our dataset has robust annotations in comparison to other histopathology datasets. Datasets frequently have only one or two annotators, but MHIST is annotated by seven pathologists, making it the least influenced by the biases that any singular annotator may have.
6. Discussion
The inherent nature of histopathology image classification can create challenges for researchers looking to apply mainstream computer vision methods. Not only are histopathology images themselves difficult to handle because they have high resolutions and are variable-sized, but accurately and efficiently annotating histopathology images is a nontrivial task. Furthermore, being able to address these challenges does not guarantee a high-quality dataset, as histopathology images are difficult to acquire, and so data is often quite limited.

Based on a thorough analysis of these challenges, we have presented MHIST, a histopathology image classification dataset with a straightforward yet challenging binary classification task. MHIST comprises a total of 3,152 fixed-size images that have been appropriately preprocessed for inputting into standard computer vision models such as ResNet. In addition to providing these images, we also include each image's gold-standard label and its degree of annotator agreement.

We aim to have provided a dataset that can serve as a petri dish for histopathology image analysis. We hope that researchers can use MHIST to test models on a smaller scale before implementing them in large-scale applications, and that our dataset will facilitate further research into deep learning for histopathology image analysis.

References

[1] Jennings, B.H.: Drosophila – a versatile model in biology & medicine. Materials Today (5), 190–195 (2011)
[2] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (11), 2278–2324 (1998), https://ieeexplore.ieee.org/document/726791
[3] Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks (2014), https://arxiv.org/pdf/1406.2661.pdf
[4] Kingma, D., Ba, J.: Adam: A method for stochastic optimization.
International Conference on Learning Representations (2014), https://arxiv.org/pdf/1412.6980.pdf
[5] Srinidhi, C.L., Ciga, O., Martel, A.L.: Deep neural network models for computational histopathology: A survey. Medical Image Analysis, 101813 (2020)
[6] Arvaniti, E., Fricker, K.S., Moret, M., Rupp, N., Hermanns, T., Fankhauser, C., Wey, N., Weild, P.J., Ruschoff, J.H., Claassen, M.: Automated Gleason grading of prostate cancer tissue microarrays via deep learning. Nature Scientific Reports (2018)
[7] Bulten, W., Pinckaers, H., van Boven, H., Vink, R., de Bel, T., van Ginneken, B., van der Laak, J., Hulsbergen-van de Kaa, C., Litjens, G.: Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study. The Lancet Oncology (2), 233–241 (2020), https://arxiv.org/pdf/1907.07980.pdf
[8] Hekler, A., Utikal, J.S., Enk, A.H., Berking, C., Klode, J., Schadendorf, D., Jansen, P., Franklin, C., Holland-Letz, T., Krahl, D., von Kalle, C., Fröhling, S., Brinker, T.J.: Pathologist-level classification of histopathological melanoma images with deep neural networks. European Journal of Cancer, 79–83 (2019)
[9] Shah, M., Wang, D., Rubadue, C., Suster, D., Beck, A.: Deep learning assessment of tumor proliferation in breast cancer histological images. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 600–603 (2017), https://ieeexplore.ieee.org/abstract/document/8217719
[10] Ström, P., Kartasalo, K., Olsson, H., Solorzano, L., Delahunt, B., Berney, D.M., Bostwick, D.G., Evans, A.J., Grignon, D.J., Humphrey, P.A., Iczkowski, K.A., Kench, J.G., Kristiansen, G., van der Kwast, T.H., Leite, K.R.M., McKenney, J.K., Oxley, J., Pan, C., Samaratunga, H., Srigley, J.R., Takahashi, H., Tsuzuki, T., Varma, M., Zhou, M., Lindberg, J., Bergström, C., Ruusuvuori, P., Wählby, C., Grönberg, H., Rantalainen, M., Egevad, L., Eklund, M.: Pathologist-level grading of prostate biopsies with artificial intelligence.
CoRR (2019), http://arxiv.org/pdf/1907.01368
[11] Wei, J.W., Tafe, L.J., Linnik, Y.A., Vaickus, L.J., Tomita, N., Hassanpour, S.: Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks. Scientific Reports (2019)
[12] Zhang, Z., Chen, P., McCough, M., Xing, F., Wang, C., Bui, M., Xie, Y., Sapkota, M., Cui, L., Dhillon, J., Ahmad, N., Khalil, F.K., Dickinson, S.I., Shi, X., Liu, F., Su, H., Cai, J., Yang, L.: Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nature Machine Intelligence, 236–245 (2019)
[13] Bándi, P., Geessink, O., Manson, Q., Van Dijk, M., Balkenhol, M., Hermsen, M., Ehteshami Bejnordi, B., Lee, B., Paeng, K., Zhong, A., Li, Q., Zanjani, F.G., Zinger, S., Fukuta, K., Komura, D., Ovtcharov, V., Cheng, S., Zeng, S., Thagaard, J., Dahl, A.B., Lin, H., Chen, H., Jacobsson, L., Hedlund, M., Çetin, M., Halıcı, E., Jackson, H., Chen, R., Both, F., Franke, J., Küsters-Vandevelde, H., Vreuls, W., Bult, P., van Ginneken, B., van der Laak, J., Litjens, G.: From detection of individual metastases to classification of lymph node status at the patient level: The CAMELYON17 challenge. IEEE Transactions on Medical Imaging (2), 550–560 (2019), https://ieeexplore.ieee.org/document/8447230
[14] Colorectal cancer statistics. Accessed: 2021-01-06
[15] Rex, D.K., Boland, C.R., Dominitz, J.A., Giardiello, F.M., Johnson, D.A., Kaltenbach, T., Levin, T.R., Lieberman, D., Robertson, D.J.: Colorectal cancer screening: Recommendations for physicians and patients from the U.S. Multi-Society Task Force on Colorectal Cancer. Gastroenterology, 307–323 (2017)
[16] Abdeljawad, K., Vemulapalli, K.C., Kahi, C.J., Cummings, O.W., Snover, D.C., Rex, D.K.: Sessile serrated polyp prevalence determined by a colonoscopist with a high lesion detection rate and an experienced pathologist.
Gastrointestinal Endoscopy, 517–524 (2015), https://pubmed.ncbi.nlm.nih.gov/24998465/
[17] Farris, A.B., Misdraji, J., Srivastava, A., Muzikansky, A., Deshpande, V., Lauwers, G.Y., Mino-Kenudson, M.: Sessile serrated adenoma: challenging discrimination from other serrated colonic polyps. The American Journal of Surgical Pathology, 30–35 (2008), https://pubmed.ncbi.nlm.nih.gov/18162767/
[18] Glatz, K., Pritt, B., Glatz, D., Hartmann, A., O'Brien, M.J., Glaszyk, H.: A multinational, internet-based assessment of observer variability in the diagnosis of serrated colorectal polyps. American Journal of Clinical Pathology, 938–945 (2007), https://pubmed.ncbi.nlm.nih.gov/17509991/
[19] Khalid, O., Radaideh, S., Cummings, O.W., O'Brien, M.J., Goldblum, J.R., Rex, D.K.: Reinterpretation of histology of proximal colon polyps called hyperplastic in 2001. World Journal of Gastroenterology, 3767–3770 (2009)
[20] Wong, N.A.C.S., Hunt, L.P., Novelli, M.R., Shepherd, N.A., Warren, B.F.: Observer agreement in the diagnosis of serrated polyps of the large bowel. Histopathology (2009), https://pubmed.ncbi.nlm.nih.gov/19614768/
[21] Understanding your pathology report: Colon polyps (sessile or traditional serrated adenomas). Accessed: 2021-01-06
[22] Gurudu, S.R., Heigh, R.I., Petries, G.D., Heigh, E.G., Leighton, J.A., Pasha, S.F., Malagon, I.B., Das, A.: Sessile serrated adenomas: Demographic, endoscopic and pathological characteristics. World Journal of Gastroenterology, 3402–3405 (2010)
[23] Nagtegaal, I.D., Odze, R.D., Klimstra, D., Paradis, V., Rugge, M., Schirmacher, P., Washington, K.M., Carneiro, F., Cree, I.A., WHO Classification of Tumours Editorial Board: The 2019 WHO classification of tumours of the digestive system.
Histopathology (2), 182–188 (2020), https://onlinelibrary.wiley.com/doi/abs/10.1111/his.13975
[24] Chilamkurthy, S., Ghosh, R., Tanamala, S., Biviji, M., Campeau, N.G., Venugopal, V.K., Mahajan, V., Rao, P., Warier, P.: Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. The Lancet, 2388–2396 (2018)
[25] Gulshan, V., Peng, L., Coram, M., Stumpe, M.C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., Kim, R., Raman, R., Nelson, P.C., Mega, J.L., Webster, D.R.: Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA (22), 2402–2410 (2016), https://jamanetwork.com/journals/jama/fullarticle/2588763
[26] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R.L., Shpanskaya, K.S., Seekins, J., Mong, D.A., Halabi, S.S., Sandberg, J.K., Jones, R., Larson, D.B., Langlotz, C.P., Patel, B.N., Lungren, M.P., Ng, A.Y.: CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. CoRR abs/1901.07031 (2019), http://arxiv.org/pdf/1901.07031
[27] Kanavati, F., Toyokawa, G., Momosaki, S., Rambeau, M., Kozuma, Y., Shoji, F., Yamazaki, K., Takeo, S., Iizuka, O., Tsuneki, M.: Weakly-supervised learning for lung carcinoma classification using deep learning. Scientific Reports (2020)
[28] Korbar, B., Olofson, A.M., Miraflor, A.P., Nicka, C.M., Suriawinata, M.A., Torresani, L., Suriawinata, A.A., Hassanpour, S.: Deep learning for classification of colorectal polyps on whole-slide images. Journal of Pathology Informatics (2017)
[29] Sertel, O., Kong, J., Catalyurek, U.V., Lozanski, G., Saltz, J.H., Gurcan, M.N.: Histopathological image analysis using model-based intermediate representations and color texture: Follicular lymphoma grading.
Journal of Signal Processing Systems (2009), https://link.springer.com/article/10.1007/s11265-008-0201-y
[30] Wang, S., Xing, Y., Zhang, L., Gao, H., Zhang, H.: Deep convolutional neural network for ulcer recognition in wireless capsule endoscopy: Experimental feasibility and optimization. Computational and Mathematical Methods in Medicine (2019)
[31] Wei, J., Wei, J., Jackson, C., Ren, B., Suriawinata, A., Hassanpour, S.: Automated detection of celiac disease on duodenal biopsy slides: A deep learning approach. Journal of Pathology Informatics (1), 7 (2019)
[32] Wei, J., Suriawinata, A., Liu, X., Ren, B., Nasir-Moin, M., Tomita, N., Wei, J., Hassanpour, S.: Difficulty translation in histopathology images. In: Artificial Intelligence in Medicine (AIME) (2020), https://arxiv.org/pdf/2004.12535.pdf
[33] Zhou, J., Luo, L.Y., Dou, Q., Chen, H., Chen, C., Li, G.J., Jiang, Z.F., Heng, P.A.: Weakly supervised 3D deep learning for breast cancer classification and localization of the lesions in MR images. Journal of Magnetic Resonance Imaging (4), 1144–1151 (2019), https://onlinelibrary.wiley.com/doi/abs/10.1002/jmri.26721
[34] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015), http://arxiv.org/pdf/1512.03385
[35] Benkendorf, D.J., Hawkins, C.P.: Effects of sample size and network depth on a deep learning approach to species distribution modeling. Ecological Informatics, 101137 (2020)
[36] Kornblith, S., Shlens, J., Le, Q.V.: Do better ImageNet models transfer better? CoRR abs/1805.08974 (2018), http://arxiv.org/pdf/1805.08974
[37] Coudray, N., Moreira, A.L., Sakellaropoulos, T., Fenyö, D., Razavian, N., Tsirigos, A.: Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nature Medicine, 1559–1567 (2017)
[38] Ehteshami Bejnordi, B., Veta, M., Johannes van Diest, P., van Ginneken, B., Karssemeijer, N., Litjens, G., van der Laak, J.A.W.M.: Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA (22), 2199–2210 (2017), https://jamanetwork.com/journals/jama/fullarticle/2665774
[39] Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.: Dermatologist-level classification of skin cancer with deep neural networks. Nature, 115–118 (2017)
[40] Ghorbani, A., Natarajan, V., Coz, D., Liu, Y.: DermGAN: Synthetic generation of clinical skin images with pathology (2019), https://arxiv.org/pdf/1911.08716.pdf
[41] Wei, J.W., Suriawinata, A.A., Vaickus, L.J., Ren, B., Liu, X., Lisovsky, M., Tomita, N., Abdollahi, B., Kim, A.S., Snover, D.C., Baron, J.A., Barry, E.L., Hassanpour, S.: Evaluation of a Deep Neural Network for Automated Classification of Colorectal Polyps on Histopathologic Slides. JAMA Network Open (4) (2020), https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2764906
[42] Wei, J., Suriawinata, A., Ren, B., Liu, X., Lisovsky, M., Vaickus, L., Brown, C., Baker, M., Nasir-Moin, M., Tomita, N., Torresani, L., Wei, J., Hassanpour, S.: Learn like a pathologist: Curriculum learning by annotator agreement for histopathology image classification. In: Winter Conference on Applications of Computer Vision (WACV) (2020)
[43] Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Mitosis detection in breast cancer histology images with deep neural networks. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013, pp. 411–418.
Springer Berlin Heidelberg, Berlin, Heidelberg (2013), https://link.springer.com/chapter/10.1007/978-3-642-40763-5_51
[44] Veta, M., Heng, Y.J., Stathonikos, N., Bejnordi, B.E., Beca, F., Wollmann, T., Rohr, K., Shah, M.A., Wang, D., Rousson, M., Hedlund, M., Tellez, D., Ciompi, F., Zerhouni, E., Lanyi, D., Viana, M., Kovalev, V., Liauchuk, V., Phoulady, H.A., Qaiser, T., Graham, S., Rajpoot, N., Sjöblom, E., Molin, J., Paeng, K., Hwang, S., Park, S., Jia, Z., Chang, E.I.C., Xu, Y., Beck, A.H., van Diest, P.J., Pluim, J.P.: Predicting breast tumor proliferation from whole-slide images: The TUPAC16 challenge. Medical Image Analysis, 111–121 (2019), https://doi.org/10.1016/j.media.2019.02.012
[45] Veeling, B.S., Linmans, J., Winkens, J., Cohen, T., Welling, M.: Rotation equivariant CNNs for digital pathology (Jun 2018), https://arxiv.org/pdf/1806.03962.pdf
[46] Aresta, G., Araújo, T., Kwok, S., Chennamsetty, S.S., Safwan, M., Alex, V., Marami, B., Prastawa, M., Chan, M., Donovan, M., Fernandez, G., Zeineh, J., Kohl, M., Walz, C., Ludwig, F., Braunewell, S., Baust, M., Vu, Q.D., To, M.N.N., Kim, E., Kwak, J.T., Galal, S., Sanchez-Freire, V., Brancati, N., Frucci, M., Riccio, D., Wang, Y., Sun, L., Ma, K., Fang, J., Kone, I., Boulmane, L., Campilho, A., Eloy, C., Polónia, A., Aguiar, P.: BACH: Grand challenge on breast cancer histology images. Medical Image Analysis, 122–139 (2019), https://doi.org/10.1016/j.media.2019.05.010
[47] Swiderska-Chadaj, Z., Pinckaers, H., van Rijthoven, M., Balkenhol, M., Melnikova, M., Geessink, O., Manson, Q., Sherman, M., Polonia, A., Parry, J., Abubakar, M., Litjens, G., van der Laak, J., Ciompi, F.: Learning to detect lymphocytes in immunohistochemistry with deep learning.
Medical Image Analysis, 101547 (2019), https://doi.org/10.1016/j.media.2019.101547
[48] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020), https://arxiv.org/pdf/2005.14165.pdf
[49] Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243 (2019), https://arxiv.org/pdf/1906.02243.pdf
[50] Rawal, A., Lehman, J., Such, F.P., Clune, J., Stanley, K.O.: Synthetic petri dish: A novel surrogate model for rapid architecture search. arXiv preprint arXiv:2005.13092 (2020), https://arxiv.org/pdf/2005.13092.pdf
[51] Greydanus, S.: Scaling down deep learning. arXiv preprint arXiv:2011.14439 (2020)