Confounding variables can degrade generalization performance of radiological deep learning models
John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, Eric K. Oermann
Department of Medicine, California Pacific Medical Center, San Francisco, CA 94115 ([email protected])
Verily Life Sciences, 269 E Grand Ave, South San Francisco, CA 94080 ([email protected], [email protected])
Department of Neurological Surgery, Icahn School of Medicine, New York, NY 10029 ([email protected], [email protected])
Department of Radiology, Icahn School of Medicine, New York, NY 10029 ([email protected])
* These authors contributed equally to this work
Author summary
Early results in using convolutional neural networks (CNNs) on x-rays to diagnose disease have been promising, but it has not yet been shown that models trained on x-rays from one hospital or one group of hospitals will work equally well at different hospitals. Before these tools are used for computer-aided diagnosis in real-world clinical settings, we must verify their ability to generalize across a variety of hospital systems.

A cross-sectional design was used to train and evaluate pneumonia screening CNNs on 158,323 chest x-rays from NIH (n=112,120 from 30,805 patients), Mount Sinai (n=42,396 from 12,904 patients), and Indiana (n=3,807 from 3,683 patients). In 3 / 5 natural comparisons, performance on chest x-rays from outside hospitals was significantly lower than on held-out x-rays from the original hospital systems. CNNs were able to detect where an x-ray was acquired (hospital system, hospital department) with extremely high accuracy and calibrate predictions accordingly.

The performance of CNNs in diagnosing diseases on x-rays may reflect not only their ability to identify disease-specific imaging findings on x-rays, but also their ability to exploit confounding information. Estimates of CNN performance based on test data from hospital systems used for model training may overstate their likely real-world performance.
Abstract
Background: There is interest in using convolutional neural networks (CNNs) to analyze medical imaging to provide computer-aided diagnosis (CAD). Recent work has suggested that image classification CNNs may not generalize to new data as well as previously believed. We assessed how well CNNs generalized across three hospital systems for a simulated pneumonia screening task.
Methods and Findings: A cross-sectional design with multiple model training cohorts was used to evaluate model generalizability to external sites. 158,323 chest radiographs were drawn from three institutions: NIH (112,120 from 30,805 patients), Mount Sinai Hospital (MSH; 42,396 from 12,904 patients), and Indiana (IU; 3,807 radiographs from 3,683 patients). These patient populations had mean (S.D.) ages of 46.9 (16.6), 63.2 (16.5), and 49.6 (17.0) years, and were 43.5%, 44.8%, and 57.1% female, respectively. We assessed individual models using area under the receiver operating characteristic curve (AUC) for radiographic findings consistent with pneumonia and compared performance on different test sets with DeLong's test. The prevalence of pneumonia was high enough at MSH (34.2%) relative to NIH and IU (1.2% and 1.0%) that merely sorting by hospital system achieved an AUC of 0.861 on the joint MSH-NIH dataset. Models trained on data from either NIH or MSH had equivalent performance on IU (p-values 0.580 and 0.273, respectively) and significantly inferior performance on data from each other relative to an internal test set (i.e., new data from within the hospital system used for training data).

Conclusions: Pneumonia screening CNNs achieved better internal than external performance in 3 / 5 natural comparisons. When models were trained on pooled data from sites with different pneumonia prevalence, they performed better on new pooled data from these sites but not on external data. CNNs robustly identified hospital system and department within a hospital, which can have large differences in disease burden and may confound disease predictions.
Introduction
There is significant interest in using convolutional neural networks (CNNs) to analyze radiology, pathology, or clinical imaging for the purposes of computer-aided diagnosis (CAD) [1–5]. These studies are generally performed utilizing CNN techniques that were pioneered on well characterized computer vision datasets including the ImageNet Large Scale Visual Recognition Competition (ILSVRC) and the Modified National Institute of Standards and Technology (MNIST) database of hand drawn digits [6, 7]. Training CNNs to classify images from these datasets is typically done by splitting the dataset into three subsets: train (data directly used to learn parameters for models), tune (data used to choose hyperparameter settings, also commonly referred to as 'validation'), and test (data used exclusively for performance evaluation of models learned using train and tune data). CNNs are trained to completion with the first two, and the final set is used to estimate the model's expected performance on new, previously unseen data.

An underlying premise of using the test set to infer future generalizability is that the test set is reflective of the data that will be encountered elsewhere. Recent work in computer vision has demonstrated that the true generalization performance of even classic CIFAR-10 photograph classification CNNs to new data may be lower than previously believed [8]. In the biomedical imaging context, we can contrast 'internal' model performance on new, previously unseen data gathered from the same hospital system(s) used for model training with 'external' model performance on new, previously unseen data from different hospital systems [9, 10]. External test data may be different in important ways from internal test data, and this may affect model performance, particularly if confounding variables exist in internal data that do not exist in external data [11]. In a large scale deep learning study of retinal fundoscopy, Ting et al. (2017) noted variation in performance of CNNs trained to identify ocular disease across external hospital systems, with area under the receiver operating characteristic curve (AUC) ranging from 0.889 to 0.983 and image-level concordance with human experts ranging from 65.8% to 91.2% on external datasets [4]. Despite the rapid rush to develop deep learning systems on radiological data for academic and commercial purposes, to date, no study has looked at whether radiological CNNs actually generalize to external data. If external test performance of a system is inferior to internal test performance, clinicians may erroneously believe systems to be more accurate than they truly are in the deployed context, creating the potential for patient harm.

The primary aim of this study was to obtain data from three separate hospital systems and to assess how well deep learning models trained at one hospital system generalized to other external hospital systems. For the purposes of this assessment, we chose the diagnosis of pneumonia on chest x-ray for both its clinical significance as well as common occurrence and significant interest [2]. By training and testing models on different partitions of data across three distinct institutions, we sought to establish whether a truly generalizable model could be learned, as well as which factors affecting external validity could be identified to aid clinicians when assessing models for potential clinical deployment.
Methods
Datasets
This study was approved by the Mount Sinai Health System Institutional Review Board; the requirement for patient consent was waived for this retrospective study that was deemed to carry minimal risk. Three datasets were obtained from different hospital groups: National Institutes of Health Clinical Center (NIH; 112,120 radiographs from 1992-2015), Indiana University Network for Patient Care (IU; 7,470 radiographs, date range not available), and Mount Sinai Hospital (MSH; 48,915 radiographs from 2009-2016) [1, 12]. This study did not have a prospective analysis plan, and all analyses performed are subsequently described.
Convolutional Neural Networks (CNNs)
Deep learning encompasses any algorithm that uses multiple layers of feed-forward neural networks to model phenomena [13]. Classification CNNs are a type of supervised deep learning model that take an image as input and output predicted class membership probabilities. A typical use of CNNs is classifying photographs according to the animals or objects they contain: a chihuahua, a stove, a speedboat, etc. [6]. Many different CNN architectures have been proposed, including the ResNet-50 and DenseNet-121 used in this paper, and improving the performance of these models is an active area of research [14, 15]. In practice, CNNs are frequently pre-trained on large computer vision databases, such as ImageNet, rather than being randomly initialized and trained de novo. After pre-training, the CNNs are then fine-tuned on the dataset of interest. This process of pre-training followed by fine-tuning reduces training time, promotes model convergence, and can regularize the model to reduce overfitting. A difficulty of using these models is that there are few formal guarantees as to their generalization performance [16]. In this paper, we use CNNs both to preprocess radiographs and to predict pneumonia in them.
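As an illustration of the pre-train/fine-tune pattern described above, the following minimal PyTorch sketch loads an ImageNet-pretrained ResNet-50 from torchvision and replaces its final classification layer for a binary radiograph task. The single-logit head, loss, optimizer settings, and batch are illustrative assumptions, not the study's exact configuration.

```python
# Minimal sketch of ImageNet pre-training followed by fine-tuning.
# Paths, labels, and optimizer settings are placeholders, not the study's pipeline.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True)           # load ImageNet-pretrained weights
model.fc = nn.Linear(model.fc.in_features, 1)      # replace 1000-way head with one logit

criterion = nn.BCEWithLogitsLoss()                 # binary label, e.g. frontal vs. lateral
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed settings

images = torch.randn(8, 3, 224, 224)               # placeholder batch of resized radiographs
labels = torch.randint(0, 2, (8,)).float()         # placeholder binary labels

# One fine-tuning step: all pretrained weights are updated at a small learning rate.
optimizer.zero_grad()
loss = criterion(model(images).squeeze(1), labels)
loss.backward()
optimizer.step()
```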
Preprocessing: Frontal View Filtering
NIH data contained only frontal chest radiographs, while IU and MSH data contained both frontal and lateral chest radiographs and were found to contain inconsistent frontal and lateral labels on manual review. 402 IU and 490 MSH radiographs were manually labeled as frontal/lateral and randomly divided into groups (IU: 200 train, 100 tune, 102 test; MSH: 200 train, 100 tune, 190 test) and used to train ResNet-50 CNNs to identify frontal radiographs [14]. 187/190 MSH and 102/102 IU test radiographs were accurately classified. The datasets were then filtered to frontal radiographs using these CNNs, leaving a total of 158,323 radiographs (112,120 NIH, 42,396 MSH, and 3,807 IU) available for analysis (Figure 1).
Fig 1. Preprocessing approach.
Preprocessing: Generating Labels for Pathology
IU radiographs were manually labeled by curators after review of the accompanying text radiology reports [12]. NIH radiographs were labeled automatically using a proprietary natural language processing (NLP) system based on expanding sentences as parse trees and using hand-crafted rules based on the MeSH vocabulary to identify statements indicating positive pathology [2].

MSH radiographs did not initially include labels, so a subset of radiographic reports were manually labelled to train an NLP algorithm that could infer labels for the full dataset. 405 radiographic reports were manually labeled for cardiomegaly, emphysema, effusion, hernia, nodule, atelectasis, pneumonia, edema, and consolidation. To evaluate the NLP algorithm's performance, these were split into train and test groups (283 and 122, respectively). A previously described NLP concept extraction model based on 1- and 2-gram bag-of-words features with Lasso logistic regression was trained to identify reports positive for these findings [17]. AUC, sensitivity, and specificity at a 50% classification threshold are reported in Table 1. The NLP model was then refit with all 405 manually labelled reports and used to process all unlabelled reports. As reports positive for hernia occurred too infrequently to use this NLP algorithm, reports were automatically labeled as positive for hernia if the word 'hernia' appeared in the report.
Table 1. Performance of NLP algorithm on 30% test data.
Finding / AUC / Sensitivity / Specificity
Cardiomegaly
Emphysema
Effusion
Atelectasis
Pneumonia
Edema
Consolidation
Nodule
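The report-labeling model described above (1- and 2-gram bag-of-words features with Lasso logistic regression) can be sketched roughly as follows with scikit-learn. The example reports, labels, and regularization strength are placeholders rather than the study's actual data or settings.

```python
# Sketch of the report-labeling NLP model: 1-/2-gram bag-of-words features with
# Lasso (L1-penalized) logistic regression. Reports and labels are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

reports = ["findings consistent with pneumonia in the right lower lobe",
           "no acute cardiopulmonary abnormality"]            # placeholder report texts
labels = [1, 0]                                               # placeholder finding labels

clf = Pipeline([
    ("ngrams", CountVectorizer(ngram_range=(1, 2))),          # 1- and 2-gram bag of words
    ("lasso_lr", LogisticRegression(penalty="l1", solver="liblinear", C=1.0)),
])
clf.fit(reports, labels)

# Probability that a new report is positive for the finding.
probs = clf.predict_proba(["patchy opacity, possibly early pneumonia"])[:, 1]
```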
Preprocessing: Separation of Patients Across Train / Tune / Test Groups
As NIH and MSH data contained patient identifiers, all NIH and MSH patients were separated into fixed train (70%), tune (10%), and test (20%) groups (Figure 2). IU data did not contain patient identifiers. In the case of pneumonia detection, 100% of IU data was reserved for use as an external test set. IU data was used for training only to detect hospital system, and in this case was separated into fixed train (70%), tune (10%), and test (20%) groups using an identifier corresponding to accession number (i.e., which radiographs were obtained at the same time on the same patient). Test data was not available to CNNs during model training, and all results reported in this study are calculated exclusively on test data.
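A minimal sketch of patient-level 70/10/20 splitting follows, using scikit-learn's GroupShuffleSplit so that all radiographs from a given patient stay in a single partition. The identifiers and the two-stage construction are illustrative assumptions, not the study's exact procedure.

```python
# Sketch of patient-level 70/10/20 train/tune/test splitting so that no patient's
# radiographs appear in more than one partition. Identifiers are placeholders.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

radiograph_ids = np.arange(1000)                       # placeholder radiograph indices
patient_ids = np.random.randint(0, 300, size=1000)     # placeholder patient identifiers

# First carve out 20% of data (by patient) as the test group.
outer = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
trainval_idx, test_idx = next(outer.split(radiograph_ids, groups=patient_ids))

# Then split the remaining 80% into train (70% overall) and tune (10% overall).
inner = GroupShuffleSplit(n_splits=1, test_size=0.125, random_state=0)  # 0.125 * 0.8 = 0.10
train_idx, tune_idx = next(inner.split(trainval_idx, groups=patient_ids[trainval_idx]))
train_idx, tune_idx = trainval_idx[train_idx], trainval_idx[tune_idx]
```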
Fig 2. Cohort splitting diagram.

Preprocessing: Identifying Mount Sinai Portable Scans From Inpatient Wards and Emergency Department
Of 42,396 MSH radiographs, 39,574 contained a label indicating whether they were portable radiographs; 31,838 were labeled as portable. We identified a subset of 31,076 MSH portable radiographs that documented the department of acquisition, with 28,841 from inpatient wards and 2,235 from the emergency department.
Model Training
PyTorch 0.2.0 and torchvision were used for model training [18]. All images were resized to 224 x 224. CNNs used for experiments were trained with the DenseNet-121 architecture with an additional dense layer (n=15) attached to the original bottleneck layer and sigmoid activation (for binary classification), or a linear layer with output dimension equal to the number of classes followed by softmax activation (for classification with more than two classes).
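A rough sketch of the classification head described above: an ImageNet-pretrained DenseNet-121 whose original classifier is replaced by a 15-unit dense layer with per-label sigmoid outputs (applied here through BCEWithLogitsLoss). The optimizer, learning rate, and batch shown are assumptions, not the paper's training configuration.

```python
# Sketch of the multi-label head: DenseNet-121 bottleneck features feeding a 15-unit
# dense layer, one independent sigmoid-activated output per diagnosis label.
import torch
import torch.nn as nn
from torchvision import models

n_labels = 15
model = models.densenet121(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, n_labels)  # 1024 -> 15

criterion = nn.BCEWithLogitsLoss()       # applies the per-label sigmoid internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed settings

images = torch.randn(4, 3, 224, 224)                       # placeholder resized radiographs
targets = torch.randint(0, 2, (4, n_labels)).float()       # placeholder multi-label targets

loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```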
Internal and External Performance Testing

To assess how individual models trained using single datasets would generalize compared to a model trained simultaneously on multiple datasets, we trained CNNs to predict nine overlapping diagnoses (cardiomegaly, emphysema, effusion, hernia, nodule, atelectasis, pneumonia, edema, and consolidation) using 3 different train set combinations: NIH, MSH, and a joint NIH-MSH train set. We were interested only in the prediction of pneumonia and included other diagnoses to improve overall model training and performance. For each model, we calculated AUC, accuracy, sensitivity, and specificity for 4 different test sets: joint NIH-MSH, NIH only, MSH only, and IU. We report differences in test AUC for all possible internal-external comparisons. We consider the joint MSH-NIH test set the internal comparison set for the jointly trained model. We additionally report differences in test AUC between a jointly trained MSH-NIH model and individual MSH-NIH test sets. Classification threshold was set to ensure 95% sensitivity on each test set to simulate model use for a theoretical screening task. After external review of this analysis, a trivial model that ranked cases based only on the average pneumonia prevalence in each hospital system's training data and completely ignored radiographic findings was evaluated on the MSH-NIH test set to evaluate how hospital system alone can predict pneumonia in the joint dataset.
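The evaluation described above can be sketched as computing test-set AUC and then choosing the classification threshold that yields at least 95% sensitivity. The labels and scores below are placeholders, not study data.

```python
# Sketch of the screening evaluation: AUC plus a threshold chosen for >= 95% sensitivity.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.random.randint(0, 2, size=500)           # placeholder pneumonia labels
y_score = np.random.rand(500)                         # placeholder CNN probabilities

auc = roc_auc_score(y_true, y_score)

# roc_curve returns false positive rates, true positive rates (sensitivity), thresholds.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
idx = np.argmax(tpr >= 0.95)                          # first threshold reaching 95% sensitivity
threshold = thresholds[idx]
sensitivity, specificity = tpr[idx], 1 - fpr[idx]
```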
Hospital System and Department Prediction
After training models for pneumonia and evaluating their performance across sites, additional analysis was planned to better understand a CNN's ability to detect site and department and how that could affect pneumonia prediction. We trained a CNN to predict hospital system from radiographs to assess whether location information was directly detectable from the radiograph alone. Radiographs from all three hospital systems were utilized to learn a model that could identify the hospital system from which a given radiograph was drawn. To develop this concept more granularly, for MSH radiographs, we further identified from which hospital unit individual radiographs were obtained (inpatient wards, emergency department). In all cases, we report the classification accuracy on a held-out test set.
Sample Activation Maps

We created 7x7 sample activation maps following Zhou et al. (2015) to attempt to understand which locations in chest radiographs provided strong evidence for hospital system [19]. For this experiment, we specifically identified radiographs from the NIH. For a sample of NIH test radiographs (n=100), we averaged the softmax probability for each subregion, calculated as

P(\text{hospital} = \text{NIH} \mid \text{radiograph}_{i,j}) = \frac{e^{Y_{i,j}^{NIH}}}{e^{Y_{i,j}^{NIH}} + e^{Y_{i,j}^{MSH}} + e^{Y_{i,j}^{IU}}},

where i, j corresponds to the subregion at the i-th row and j-th column of the final convolutional layer (7x7 = 49 subregions), and where each

Y_{i,j}^{h} = \sum_{k=1}^{K} \beta_{k}^{h} X_{k,i,j} + \beta^{h}

for hospital system h, with the sum performed over the K final convolutional layers and X_{k,i,j} representing the activation at the i-th row and j-th column of the k-th final convolutional layer. To characterize how many different subregions were typically involved in NIH hospital system classification, we report the mean, minimum, and maximum number of subregions that predicted NIH decisively (probability >= 95%). To illustrate the contribution of particularly influential features (e.g., laterality labels) to classification, we present several examples of heatmaps generated by calculating Y_{i,j}^{NIH} - Y_{i,j}^{MSH} - Y_{i,j}^{IU} for all i, j subregions in an image and subtracting the mean. This additional calculation was necessary to distinguish their positive contribution in the context of many subregions contributing positively to classification probability.
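The per-subregion scores defined above can be computed by applying the final linear layer's weights at each spatial position of the last convolutional feature map, in the spirit of Zhou et al.'s class activation mapping. The tensor shapes and the stand-in classifier below are illustrative assumptions rather than the trained model's actual weights.

```python
# Sketch of the per-subregion hospital-system probabilities (CAM-style):
# Y[h, i, j] = sum_k beta[h, k] * X[k, i, j] + beta_0[h], then softmax over hospitals.
import torch
import torch.nn.functional as F

K, H, W, n_hospitals = 1024, 7, 7, 3
features = torch.randn(K, H, W)              # placeholder final conv activations X[k, i, j]
fc = torch.nn.Linear(K, n_hospitals)         # placeholder classifier with weights beta[h, k]

Y = torch.einsum("hk,kij->hij", fc.weight, features) + fc.bias[:, None, None]

# Per-subregion probability that the radiograph came from each hospital system.
probs = F.softmax(Y, dim=0)                  # shape (3, 7, 7)
nih_map = probs[0]                           # e.g., P(hospital == NIH | subregion i, j)
```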
Engineered Relative Risk Experiment

We wished to assess the hypothesis that a CNN could learn to exploit large differences in pathology prevalence between two hospital systems in training data by calibrating its predictions to the baseline prevalence at each hospital system, rather than exclusively discriminating based on direct pathology findings. This would lead to strong performance on a test dataset consisting of imbalanced data from both hospital systems but would fail to generalize to data from an external hospital system. To test this hypothesis, we simulated experimental cohorts that differed only in relative disease prevalence and performed internal and external evaluations as described above. Five cohorts of 20,000 patients consisting of 10,000 MSH and 10,000 NIH patients were specifically sampled to artificially set different levels of pneumonia prevalence in each population, while maintaining a constant overall prevalence: NIH Severe (NIH 9.9%, MSH 0.1%), NIH Mild (NIH 9%, MSH 1%), Balanced (NIH 5%, MSH 5%), MSH Mild (MSH 9%, NIH 1%), MSH Severe (MSH 9.9%, NIH 0.1%). The sampling routine also ensured that males and females had equal prevalence of pneumonia. We refer to these as 'engineered prevalence cohorts.' Train / tune / test splits consistent with prior modeling were maintained for these experiments. CNNs were trained on each cohort in the fashion previously described, and test AUCs on internal joint MSH-NIH and external IU data were compared.
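One way to construct such a cohort is sketched below: sampling a fixed number of patients per site at a target pneumonia prevalence. The metadata DataFrame and helper function are hypothetical, and the study's actual routine additionally balanced pneumonia prevalence across sexes.

```python
# Sketch of building one engineered prevalence cohort (e.g., "NIH Mild": 9% pneumonia
# at NIH, 1% at MSH, 5% overall). Metadata here is synthetic placeholder data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
metadata = pd.DataFrame({
    "site": rng.choice(["NIH", "MSH"], size=50_000),        # placeholder site labels
    "pneumonia": rng.binomial(1, 0.15, size=50_000),         # placeholder pneumonia labels
})

def sample_engineered_cohort(df, site, n_total, prevalence, seed=0):
    """Sample n_total patients from one site at the requested pneumonia prevalence."""
    site_df = df[df["site"] == site]
    n_pos = int(round(n_total * prevalence))
    pos = site_df[site_df["pneumonia"] == 1].sample(n_pos, random_state=seed)
    neg = site_df[site_df["pneumonia"] == 0].sample(n_total - n_pos, random_state=seed)
    return pd.concat([pos, neg])

# Hypothetical "NIH Mild" cohort: 10,000 patients per site, 9% vs. 1% pneumonia.
cohort = pd.concat([
    sample_engineered_cohort(metadata, "NIH", 10_000, 0.09),
    sample_engineered_cohort(metadata, "MSH", 10_000, 0.01),
])
```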
Statistical Methods
To assess differences between classification models, we used either the paired or unpaired version of DeLong's test for ROC curves as appropriate [20]. Comparisons between proportions were performed utilizing χ² tests, and all p-values were assessed at an alpha of 0.05. Statistical analysis was performed using R version 3.4 with the pROC package and scikit-learn 0.18.1 [21, 22].
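For the proportion comparisons, a χ² test on a 2x2 contingency table can be sketched in Python as follows (the study itself used R with pROC for the ROC comparisons); the counts here are placeholders.

```python
# Sketch of a chi-squared comparison of two proportions (e.g., pneumonia prevalence
# in two groups) using a 2x2 contingency table of placeholder counts.
from scipy.stats import chi2_contingency

#              positive  negative
table = [[240, 360],      # group A (placeholder counts)
         [150, 850]]      # group B (placeholder counts)

chi2, p_value, dof, expected = chi2_contingency(table)
significant = p_value < 0.05   # alpha used in the study
```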
Results

Datasets
The average age of patients in the MSH cohort was 63.2 years (S.D. 16.5), compared to 49.6 years (S.D. 17.0) in the IU cohort and 46.9 years (S.D. 16.6) in the NIH cohort (Table 2). Positive cases of pneumonia were remarkably more prevalent in MSH data (34.2%) than in either NIH (1.2%) or IU (1.0%) data; these differences were significant by χ² test.
Table 2. Baseline characteristics of datasets by hospital system.
Characteristic IU MSH NIH
Patient demographics
No. patient radiographs 3,807 42,396 112,120
No. patients 3,683 12,904 30,805
Age, mean (SD), years 49.6 (17.0) 63.2 (16.5) 46.9 (16.6)
No. females (%)* 643 (57.1%) 18,993 (44.8%) 48,780 (43.5%)
Image diagnosis frequencies
Pneumonia, No. (%) 39 (1.0%) 14,515 (34.2%) 1,353 (1.2%)
Emphysema, No. (%) 62 (1.6%) 1,308 (3.1%) 2,516 (2.2%)
Effusion, No. (%) 142 (3.7%) 19,536 (46.1%) 13,307 (11.9%)
Consolidation, No. (%) 26 (0.7%) 25,318 (59.7%) 4,667 (4.2%)
Nodule, No. (%) 104 (2.7%) 569 (1.3%) 6,323 (5.6%)
Atelectasis, No. (%) 307 (8.1%) 16,713 (39.4%) 11,535 (10.3%)
Edema, No. (%) 45 (1.2%) 7,144 (16.9%) 2,303 (2.1%)
Cardiomegaly, No. (%) 328 (8.6%) 14,285 (33.7%) 2,772 (2.5%)
Hernia, No. (%) 46 (1.2%) 228 (0.5%) 227 (0.2%)
*Sex data available for 1,122 / 3,807 Indiana and 42,383 / 42,396 Mount Sinai; age data available for 112,077 / 112,120 NIH
Internal and External Performance Testing
Fig 3. Pneumonia models evaluated on internal and external test sets. A model trained using both Mount Sinai and NIH data (MSH+NIH) had higher performance on the combined MSH+NIH test set than on either subset individually or on fully external Indiana (IU) data. (ROC curves plot sensitivity against false alarm rate for each test dataset, with internal and external validation indicated.)
The internal performance of pneumonia detection CNNs significantly exceeded external performance in 3 / 5 natural comparisons (Figure 3, Table 3). CNNs trained to detect pneumonia at NIH had internal test AUC 0.750 (95% C.I. 0.721-0.778) and significantly worse external test AUC 0.695 at MSH (95% C.I. 0.683-0.706); their external AUC of 0.725 on IU (95% C.I. 0.644-0.807) was not significantly different from internal performance (p = 0.580). CNNs trained at MSH had internal test AUC 0.802 (95% C.I. 0.793-0.812) and significantly worse external test AUC 0.717 at NIH (95% C.I. 0.687-0.746), while their external AUC of 0.756 on IU (95% C.I. 0.674-0.838) was not significantly different (p = 0.273). The jointly trained MSH+NIH model had internal test AUC 0.931 (95% C.I. 0.927-0.936) on the combined test set and significantly lower external test AUC 0.815 on IU (95% C.I. 0.745-0.885). Full results for all train and test combinations are reported in Table 3.

Table 3. Internal and external pneumonia screening performance for all train-tune and test hospital system combinations.
Train-Tune Site / Comparison Type* / Test Site (Images) / AUC (95% C.I.) / Acc. / Sens. / Spec. / PPV / NPV

NIH
Internal NIH (N=22,062) 0.750 (0.721-0.778) 0.255 0.951 0.247 0.015 0.998
External MSH (N=8,388) 0.695 (0.683-0.706) 0.476 0.950 0.212 0.401 0.884
External IU (N=3,807) 0.725 (0.644-0.807) 0.190 0.974 0.182 0.012 0.999
Superset MSH + NIH (N=30,450) 0.773 (0.766-0.780) 0.462 0.950 0.403 0.160 0.985
Superset MSH + NIH + IU (N=34,257) 0.787 (0.780-0.793) 0.470 0.950 0.418 0.148 0.987

MSH
Internal MSH (N=8,388) 0.802 (0.793-0.812) 0.617 0.950 0.432 0.482 0.940
External NIH (N=22,062) 0.717 (0.687-0.746) 0.184 0.951 0.175 0.014 0.997
External IU (N=3,807) 0.756 (0.674-0.838) 0.099 0.974 0.090 0.011 0.997
Superset MSH + NIH (N=30,450) 0.862 (0.856-0.868) 0.562 0.950 0.516 0.190 0.989
Superset MSH + NIH + IU (N=34,257) 0.871 (0.865-0.877) 0.577 0.950 0.537 0.180 0.990

MSH + NIH
Internal MSH + NIH (N=30,450) 0.931 (0.927-0.936) 0.732 0.950 0.706 0.279 0.992
Subset NIH (N=22,062) 0.733 (0.703-0.762) 0.243 0.951 0.234 0.015 0.997
Subset MSH (N=8,388) 0.805 (0.796-0.814) 0.630 0.950 0.451 0.491 0.942
External IU (N=3,807) 0.815 (0.745-0.885) 0.238 0.974 0.230 0.013 0.999
Superset MSH + NIH + IU (N=34,257) 0.934 (0.929-0.938) 0.732 0.950 0.709 0.258 0.993

*Superset = a test dataset containing data from the same distribution (hospital system) as the training data as well as external data. Subset = a test dataset containing data from fewer distributions (hospital systems) than the training data.
Hospital System and Department Prediction
A CNN trained to identify hospital system accurately identified 22,050/22,062 (99.95%) of NIH, 8,386/8,388 (99.98%) of MSH, and 737/771 (95.59%) of IU test radiographs. To identify radiographs originating from a specific hospital system, such as NIH, CNNs used features from many different image regions (Figure 4a); the majority of image subregions were individually able to predict the hospital system with >= 95% certainty (mean 35.7 / 49, 72.9%; min 21, max 49; N = 100 NIH radiographs). Laterality labels were particularly influential (Figure 4b-4c).

A CNN trained to identify individual departments within MSH accurately identified 5,805/5,805 (100%) of inpatient radiographs and 449/449 (100%) of emergency department radiographs. Patients who received portable radiographs on an inpatient floor had a higher prevalence of pneumonia than those in the emergency department (41.1% versus 32.8%, respectively; the difference was significant by χ² test).

Fig 4. CNN to predict hospital system detected both general and specific image features. (a) We obtained activation heatmaps from our trained model and averaged over a sample of images to reveal which subregions tended to contribute to a hospital system classification decision. Many different subregions strongly predicted the correct hospital system, with especially strong contributions from image corners. (b-c) On individual images, which have been normalized to highlight only the most influential regions and not all those that contributed to a positive classification, we note that the CNN has learned to detect a metal token that radiology technicians place on the patient in the corner of the image field of view at the time they capture the image. When these strong features are correlated with disease prevalence, models can leverage them to indirectly predict disease.
Engineered Relative Risk Experiment
Artificially increasing the difference in the prevalence of pneumonia between MSH and NIH led to CNNs that performed increasingly well on internal testing but not external testing (Table 4). CNNs trained on engineered prevalence cohorts of NIH and MSH data showed stronger internal AUC on a joint NIH-MSH test set when the prevalence of pneumonia was imbalanced between the two hospital systems in the training dataset, with MSH Severe AUC 0.899 (95% C.I. 0.885-0.914) and NIH Severe AUC 0.849 (95% C.I. 0.826-0.871), compared to 0.739 (95% C.I. 0.707-0.772) for the Balanced cohort. This internal advantage did not carry over to external IU test data, where AUC ranged from 0.641 to 0.732 across the engineered cohorts (Table 4, Figure 5).

Fig 5. Assessing how prevalence differences in aggregated datasets encouraged confounder exploitation. (A) Five cohorts of 20,000 patients engineered to differ only in relative pneumonia risk based on hospital system. Model performance was assessed on combined test data from the internal hospital systems (MSH+NIH) and separately on test data from an external hospital system (IU). (B) Although models performed better in internal testing in the presence of extreme prevalence differences, this benefit was not seen when applied to data from new hospital systems. The natural relative risk of disease at Mount Sinai (MSH), indicated by a vertical line, was quite imbalanced.
Table 4. Internal and external pneumonia screening performance for datasets with engineered pneumonia prevalences.
Engineered MSH-NIH Cohort / Test Site (Images) / AUC (95% C.I.) / Acc. / Sens. / Spec. / PPV / NPV

MSH Severe
Internal Engineered (N=3,886) 0.899 (0.885-0.914) 0.690 0.953 0.674 0.146 0.996
External IU (N=3,807) 0.641 (0.552-0.730) 0.111 0.974 0.102 0.011 0.997

MSH Mild
Internal Engineered (N=3,930) 0.860 (0.839-0.882) 0.523 0.951 0.497 0.103 0.994
External IU (N=3,807) 0.650 (0.548-0.752) 0.050 0.974 0.041 0.010 0.994

Balanced
Internal Engineered (N=3,838) 0.739 (0.707-0.772) 0.325 0.951 0.289 0.071 0.991
External IU (N=3,807) 0.732 (0.645-0.819) 0.057 0.974 0.048 0.010 0.994

NIH Mild
Internal Engineered (N=3,960) 0.807 (0.778-0.836) 0.439 0.952 0.414 0.074 0.994
External IU (N=3,807) 0.703 (0.616-0.790) 0.175 0.974 0.167 0.012 0.998

NIH Severe
Internal Engineered (N=3,928) 0.849 (0.826-0.871) 0.572 0.954 0.552 0.100 0.996
External IU (N=3,807) 0.683 (0.591-0.775) 0.051 0.974 0.042 0.010 0.994
Discussion
We have demonstrated that pneumonia screening CNNs trained on data from individual or multiple hospital systems did not consistently generalize to external sites, nor did they make predictions exclusively based on underlying pathology. We note that the issue of not generalizing externally is distinct from typical train/test performance degradation, in which overfitting to training data leads to lower performance on testing data: in our experiments, all results are reported on held-out test data exclusively in both internal and external comparisons. Performance of the jointly trained MSH-NIH model on the joint test set (AUC 0.931) was higher than performance on either individual dataset (AUC 0.805 and 0.733, respectively), likely because the model was able to calibrate to different prevalences across hospital systems in the joint test set by detecting specific features in imaging. For comparison, a simple calibration-based non-CNN model that used hospital system pneumonia prevalence only to make pneumonia predictions and ignored image features achieved AUC 0.861 on the joint MSH-NIH test set due to the large difference in pneumonia prevalence between the MSH and NIH test sets.

By engineering cohorts of varying prevalence, we demonstrated that the more pneumonia rates differed between hospital systems, the more they were exploited to make predictions, which led to poor generalization on external datasets. We noted that metallic tokens indicating laterality often appeared in radiographs in a site-specific way, which made hospital system identification trivial. However, CNNs did not require this indicator: most image subregions contained features indicative of a radiograph's origin. These results suggest that CNNs could rely on subtle differences in acquisition protocol, image processing, or distribution pipeline (e.g., image compression) and overlook pathology. This can lead to strong internal performance that is not realized on data from new sites. Even in the absence of recognized confounders, we would caution, following Recht et al., that "current accuracy numbers are brittle and susceptible to even minute natural variations in the data distribution" [8].

Furthermore, high-resolution radiological images are frequently aggressively downsampled (e.g., to 224 x 224 pixels) to facilitate transfer learning, i.e., fine-tuning CNNs pretrained on ImageNet [15]. While practically convenient, these low-resolution pretrained models are not optimal for the radiological context because downsampling may eliminate important details in imaging, and we believe the loss of valuable radiographic findings may lead to an increased reliance on confounding factors in making predictions. CNN architectures designed specifically to accommodate the higher resolution of radiological imaging have demonstrated promising early results [23, 24]. Given the significant interest in utilizing deep learning to analyze radiological imaging, our findings should give pause to considerations of rapid deployment without thorough vetting of models.
No prior studies have assessed whether radiological CNNs generalize to external datasets, which is particularly concerning as there are numerous protocolized factors that can significantly skew the features in a given radiological image. Even the development of customized deep learning models that are trained, tuned, and tested with the intent of deploying at a single site is not necessarily a solution that can control for potential confounding variables. At a finer level, we found that CNNs could separate portable radiographs from the inpatient wards and emergency department in MSH data with 100% accuracy, and that these patient groups had significantly different prevalences of pneumonia. It was determined after the fact that devices from different manufacturers had been used in the inpatient units (Konica Minolta) and emergency department (Fujifilm), and the latter were stored in PACS in an inverted color scheme (i.e., air appears white) along with distinctive text indicating laterality and use of a portable scanner. While these identifying features were prominent to the model, they only became apparent to us after manual image review. If certain scanners within a hospital are used to evaluate patients with different baseline disease prevalences (e.g., ICU versus outpatient), these may confound deep learning models trained on radiological data. Fully external testing, ideally on data gathered from a varied collection of hospitals, can reveal and account for such sampling biases that may limit the generalizability of a model.

While we have focused our analysis on examining degradation of model performance on external test sets, we note that it is possible for external test set performance to be either better or worse than internal. Many different aspects of dataset construction (e.g., inclusion criteria, labeling procedure) and the underlying clinical data (pathology prevalence and severity, confounding protocolized variables) can affect performance. For example, a model trained on noisily-labeled data that included all available imaging might reasonably be expected to have lower internal test performance than if tested externally on a similar dataset manually selected and labeled by a physician as clear examples of pathological and normal cases.

In addition to site-specific confounding variables that threaten generalizability, there are other factors related to medical management that may exist everywhere but undermine the clinical applicability of a model. As has been noted, chest drains that treat pneumothorax frequently appear in studies positive for pneumothorax in NIH data; a CNN for pneumothorax may learn to detect obvious chest drains rather than the more subtle pneumothorax itself, and might inaccurately give negative diagnoses to patients presenting with pneumothoraces because they lacked a chest drain [25]. Ultimately, if CNN-based systems are to be used for medical diagnosis, they must be tailored to carefully considered clinical questions, prospectively tested at a variety of sites in real-world use scenarios, and carefully assessed to determine how they impact diagnostic accuracy.

There are several limitations to this study. Most notably, without more granular details on the underlying patient populations, we are unable to fully assess what factors might be contributing to the hospital system-specific biasing of the models.
The extremely high incidence of pneumonia in the MSH dataset is also a point of concern; however, we attribute this to differences in the underlying patient populations and variability in classification thresholds for pathology. First, a majority of MSH radiographs were portable inpatient scans, ordered for patients too unstable to travel to the radiology department for a standard radiograph. In contrast, all IU radiographs were outpatient. While the inpatient/outpatient mix from NIH is not reported, we believe it likely contains a substantial outpatient percentage given that the incidence of pneumonia is similar to IU. Second, our NLP approach for MSH assigned positive ground truth labels more liberally than NIH or IU, marking a study as positive for pathology when a radiologist explicitly commented on it as a possibility in a report, indicating that the radiographic appearance was consistent with the finding. Different radiologists may have different thresholds at which they explicitly include a possible diagnosis in their reports. Researchers working in this area will continually have to make decisions about their classification threshold for labeling a study positive or negative. We believe that either of these two factors can drive large differences in prevalences of pathology across datasets, and this variation can confound diagnostic CNNs.

An additional limitation was that radiologic diagnoses are made in the context of a patient's history and clinical presentation, something not incorporated into our approach. Positive findings on chest radiograph are necessary but not sufficient for the diagnosis of pneumonia, which is only made when the patient also exhibits a "constellation of suggestive clinical features" [26]. Modeling approaches that combine clinical data with imaging findings, reflecting how radiologists practice, may be able to elucidate the contribution of each piece of information and offer more informative predictions. Finally, the relatively small size and low number of pneumonia cases in Indiana data led to wide confidence intervals in IU test AUC and may have limited our ability to detect external performance degradation in some cases. Nevertheless, many key comparisons achieved statistical significance even with this smaller external dataset.
Conclusion
Pneumonia screening CNNs achieved better internal than external performance in 3 / 5 natural comparisons. When models were trained on pooled data from sites with different pneumonia prevalence, they performed better on new pooled data from these sites but not on external data. CNNs robustly identified hospital system and department within a hospital, which can have large differences in disease burden and may confound disease predictions.

References

[1] Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. arXiv (cs.CV); 2017.
[2] Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. arXiv (cs.CV); 2017.
[3] Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA. 2016;316(22):2402–2410.
[4] Ting DSW, Cheung CYL, Lim G, Tan GSW, Quang ND, Gan A, et al. Development and Validation of a Deep Learning System for Diabetic Retinopathy and Related Eye Diseases Using Retinal Images From Multiethnic Populations With Diabetes. JAMA. 2017;318(22):2211–2223.
[5] Kermany DS, Goldbaum M, Cai W, Valentim CCS, Liang H, Baxter SL, et al. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell. 2018;172(5):1122–1131.e9.
[6] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet Large Scale Visual Recognition Challenge. arXiv (cs.CV); 2014.
[7] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. In: Proceedings of the IEEE; 1998.
[8] Recht B, Roelofs R, Schmidt L, Shankar V. Do CIFAR-10 Classifiers Generalize to CIFAR-10? arXiv (cs.LG); 2018.
[9] Rothwell PM. External validity of randomised controlled trials: "To whom do the results of this trial apply?". Lancet. 2005;365(9453):82–93.
[10] Pandis N, Chung B, Scherer RW, Elbourne D, Altman DG. CONSORT 2010 statement: extension checklist for reporting within person randomised trials. BMJ. 2017;357:j2835.
[11] Cabitza F, Rasoini R, Gensini GF. Unintended Consequences of Machine Learning in Medicine. JAMA. 2017;318(6):517–518.
[12] Demner-Fushman D, Kohli MD, Rosenman MB, Shooshan SE, Rodriguez L, Antani S, et al. Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inform Assoc. 2016;23(2):304–310.
[13] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–444.
[14] He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. arXiv (cs.CV); 2015.
[15] Huang G, Liu Z, Weinberger KQ, van der Maaten L. Densely Connected Convolutional Networks. arXiv (cs.CV); 2016.
[16] Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. arXiv (cs.LG); 2016.
[17] Zech J, Pain M, Titano J, Badgeley M, Schefflein J, Su A, et al. Natural Language-based Machine Learning Models for the Annotation of Clinical Radiology Reports. Radiology. 2018;287(2).
[18] Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, et al. Automatic differentiation in PyTorch; 2017.
[19] Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning Deep Features for Discriminative Localization. arXiv (cs.CV); 2015.
[20] DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–845.
[21] Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. pROC; 2017.
[22] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825–2830.
[23] Gale W, Oakden-Rayner L, Carneiro G, Bradley AP, Palmer LJ. Detecting hip fractures with radiologist-level performance using deep neural networks. arXiv (cs.CV); 2017.
[24] Geras KJ, Wolfson S, Gene Kim S, Moy L, Cho K. High-Resolution Breast Cancer Screening with Multi-View Deep Convolutional Neural Networks. arXiv (cs.CV); 2017.
[25] Oakden-Rayner L. Exploring the ChestXray14 dataset: problems; 2017. https://lukeoakdenrayner.wordpress.com/2017/12/18/the-chestxray14-dataset-problems/