CheXternal: Generalization of Deep Learning Models for Chest X-ray Interpretation to Photos of Chest X-rays and External Clinical Settings
Pranav Rajpurkar, Anirudh Joshi, Anuj Pareek, Andrew Y. Ng, Matthew P. Lungren
Pranav Rajpurkar*, Anirudh Joshi*, Anuj Pareek*, Andrew Y. Ng, and Matthew P. Lungren
Stanford University, USA
*Authors contributed equally to this research.
Figure 1: We measured the diagnostic performance for 8 different chest X-ray models when applied to (1) smartphone photos of chest X-rays and (2) external datasets without any finetuning. All models were developed by different groups and submitted to the CheXpert challenge, and re-applied to test datasets without further tuning.
ABSTRACT
Recent advances in training deep learning models have demonstrated the potential to provide accurate chest X-ray interpretation and increase access to radiology expertise. However, poor generalization due to data distribution shifts in clinical settings is a key barrier to implementation. In this study, we measured the diagnostic performance for 8 different chest X-ray models when applied to (1) smartphone photos of chest X-rays and (2) external datasets, without any finetuning. All models were developed by different groups and submitted to the CheXpert challenge, and re-applied to test datasets without further tuning. We found that (1) on photos of chest X-rays, all 8 models experienced a statistically significant drop in task performance, but only 3 performed significantly worse than radiologists on average, and (2) on the external set, none of the models performed statistically significantly worse than radiologists, and five models performed statistically significantly better than radiologists. Our results demonstrate that some chest X-ray models, under clinically relevant distribution shifts, were comparable to radiologists while other models were not. Future work should investigate aspects of model training procedures and dataset collection that influence generalization in the presence of data distribution shifts.
CCS CONCEPTS
• Applied computing → Health informatics; • Computing methodologies → Image representations.
KEYWORDS
Generalizability, Distribution Shifts, Chest X-ray Interpretation, Radiology, Clinical Deployment
ACM Reference Format:
Pranav Rajpurkar, Anirudh Joshi, Anuj Pareek, Andrew Y. Ng, and Matthew P. Lungren. 2021. CheXternal: Generalization of Deep Learning Models for Chest X-ray Interpretation to Photos of Chest X-rays and External Clinical Settings. In ACM Conference on Health, Inference, and Learning (ACM CHIL '21), April 8–10, 2021, Virtual Event, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3450439.3451876
INTRODUCTION

Chest X-rays are the most common imaging examination in the world, critical for the diagnosis and management of many diseases. With over 2 billion chest X-rays performed globally each year, many clinics in both developing and developed countries lack sufficient trained radiologists to provide timely X-ray interpretation. Automating cognitive tasks in medical imaging interpretation with deep learning models could improve access and efficiency and augment existing workflows [18, 20, 22, 27]. However, poor generalization due to data distribution shifts in clinical settings is a key barrier to implementation.

First, a major obstacle to clinical adoption of such technologies is model deployment, an effort often frustrated by the vast heterogeneity of clinical workflows across the world [14]. Chest X-ray models are developed and validated using digital X-rays, with many deployment solutions relying on heavily integrated yet often disparate infrastructures [1, 12, 13, 17, 21, 25, 26]. One appealing solution to scaled deployment across disparate clinical frameworks is to leverage the ubiquity of smartphones. Interpretation of medical imaging via cell phone photography is an existing "store-and-forward telemedicine" approach in which one or more photos of medical imaging are captured and sent as email attachments or instant messages by practitioners to obtain second opinions from specialists in routine clinical care [7, 31]. Smartphone photographs have been shown to be of sufficient diagnostic quality to allow for medical interpretation; thus, leveraging deep learning models for automated interpretation of photos of medical imaging examinations may serve as an infrastructure-agnostic approach to deployment, particularly in resource-limited settings. However, significant technical barriers exist in automated interpretation of photos of chest X-rays. Photographs of X-rays introduce visual artifacts which are not commonly found in digital X-rays, such as altered viewing angles, variable lighting conditions, glare, moiré, rotations, translations, and blur [19]. These artifacts have been shown to reduce algorithm performance when input images are perceived through a camera [16]. The extent to which such artifacts reduce the performance of chest X-ray models has not been well investigated.

A second major obstacle to clinical adoption of chest X-ray models is that clinical deployment requires models trained on data from one institution to generalize to data from another institution [2, 14]. Early work has shown that chest X-ray models may not generalize well when externally validated on data from a different institution, and are possibly vulnerable to distribution shift stemming from changes in patient population, or may rely on non-medically relevant cues that differ between institutions [33]. However, the difference in diagnostic performance of more recent chest X-ray models on external datasets has not been investigated.

We measured the diagnostic performance of 8 different chest X-ray models when applied to (1) photos of chest X-rays, and (2) chest X-rays obtained at a different institution. Specifically, we applied these models to a dataset of smartphone photos of 668 X-rays from 500 patients, and to a set of 420 frontal chest X-rays from the ChestX-ray14 dataset collected at the National Institutes of Health Clinical Center [32]. All models were developed by different groups and submitted to the CheXpert challenge, a large public competition for digital chest X-ray analysis [10].
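The photographic artifacts listed above can be loosely simulated in software, which is useful for building intuition about this distribution shift; note that the test set used in this study consists of real smartphone photos of monitors [19], not synthetic transforms. A minimal, hypothetical sketch with torchvision (moiré and glare are hard to fake and are omitted):

```python
# Hypothetical sketch: simulating photo-of-screen artifacts on a digital
# chest X-ray. This is NOT the study's data pipeline (the study used real
# smartphone photos [19]); it only illustrates the artifact classes named
# above: viewing angle, rotation, translation, lighting, and blur.
from PIL import Image
import torchvision.transforms as T

photo_artifacts = T.Compose([
    T.RandomPerspective(distortion_scale=0.2, p=1.0),   # altered viewing angle
    T.RandomRotation(degrees=5),                        # camera rotation
    T.RandomAffine(degrees=0, translate=(0.05, 0.05)),  # camera translation
    T.ColorJitter(brightness=0.3, contrast=0.3),        # variable lighting
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),    # focus blur
])

xray = Image.open("example_cxr.png").convert("RGB")     # hypothetical file
photo_artifacts(xray).save("example_cxr_photo_sim.png")
```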
Models were evaluated on their diagnostic performance in binary classification, as measured by the Matthews correlation coefficient (MCC) [3], on the following pathologies selected in Irvin et al. [10]: atelectasis, cardiomegaly, consolidation, edema, and pleural effusion.

We found that:

(1) Comparing model performance on digital chest X-rays to photos, all 8 models experienced a statistically significant drop in task performance on photos, with an average drop of 0.036 MCC. Comparing model performance on photos to radiologist performance, three out of eight models performed significantly worse than radiologists on average, and the other five had no significant difference.

(2) On the external set (NIH), none of the models performed statistically significantly worse than radiologists. On average over the pathologies, five models performed significantly better than radiologists. On specific pathologies (consolidation, cardiomegaly, edema, and atelectasis), some models achieved significantly better performance than radiologists.

Our systematic examination of the generalization capabilities of existing models can be extended to other tasks in medical AI, and provides a framework for tracking technical readiness towards clinical translation.
We collected a test set of photos of chest X-rays, described in Phillips et al. [19]. In this set, chest X-rays from each CheXpert test study were displayed in full screen on a non-diagnostic computer monitor with a 1920-pixel-wide resolution and photographed with a smartphone.

| Metric | Comparison | Average | Pleural Effusion | Edema | Atelectasis | Consolidation | Cardiomegaly |
|---|---|---|---|---|---|---|---|
| AUC | Photos | 0.856 (0.840, 0.869) | 0.950 (0.932, 0.968) | 0.917 (0.884, 0.943) | 0.882 (0.856, 0.912) | 0.914 (0.865, 0.946) | 0.921 (0.900, 0.940) |
| AUC | Standard | 0.871 (0.855, 0.883) | 0.960 (0.944, 0.975) | 0.926 (0.892, 0.950) | 0.885 (0.858, 0.910) | 0.918 (0.879, 0.948) | 0.934 (0.914, 0.951) |
| AUC | Standard - Photos | 0.016 (0.012, 0.019) | 0.011 (0.004, 0.019) | 0.009 (0.001, 0.018) | 0.003 (-0.006, 0.013) | 0.005 (-0.009, 0.016) | 0.013 (0.006, 0.023) |
| MCC | Photos | 0.534 (0.507, 0.559) | 0.571 (0.526, 0.631) | 0.556 (0.481, 0.639) | 0.574 (0.505, 0.634) | 0.316 (0.246, 0.386) | 0.580 (0.522, 0.630) |
| MCC | Standard | 0.570 (0.543, 0.599) | 0.621 (0.575, 0.670) | 0.550 (0.474, 0.637) | 0.587 (0.529, 0.640) | 0.336 (0.264, 0.418) | 0.643 (0.584, 0.695) |
| MCC | Standard - Photos | 0.036 (0.024, 0.048) | 0.049 (0.020, 0.070) | -0.006 (-0.039, 0.033) | 0.012 (-0.016, 0.041) | 0.020 (-0.011, 0.047) | 0.063 (0.036, 0.084) |
Table 1: AUC and MCC performance of models and radiologists on the standard X-rays and the photos of chest X-rays, with 95% confidence intervals.
| Comparison | Average | Pleural Effusion | Edema | Atelectasis | Consolidation | Cardiomegaly |
|---|---|---|---|---|---|---|
| Photos | 0.534 (0.507, 0.559) | 0.571 (0.526, 0.631) | 0.556 (0.481, 0.639) | 0.574 (0.505, 0.634) | 0.316 (0.246, 0.386) | 0.580 (0.522, 0.630) |
| Radiologists | 0.568 (0.542, 0.597) | 0.671 (0.618, 0.727) | 0.507 (0.431, 0.570) | 0.548 (0.496, 0.606) | 0.359 (0.262, 0.444) | 0.566 (0.511, 0.620) |
| Radiologists - Photos | 0.035 (0.009, 0.065) | 0.099 (0.056, 0.145) | -0.049 (-0.136, 0.029) | -0.027 (-0.086, 0.050) | 0.042 (-0.056, 0.124) | -0.014 (-0.069, 0.029) |
Table 2: MCC performance of models on the photos of chest X-rays, radiologist performance, and their difference, with 95% confidence intervals.

Figure 2: MCC differences of 8 chest X-ray models on different pathologies between photos of the X-rays and the original X-rays, with 95% confidence intervals.
CheXpert used a hidden test set for official evaluation of models. Teams submitted their executable code, which was then run on a test set that was not publicly readable, to preserve the integrity of the test results. We made use of the CodaLab platform to re-run these chest X-ray models by substituting the hidden CheXpert test set with the datasets used in this study.

Figure 3: MCC differences of the same models on photos of chest X-rays compared to radiologist performance, with 95% confidence intervals.
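Although each team's code executes inside its own CodaLab bundle, the property this setup enforces is frozen-weights inference: the model is applied exactly as submitted, and only the test inputs change. A minimal sketch of that pattern follows, with an assumed DenseNet-121 backbone, checkpoint path, input size, and preprocessing (none of which are specified by the challenge):

```python
# Sketch of frozen-weights inference under distribution shift. The
# architecture, checkpoint file, input size, and preprocessing below are
# assumptions for illustration; real submissions each ship their own code.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

PATHOLOGIES = ["Atelectasis", "Cardiomegaly", "Consolidation",
               "Edema", "Pleural Effusion"]

model = models.densenet121(num_classes=len(PATHOLOGIES))
model.load_state_dict(torch.load("submission.pt", map_location="cpu"))
model.eval()  # inference only: no finetuning on photos or NIH data

preprocess = T.Compose([T.Resize((320, 320)), T.ToTensor()])

@torch.no_grad()
def predict(image_path: str) -> dict:
    """Per-pathology probabilities for one chest X-ray (or a photo of one)."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    probs = torch.sigmoid(model(x)).squeeze(0)  # multi-label sigmoid head
    return {p: probs[i].item() for i, p in enumerate(PATHOLOGIES)}
```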
Figure 4: Comparison of the average AUC of 8 individual models on photos of chest X-rays versus standard images.
Our primary evaluation metric was the Matthews correlation coefficient (MCC), a statistical rate which produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives); MCC is proportional to both the size of the positive class and the size of the negative class in the dataset [3]. We reported the average MCC of the 8 models for five pathologies, namely atelectasis, cardiomegaly, consolidation, edema, and pleural effusion. Additionally, in experiments comparing the models on standard chest X-rays to photos of chest X-rays, we reported the AUC and MCC of the models. In experiments comparing models to board-certified radiologists, we reported the difference in MCC for each of the five pathologies.
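Concretely, MCC is computed from the four confusion-matrix counts, and scikit-learn provides an equivalent implementation. In the sketch below, the 0.5 threshold is our own assumption for illustration; each submitted model selects its own operating point for binarizing its probabilities.

```python
# Illustrative MCC computation for one pathology.
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> float:
    """Binarize probabilities at an (assumed) threshold and score with MCC."""
    return matthews_corrcoef(y_true, (y_prob >= threshold).astype(int))

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])           # reference labels
y_prob = np.array([.9, .2, .7, .4, .1, .6, .8, .3])   # model probabilities
print(f"MCC: {mcc(y_true, y_prob):+.3f}")
```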
Comparing model performance on digital chest X-rays to photos, all eight models experienced a statistically significant drop in task performance on photos, with an average drop of 0.036 MCC (95% CI 0.024, 0.048) (see Figure 2, Table 1). All models had a statistically significant drop on at least one of the pathologies between the native digital images and photos. One model had a statistically significant drop in performance on three pathologies: pleural effusion, edema, and consolidation. Two models had a significant drop on two pathologies: one on pleural effusion and edema, and the other on pleural effusion and cardiomegaly. The cardiomegaly and pleural effusion tasks showed decreased performance in five and four models, respectively.
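These comparisons report 95% confidence intervals on the MCC drops. The exact interval procedure is not restated in this section; one standard choice for paired predictions on the same studies is a percentile bootstrap over studies, sketched below as an assumed implementation rather than the authors' own code:

```python
# Hedged sketch: percentile-bootstrap 95% CI for the MCC drop between
# standard X-rays and photos of the same studies (paired by study index).
# Inputs are binary numpy arrays; the resampling unit is the study.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_drop_ci(y_true, pred_standard, pred_photo, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    drops = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample studies with replacement
        drops.append(matthews_corrcoef(y_true[idx], pred_standard[idx])
                     - matthews_corrcoef(y_true[idx], pred_photo[idx]))
    return np.percentile(drops, [2.5, 97.5])
```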
Comparing the performance of models on photos to radiologist performance, three out of eight models performed significantly worse than radiologists on average, and the other five had no significant difference (see Figure 3). On specific pathologies, some models had significantly higher performance than radiologists: two models on cardiomegaly, and one model on edema. Conversely, some models had significantly lower performance than radiologists: two models on cardiomegaly, and one model on consolidation. The pathology with the greatest number of models performing significantly worse than radiologists was pleural effusion (seven models).
Our results demonstrated that while most models experienced a significant drop in performance when applied to photos of chest X-rays compared to the native digital images, their performance nonetheless remained largely equivalent to that of radiologists. Although there were 13 instances in which models had a statistically significant drop in performance on photos across the different pathologies, the models performed significantly worse than radiologists in only 6 of those 13 instances. Comparison to radiologist performance provides context for clinical applicability: several models remained comparable to the radiologist performance standard despite decreased performance on photos. Further investigation could be directed towards understanding how different model training procedures affect model generalization to photos of chest X-rays, and towards understanding the etiologies behind trends in performance changes for specific pathologies or specific artifacts.
While using photos of chest X-rays as input to chest X-ray algorithms could enable any physician with a smartphone to get instant AI algorithm assistance, the performance of chest X-ray algorithms on photos of chest X-rays has not been thoroughly investigated. Several studies have highlighted the importance of the generalizability of computer vision models under input noise [8]. Dodge and Karam [4] demonstrated that deep neural networks perform poorly compared to humans on image classification of distorted images. Geirhos et al. [6] and Schmidt et al. [24] found that convolutional neural networks trained on specific image corruptions did not generalize, and that the error patterns of network and human predictions were not similar on noisy and elastically deformed images.
We measured the change in diagnostic performance of the same eight chest X-ray models on chest X-rays obtained at a different institution. We applied these models, trained on the CheXpert dataset from Stanford Hospital, to a set of 420 frontal chest X-rays labeled as part of Rajpurkar et al. [22]. These X-rays are sourced from the ChestX-ray14 dataset collected at the National Institutes of Health Clinical Center [32], and sampled to contain at least 50 cases of each pathology according to the original labels provided in the dataset. The reference standard on this set (NIH) was determined using a majority vote of three cardiothoracic subspecialty radiologists; six board-certified radiologists were used for comparison against the models.
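As an illustration of how such a reference standard can be formed, the sketch below takes a per-study majority vote over three annotators' binary labels; the array values are made up, and the actual adjudication procedure is described in Rajpurkar et al. [22].

```python
# Hypothetical sketch: majority-vote reference standard for one pathology
# from three radiologists' binary labels (rows are studies).
import numpy as np

annotations = np.array([[1, 1, 0],   # made-up labels, shape (n_studies, 3)
                        [0, 0, 0],
                        [1, 1, 1],
                        [0, 1, 1]])
reference = (annotations.sum(axis=1) >= 2).astype(int)  # at least 2 of 3 agree
print(reference)  # -> [1 0 1 1]
```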
| Institution | Comparison | Average | Pleural Effusion | Edema | Atelectasis | Consolidation | Cardiomegaly |
|---|---|---|---|---|---|---|---|
| CheXpert | Radiologists | 0.568 (0.542, 0.597) | 0.671 (0.618, 0.727) | 0.507 (0.431, 0.570) | 0.548 (0.496, 0.606) | 0.359 (0.262, 0.444) | 0.566 (0.511, 0.620) |
| CheXpert | Models | 0.570 (0.543, 0.599) | 0.621 (0.575, 0.670) | 0.550 (0.474, 0.637) | 0.587 (0.529, 0.640) | 0.336 (0.264, 0.418) | 0.643 (0.584, 0.695) |
| CheXpert | Models - Radiologists | 0.002 (-0.028, 0.030) | -0.05 (-0.092, -0.007) | 0.043 (-0.033, 0.114) | 0.039 (-0.029, 0.106) | -0.022 (-0.104, 0.076) | 0.077 (0.040, 0.135) |
| NIH | Radiologists | 0.537 (0.515, 0.555) | 0.642 (0.590, 0.690) | 0.618 (0.549, 0.669) | 0.469 (0.423, 0.515) | 0.455 (0.385, 0.509) | 0.492 (0.443, 0.530) |
| NIH | Models | 0.578 (0.551, 0.601) | 0.673 (0.605, 0.734) | 0.662 (0.582, 0.742) | 0.529 (0.454, 0.595) | 0.551 (0.499, 0.623) | 0.517 (0.466, 0.567) |
| NIH | Models - Radiologists | 0.041 (0.010, 0.072) | 0.032 (-0.019, 0.078) | 0.044 (-0.028, 0.124) | 0.060 (-0.003, 0.126) | 0.096 (0.027, 0.155) | 0.025 (-0.028, 0.078) |
Table 3: MCC performance of models and radiologists on the CheXpert and NIH sets of chest X-rays, and their difference, with 95% confidence intervals.

Figure 5: MCC differences in performance of models on the CheXpert test set, with 95% confidence intervals (higher than 0 is in favor of the models being better).
On the external set (NIH), none of the models performed statistically significantly worse than radiologists (see Figure 6). On average over the pathologies, five models performed significantly better than radiologists. On specific pathologies, some models achieved significantly better performance than radiologists: six models on consolidation, four on edema, three on cardiomegaly, two on atelectasis, and one on pleural effusion.
Figure 6: MCC differences in performance of the same models compared to another set of radiologists across the same pathologies on an external institution's (NIH) data.
Our finding that these models perform comparably to, or at a level exceeding, radiologists differs from a previous study, which reported that a chest X-ray model failed to generalize to new populations or institutions separate from the training data, relying on institution-specific and/or confounding cues to infer the label of interest [33]. Our findings may be attributed to improvements in the generalizability of chest X-ray models owing to the larger and higher-quality datasets that have been publicly released [10, 11]. Future work should investigate the specific aspects of model training and of dataset quality and size that account for these differences, and whether self-supervised training procedures [28] increase generalizability across institutions.
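As one concrete example of such a training-procedure variation, the following is a hedged sketch of initializing a classifier from self-supervised (MoCo-style [28]) pretrained backbone weights rather than training from scratch; the checkpoint path and key layout are assumptions, and Sowrirajan et al. [28] describe the actual pretraining procedure.

```python
# Hypothetical sketch: warm-starting a chest X-ray classifier from
# self-supervised (MoCo-style [28]) pretrained backbone weights.
import torch
import torchvision.models as models

model = models.densenet121(num_classes=5)  # five evaluation pathologies
state = torch.load("moco_cxr_backbone.pt", map_location="cpu")  # assumed file
missing, unexpected = model.load_state_dict(state, strict=False)
# strict=False tolerates the freshly initialized classifier head; only the
# backbone weights come from self-supervised pretraining on chest X-rays.
print("missing keys (expected: classifier head):", missing)
```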
Figure 7: Overall change in performance of models (blue) and radiologists (orange) across CheXpert and the external institution dataset (NIH).
Comparing performances on the CheXpert and NIH test sets, we found that on the NIH dataset, models performed significantly better than radiologists in 16 instances; on the internal CheXpert test set, models performed significantly better than radiologists in 6 instances (see Figure 5). This difference may be attributed to a variety of factors, including differences in the prevalence of pathologies, or in the difficulty of identifying them, in the external test set compared to the internal set. We contextualize the generalization ability of models to external institutions by comparing their differences against a radiologist performance benchmark, rather than comparing their absolute performances, which would not control for these possible differences. For instance, when considering cardiomegaly (see Figure 7), we observe a drop in model performance, which in isolation would indicate poor generalizability. However, in light of a similar drop in radiologist performance, we may attribute the difference to differences in difficulty between the two datasets.
The purpose of this work was to systematically address the key translation challenges for chest X-ray models in clinical application to common real-world scenarios. We found that several chest X-ray models had a drop in performance when applied to smartphone photos of chest X-rays, but even with this drop, some models still performed comparably to radiologists. We also found that when models were tested on an external institution's data, they performed comparably to radiologists. Under both forms of clinically relevant distribution shift, we found that high-performance chest X-ray interpretation models trained on CheXpert produced clinically useful diagnostic performance.

Our work makes significant contributions over another investigation of chest X-ray models [23]. While that study considered the differences in AUC of models when applied to photos of X-rays, it did not (1) compare the resulting performances against radiologists, (2) investigate the drop in performances on specific tasks, or (3) analyze drops in performances of individual models across tasks. Finally, while that study compared the performance of models to radiologists on an external dataset, it did not investigate the change in performance of models between the internal dataset and the external dataset.

Strengths of our study include our systematic investigation of the generalization performance of several chest X-ray models developed by different teams. Limitations of our work include that our study is retrospective in nature; prospective studies would further advance understanding of generalization under distribution shifts. Our systematic examination of the generalization capabilities of existing models can be extended to other tasks in medical AI [5, 9, 15, 29, 30], and provides a framework for tracking technical readiness towards clinical translation.
REFERENCES
[1] Savvas Andronikou, Kieran McHugh, Nuraan Abdurahman, Bryan Khoury, Victor Mngomezulu, William E Brant, Ian Cowan, Mignon McCulloch, and Nathan Ford. 2011. Paediatric radiology seen from Africa. Part I: providing diagnostic imaging to a young population. Pediatric Radiology 41, 7 (2011), 811–825.
[2] David Chen, Sijia Liu, Paul Kingsbury, Sunghwan Sohn, Curtis B. Storlie, Elizabeth B. Habermann, James M. Naessens, David W. Larson, and Hongfang Liu. 2019. Deep learning and alternative learning strategies for retrospective real-world clinical data. npj Digital Medicine 2, 1 (Dec. 2019), 43. https://doi.org/10.1038/s41746-019-0122-0
[3] Davide Chicco and Giuseppe Jurman. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 1 (2020), 6.
[4] Samuel Dodge and Lina Karam. 2017. A Study and Comparison of Human and Deep Learning Recognition Performance under Visual Distortions. In 2017 26th International Conference on Computer Communication and Networks (ICCCN). 1–7. https://doi.org/10.1109/ICCCN.2017.8038465
[5] Tony Duan, Pranav Rajpurkar, Dillon Laird, Andrew Y. Ng, and Sanjay Basu. 2019. Clinical Value of Predicting Individual Treatment Effects for Intensive Blood Pressure Therapy: A Machine Learning Experiment to Estimate Treatment Effects from Randomized Trial Data. Circulation: Cardiovascular Quality and Outcomes 12, 3 (March 2019). https://doi.org/10.1161/CIRCOUTCOMES.118.005010
[6] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. 2019. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv:1811.12231 [cs, q-bio, stat] (Jan. 2019).
[7] Hans Goost, Johannes Witten, Andreas Heck, Dariusch R Hadizadeh, Oliver Weber, Ingo Gräff, Christof Burger, Mareen Montag, Felix Koerfer, and Koroush Kabir. 2012. Image and diagnosis quality of X-ray image transmission via cell phone camera: a project study evaluating quality and reliability. PLoS One 7, 10 (2012), e43402.
[8] Dan Hendrycks and Thomas Dietterich. 2019. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. arXiv:1903.12261 [cs, stat] (March 2019).
[9] Shih-Cheng Huang, Tanay Kothari, Imon Banerjee, Chris Chute, Robyn L Ball, Norah Borus, Andrew Huang, Bhavik N Patel, Pranav Rajpurkar, Jeremy Irvin, et al. 2020. PENet—A scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric CT imaging. npj Digital Medicine 3, 1 (2020), 1–9.
[10] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. 2019. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 2019), 590–597. https://doi.org/10.1609/aaai.v33i01.3301590
[11] Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark, and Steven Horng. 2019. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6, 1 (Dec. 2019), 317. https://doi.org/10.1038/s41597-019-0322-0
[12] K. Kallianos, J. Mongan, S. Antani, T. Henry, A. Taylor, J. Abuya, and M. Kohli. 2019. How far have we come? Artificial intelligence for chest radiograph interpretation. Clinical Radiology 74, 5 (May 2019), 338–345. https://doi.org/10.1016/j.crad.2018.12.015
[13] Satyananda Kashyap, Mehdi Moradi, Alexandros Karargyris, Joy T. Wu, Michael Morris, Babak Saboury, Eliot Siegel, and Tanveer Syeda-Mahmood. 2019. Artificial intelligence for point of care radiograph quality assessment. In Medical Imaging 2019: Computer-Aided Diagnosis, Vol. 10950. International Society for Optics and Photonics, 109503K. https://doi.org/10.1117/12.2513092
[14] Christopher J. Kelly, Alan Karthikesalingam, Mustafa Suleyman, Greg Corrado, and Dominic King. 2019. Key challenges for delivering clinical impact with artificial intelligence. BMC Medicine 17, 1 (Dec. 2019), 195. https://doi.org/10.1186/s12916-019-1426-2
[15] Amirhossein Kiani, Bora Uyumazturk, Pranav Rajpurkar, Alex Wang, Rebecca Gao, Erik Jones, Yifan Yu, Curtis P. Langlotz, Robyn L. Ball, Thomas J. Montine, Brock A. Martin, Gerald J. Berry, Michael G. Ozawa, Florette K. Hazard, Ryanne A. Brown, Simon B. Chen, Mona Wood, Libby S. Allard, Lourdes Ylagan, Andrew Y. Ng, and Jeanne Shen. 2020. Impact of a deep learning assistant on the histopathologic classification of liver cancer. npj Digital Medicine 3, 1 (Dec. 2020), 23. https://doi.org/10.1038/s41746-020-0232-8
[16] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. 2016. Adversarial examples in the physical world. CoRR abs/1607.02533 (2016). arXiv:1607.02533
[17] Paras Lakhani and Baskaran Sundaram. 2017. Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks. Radiology 284, 2 (2017), 574–582.
[18] Radiology.
[19] Nick A. Phillips, Pranav Rajpurkar, Mark Sabini, et al. 2020. CheXphoto: 10,000+ Photos and Transformations of Chest X-rays for Benchmarking Deep Learning Robustness.
[20] BioMedical Engineering OnLine 17, 1 (Aug. 2018), 113. https://doi.org/10.1186/s12938-018-0544-y
[21] Zhi Zhen Qin, Melissa S. Sander, Bishwa Rai, Collins N. Titahong, Santat Sudrungrot, Sylvain N. Laah, Lal Mani Adhikari, E. Jane Carter, Lekha Puri, Andrew J. Codlin, and Jacob Creswell. 2019. Using artificial intelligence to read chest radiographs for tuberculosis detection: A multi-site evaluation of the diagnostic accuracy of three deep learning systems. Scientific Reports 9, 1 (Oct. 2019), 1–10. https://doi.org/10.1038/s41598-019-51503-3
[22] Pranav Rajpurkar, Jeremy Irvin, Robyn L. Ball, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis P. Langlotz, Bhavik N. Patel, Kristen W. Yeom, Katie Shpanskaya, Francis G. Blankenberg, Jayne Seekins, Timothy J. Amrhein, David A. Mong, Safwan S. Halabi, Evan J. Zucker, Andrew Y. Ng, and Matthew P. Lungren. 2018. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLOS Medicine 15, 11 (Nov. 2018), e1002686. https://doi.org/10.1371/journal.pmed.1002686
[23] Pranav Rajpurkar, Anirudh Joshi, Anuj Pareek, Phil Chen, Amirhossein Kiani, Jeremy Irvin, Andrew Y. Ng, and Matthew P. Lungren. 2020. CheXpedition: Investigating Generalization Challenges for Translation of Chest X-Ray Algorithms to the Clinical Setting. arXiv:2002.11379 [eess.IV]
[24] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. 2018. Adversarially Robust Generalization Requires More Data. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 5014–5026.
[25] Adam B Schwartz, Gina S Siddiqui, John L Barbieri, Amana F Akhtar, Woojin K Kim, Ryan A Littman-Quinn, Emily S Conant, Narainder D Gupta, Bryan A Pukenas, Parvati H Ramchandani, et al. 2014. The accuracy of mobile teleradiology in the evaluation of chest X-rays. Journal of Telemedicine and Telecare (Oct. 2014).
[26] George Shih, Carol C. Wu, Safwan S. Halabi, Marc D. Kohli, Luciano M. Prevedello, Tessa S. Cook, Arjun Sharma, Judith K. Amorosa, Veronica Arteaga, Maya Galperin-Aizenberg, Ritu R. Gill, Myrna C.B. Godoy, Stephen Hobbs, Jean Jeudy, Archana Laroia, Palmi N. Shah, Dharshan Vummidi, Kavitha Yaddanapudi, and Anouk Stein. 2019. Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia. Radiology: Artificial Intelligence 1, 1 (Jan. 2019), e180041. https://doi.org/10.1148/ryai.2019180041
[27] Ramandeep Singh, Mannudeep K. Kalra, Chayanin Nitiwarangkul, John A. Patti, Fatemeh Homayounieh, Atul Padole, Pooja Rao, Preetham Putha, Victorine V. Muse, Amita Sharma, and Subba R. Digumarthy. 2018. Deep learning in chest radiography: Detection of findings and presence of change. PLoS ONE 13, 10 (Oct. 2018). https://doi.org/10.1371/journal.pone.0204155
[28] Hari Sowrirajan, Jingbo Yang, Andrew Y. Ng, and Pranav Rajpurkar. 2020. MoCo Pretraining Improves Representation and Transferability of Chest X-ray Models. arXiv:2010.05352 [cs.CV]
[29] Eric J. Topol. 2019. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine 25, 1 (Jan. 2019), 44–56. https://doi.org/10.1038/s41591-018-0300-7
[30] Maya Varma, Mandy Lu, Rachel Gardner, Jared Dunnmon, Nishith Khandwala, Pranav Rajpurkar, Jin Long, Christopher Beaulieu, Katie Shpanskaya, Li Fei-Fei, Matthew P. Lungren, and Bhavik N. Patel. 2019. Automated abnormality detection in lower extremity radiographs using deep learning. Nature Machine Intelligence 1, 12 (Dec. 2019), 578–583. https://doi.org/10.1038/s42256-019-0126-0
[31] DJ Vassallo, PJ Buxton, JH Kilbey, and M Trasler. 1998. The first telemedicine link for the British Forces. Journal of the Royal Army Medical Corps.
[32] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. 2017. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2097–2106.
[33] John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, and Eric Karl Oermann. 2018. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine 15, 11 (Nov. 2018).