VisualCheXbert: Addressing the Discrepancy Between Radiology Report Labels and Image Labels
Saahil Jain∗ (Stanford University, USA), Akshay Smit∗ (Stanford University, USA), Steven QH Truong (VinBrain, Vietnam), Chanh DT Nguyen (VinBrain, Vietnam), Minh-Thanh Huynh (VinBrain, Vietnam), Mudit Jain (USA), Victoria A. Young (Stanford University, USA), Andrew Y. Ng (Stanford University, USA), Matthew P. Lungren† (Stanford University, USA), Pranav Rajpurkar† (Stanford University, USA)
Figure 1: The VisualCheXbert training procedure. VisualCheXbert uses a biomedically-pretrained BERT model to directly map from a radiology report to the labels obtained by a radiologist interpreting the associated X-ray image. The training procedure for VisualCheXbert is supervised by a computer vision model trained to detect medical conditions from chest X-ray images.
ABSTRACT
Automatic extraction of medical conditions from free-text radiology reports is critical for supervising computer vision models to interpret medical images. In this work, we show that radiologists labeling reports significantly disagree with radiologists labeling corresponding chest X-ray images, which reduces the quality of report labels as proxies for image labels. We develop and evaluate methods to produce labels from radiology reports that have better agreement with radiologists labeling images. Our best performing method, called VisualCheXbert, uses a biomedically-pretrained BERT model to directly map from a radiology report to the image labels, with a supervisory signal determined by a computer vision model trained to detect medical conditions from chest X-ray images. We find that VisualCheXbert outperforms an approach using an existing radiology report labeler by an average F1 score of 0.14 (95% CI 0.12, 0.17). We also find that VisualCheXbert better agrees with radiologists labeling chest X-ray images than do radiologists labeling the corresponding radiology reports by an average F1 score across several medical conditions of between 0.12 (95% CI 0.09, 0.15) and 0.21 (95% CI 0.18, 0.24).
∗ Equal Contribution
† Equal Contribution
CCS CONCEPTS
• Computing methodologies → Natural language processing; Information extraction.

KEYWORDS
natural language processing, BERT, medical report labeling, chest X-ray diagnosis
ACM Reference Format:
Saahil Jain, Akshay Smit, Steven QH Truong, Chanh DT Nguyen, Minh-Thanh Huynh, Mudit Jain, Victoria A. Young, Andrew Y. Ng, Matthew P. Lungren, and Pranav Rajpurkar. 2021. VisualCheXbert: Addressing the Discrepancy Between Radiology Report Labels and Image Labels. In ACM Conference on Health, Inference, and Learning (ACM CHIL '21), April 8–10, 2021, Virtual Event, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3450439.3451862
Because manually annotating a large number of medical images is costly [1, 4, 6, 11, 17, 27, 31], an appealing solution is the use of automatic labelers to extract labels from medical text reports that accompany the images. On the task of chest X-ray interpretation, high-performing vision models have been successfully trained [22, 24–26, 30, 33] on large, publicly available chest X-ray datasets [13, 14, 23, 32] labeled by automated radiology report labelers [13, 16, 20, 28]. However, training these vision models on labels obtained from reports assumes that the report labels are good proxies for image labels. Prior work has found that report labels may not accurately reflect the visual content of medical images [18, 19, 29]. We investigate this assumption in the setting of automated chest X-ray labeling and develop methods to produce labels from radiology reports that better agree with radiologists labeling the corresponding X-ray images. Our primary contributions are:

(1) We quantify the agreement between radiologists labeling reports and radiologists labeling images across several medical conditions. We find that there is significant disagreement between board-certified radiologists when labeling a chest X-ray image and when labeling the corresponding radiology report.

(2) Upon board-certified radiologist review of examples of disagreements between radiologists labeling reports and radiologists labeling images, we find various reasons for disagreement related to (a) label hierarchy relationships, (b) access to clinical history, (c) the use of the Impression and Findings sections of radiology reports, and (d) the inherent noise of the labeling task.

(3) We find many significant relationships between the presence of conditions labeled using reports and the presence of conditions labeled using images. We report and clinically interpret various radiology report labels that increase (or decrease) the odds of particular conditions in an image with statistical significance.

(4) We learn to map textual radiology reports directly to the X-ray image labels. Our best performing method, called VisualCheXbert, uses a biomedically-pretrained BERT model to directly map from a radiology report to the image labels. We find that VisualCheXbert better agrees with radiologists labeling chest X-ray images than do radiologists labeling the corresponding radiology reports by an average F1 score across several medical conditions of between 0.12 (95% CI 0.09, 0.15) and 0.21 (95% CI 0.18, 0.24). We also find that VisualCheXbert outperforms an approach using the CheXpert radiology report labeler [13] by an average F1 score of 0.14 (95% CI 0.12, 0.17).

We expect that our methods of addressing the discrepancy between medical report labels and image labels are broadly useful across the medical domain and may facilitate the development of improved medical imaging models.
We made use of two large publicly available datasets of chest X-rays: CheXpert [13] and MIMIC-CXR [14]. For both datasets, we use the Impression section of the radiology reports, which summarizes the key findings in the radiographic study. Each of the X-rays in these datasets was labeled for 14 commonly occurring medical conditions. CheXpert consists of 224,316 chest radiographs, with labels generated from the corresponding radiology report impression by the automatic, rules-based CheXpert labeler. Given a radiology report impression as input, the CheXpert labeler labels each medical condition (except "No Finding") as "positive", "negative", "uncertain" or "blank". A "blank" label is produced by the CheXpert labeler if the condition was not mentioned at all in the report impression. If the condition was mentioned but its presence was negated, a "negative" label is produced. If the condition was mentioned but its presence was uncertain, an "uncertain" label is produced. For "No Finding", the CheXpert labeler only produces "positive" or "blank" labels. "No Finding" is only labeled as "positive" if no medical abnormality whatsoever was mentioned in the report impression. The MIMIC-CXR dataset consists of 377,110 chest X-rays and their corresponding radiology reports, and it has also been labeled by the CheXpert labeler.

The CheXpert dataset contains a separate set of 200 chest X-ray studies called the "CheXpert validation set" and another set of 500 chest X-ray studies called the "CheXpert test set". The CheXpert validation set is labeled by the majority vote of 3 board-certified radiologists examining the X-ray images and labeling each of the 14 conditions as "positive" or "negative", similar to the image ground truth on the CheXpert test set, which is described below. No radiologist report labels are obtained for the validation set. The CheXpert test set, which was collected by Irvin et al. [13], is labeled by radiologists in two distinct ways.

Image ground truth. Board-certified radiologists examined the X-ray images, without access to the corresponding reports, and labeled each of the 14 conditions as "positive" or "negative"; the majority vote of their annotations serves as the image ground truth.
Table 1: Agreement between radiologists looking at reports and radiologists looking at the corresponding X-ray images. The high and low scores are obtained by mapping uncertain labels in the radiologist report labels to the image ground truth labels and the opposite of the image ground truth labels respectively.

Condition | Low F1 | High F1 | Low Kappa | High Kappa
Atelectasis (n=153) | 0.230 | 0.595 | -0.014 | 0.457
Cardiomegaly (n=151) | 0.422 | 0.463 | 0.290 | 0.344
Edema (n=78) | 0.453 | 0.581 | 0.335 | 0.492
Pleural Effusion (n=104) | 0.638 | 0.710 | 0.511 | 0.613
Enlarged Cardiom. (n=253) | 0.089 | 0.208 | -0.053 | 0.097
Lung Opacity (n=264) | 0.683 | 0.686 | 0.401 | 0.405
Support Devices (n=261) | 0.863 | 0.863 | 0.737 | 0.737
No Finding (n=62) | 0.381 | 0.381 | 0.292 | 0.292
Average | 0.470 | 0.561 | 0.312 | 0.430
Weighted Average | 0.492 | 0.575 | 0.320 | 0.427
Radiologist report labels.
A board-certified radiologist looked at each radiology report impression corresponding to the X-rays and labeled each of the 14 conditions as being "positive", "negative", "uncertain", or "blank". This radiologist did not observe any X-ray images. A condition was labeled as "blank" if it was not at all mentioned in the report impression. If the condition was mentioned but its presence in the chest X-ray was negated, then the condition was labeled as "negative". If the condition was mentioned but its presence was uncertain, it was labeled as "uncertain".
We only evaluate our models on medical conditions for which at least 50 out of the 500 chest X-ray studies in the CheXpert test set were marked positive by the radiologists labeling the X-ray images (image ground truth). These conditions, which we refer to as the evaluation conditions, are: Atelectasis, Cardiomegaly, Edema, Pleural Effusion, Enlarged Cardiomediastinum, Lung Opacity, Support Devices, and No Finding. We evaluate models using the average and weighted average of the F1 score across conditions on the CheXpert test set with the image ground truth. To compute the weighted average, each condition is weighted by the portion of positive labels for that condition in the CheXpert test set.
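As a concrete reference for these metrics, the following is a minimal sketch (ours, not the authors' released evaluation code) of the average and prevalence-weighted average F1. It assumes y_true and y_pred are dictionaries mapping each evaluation condition to binary label arrays over the 500 test studies, and that the prevalence weights are normalized to sum to one.

```python
import numpy as np
from sklearn.metrics import f1_score

def average_f1(y_true, y_pred):
    """Unweighted and prevalence-weighted average F1 across the evaluation conditions."""
    conditions = list(y_true.keys())
    f1s = np.array([f1_score(y_true[c], y_pred[c]) for c in conditions])
    # Each condition is weighted by the fraction of positive image ground truth labels.
    prevalence = np.array([np.mean(y_true[c]) for c in conditions])
    weights = prevalence / prevalence.sum()
    return float(f1s.mean()), float(np.dot(weights, f1s))
```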
We first investigate the extent of the disagreement between board-certified radiologists when labeling a chest X-ray image and when labeling the corresponding radiology report.
We compute the level of agreement between radiologists labeling X-ray images and radiologists labeling the corresponding radiology reports on the CheXpert test set. The CheXpert test set contains a set of labels from radiologists labeling X-ray images as well as another set of labels from radiologists labeling the corresponding radiology reports. Using the labels from X-ray images as the ground truth, we compute Cohen's Kappa [5] as well as the F1 score to measure the agreement between these two sets of labels. To compare the radiologist report labels to the image ground truth labels, we convert the radiologist report labels to binary labels as follows. We map the blank labels produced for the radiology report to negative labels. We map uncertain labels to either the image ground truth label or the opposite of the image ground truth label, and we record the results for both these strategies to obtain "Low F1", "High F1", "Low Kappa", and "High Kappa" scores. The low and high scores represent the most pessimistic and optimistic mapping of the uncertainty labels.
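The following is a minimal sketch (an assumed implementation, not the authors' code) of this agreement computation for one condition; the integer encoding of the report labels is an assumption.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score

POS, NEG, UNC, BLANK = 1, 0, -1, -2  # assumed integer encoding of the report labels

def agreement_scores(report_labels, image_labels):
    """High/low F1 and Cohen's Kappa between report labels and binary image labels."""
    report = np.where(report_labels == BLANK, NEG, report_labels)  # blanks -> negative
    high = np.where(report == UNC, image_labels, report)           # optimistic mapping of uncertains
    low = np.where(report == UNC, 1 - image_labels, report)        # pessimistic mapping of uncertains
    return {
        "High F1": f1_score(image_labels, high),
        "Low F1": f1_score(image_labels, low),
        "High Kappa": cohen_kappa_score(image_labels, high),
        "Low Kappa": cohen_kappa_score(image_labels, low),
    }
```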
We find that there is significant disagreement, which is indicated by low Kappa and F1 scores for almost all conditions evaluated. For example, Enlarged Cardiomediastinum and No Finding have a relatively small "High Kappa" score of 0.097 and 0.292 and a "High F1" score of 0.208 and 0.381, indicating high levels of disagreement even when assuming the most optimistic mapping of the uncertainty labels. Atelectasis, Cardiomegaly, Edema, Pleural Effusion, and Lung Opacity also have a low "High Kappa" score of 0.457, 0.344, 0.492, 0.613, and 0.405 respectively and a "High F1" score of 0.595, 0.463, 0.581, 0.710, and 0.686 respectively. Support Devices has the highest Kappa score, with a "High Kappa" of 0.737, and the highest F1 score, with a "High F1" of 0.863. The average Kappa score is between 0.312 and 0.430, and the average F1 score is between 0.470 and 0.561. The low and high F1 / Kappa scores for the evaluation conditions are shown in Table 1.
We investigate why there is disagreement between board-certified radiologists when labeling a chest X-ray image and when labeling the corresponding radiology report.
A board-certified radiologist was given access to the chest X-ray image, the full radiology report, the radiology report impression section, the image ground truth across all conditions, and the radiologist report labels across all conditions for each of the 500 examples in the CheXpert test set. The radiologist then explained examples where radiologists labeling reports disagree with radiologists labeling X-ray images. We also calculated the counts of disagreements between radiologists labeling reports and radiologists labeling X-ray images for each condition on the CheXpert test set. A board-certified radiologist explained why there were large numbers of disagreements on certain conditions.
We find various reasons why radiologists labeling reports might disagree with radiologists labeling images. First, there is a difference between the setup of the report labeling and image labeling tasks related to the label hierarchy. On the report labeling task on the CheXpert test set, radiologists were instructed to label only the most specific condition as positive and leave parent conditions blank. For example, although Lung Opacity is a parent condition of Edema, a radiologist marking a report as positive for Edema would leave Lung Opacity blank. Blank report labels are typically mapped to negative image labels. However, radiologists labeling images label each condition as positive or negative independent of the presence of other conditions. Second, radiologists labeling reports have access to clinical report history, which biases radiologists towards reporting certain conditions in reports, while a radiologist labeling the image may not observe the condition on the image. Busby et al. [3] explain biases from clinical history in terms of framing bias, where the presentation of the clinical history can lead to different diagnostic conclusions, and attribution bias, where information in the clinical history can lead to different diagnostic conclusions. Third, radiologists labeling reports were only given access to the report impression section when labeling the CheXpert test set. Sometimes, conditions are mentioned in the Findings section of the report but not mentioned in the Impression section. This results in more negative labels when radiologists looked at reports. For chest CT scan reports, Gershanik et al. [10] also find that a condition mentioned in the Findings section is not always mentioned in the Impression section of the report. Fourth, labeling images and reports is inherently noisy to a certain extent, resulting in disagreement. Drivers of noise include mistakes on the part of radiologists labeling reports and radiologists labeling images, uncertainty regarding the presence of a condition based on an image or report, and different thresholds for diagnosing conditions as positive among radiologists. Brady et al. [2] describe additional factors that contribute to discrepancies in radiologist interpretations, including radiologist-specific causes of error like under-reading as well as system issues like excess workload.

Next, we explain the counts of the largest disagreements between radiologists labeling reports and radiologists labeling images. Out of the 500 examples on the CheXpert test set, there were 223 examples where the image was labeled positive while the report was labeled negative for Enlarged Cardiomediastinum. We hypothesize that this results from the difference in the task setup related to the label hierarchy. Since Enlarged Cardiomediastinum is a parent condition of Cardiomegaly, radiologists labeling reports were instructed to leave Enlarged Cardiomediastinum blank if they labeled Cardiomegaly positive. There were 101 examples where the image was labeled positive while the report was labeled negative for Cardiomegaly. Diagnosis of cardiomegaly on chest radiographs can depend on patient positioning and clinical history. Further, particularly in the ICU setting in which multiple consecutive radiographs are taken, cardiomegaly is not consistently described in the report even when present unless a clinically significant change is observed (e.g. pericardial effusion). There were 100 examples where the image was labeled positive while the report was labeled negative for Lung Opacity. We hypothesize that this results from the difference in task setup related to label hierarchy, as Lung Opacity is a parent condition. Further, particularly in the setting of atelectasis, lung opacity may not have risen to clinical relevance for the reporting radiologist despite being seen on the isolated imaging task. There were 65 examples where the image was labeled negative while the report was labeled positive for Pleural Effusion. We hypothesize that this partially results from both the variant thresholds for diagnosis of pleural effusion among radiologists and the clinical setting in which the reporting radiologist has access to prior films. It was common to see the report state "decreased" or "trace residual" effusion due to the context of prior imaging on that patient. However, in the isolated image labeling task, the perceived likelihood of the condition fell below the threshold of a board-certified radiologist. There were 49 examples where the image was labeled negative while the report was labeled positive for Edema. Similar to the effusion example, clinical context and prior imaging played a role in these discrepancies as, again, diagnoses were carried forward from prior studies and language such as "some residual" or "nearly resolved" in the report was used to indicate the presence of edema based on the clinical context. However, when labeling the corresponding image in isolation, the presence of edema fell below the threshold of a board-certified radiologist. Table 2 contains specific examples of these disagreements with clinical explanations.

Table 2: Clinical explanations of disagreements between radiologists looking at reports and radiologists looking at images on the CheXpert test set. Given access to the X-ray image, the full radiology report, the radiology report impression, the radiology report labels, and the image ground truth, a board-certified radiologist explained disagreements between radiologist report labels and the image ground truth. We show select examples with explanations in this table.

Example 1. Report impression: "1. single ap upright view of the chest showing a mildly increased opacity at the left lung base that could represent atelectasis versus consolidation."
Condition: Cardiomegaly. Radiologist report label: Negative. Image ground truth: Positive.
Clinical explanation: The radiologist looking at the report marks Cardiomegaly as negative as it is not mentioned in the report. Since the image is an Intensive Care Unit (ICU) film and cardiomegaly is not a clinically relevant condition for the population selected for in ICU films, the presence of cardiomegaly was never mentioned in the report, resulting in the discrepancy between radiologists looking at the report and radiologists looking at the image.

Example 2. Report impression: "1. pulmonary vascular congestion. left lower lobe opacity compatible with atelectasis and/or consolidation."
Condition: Cardiomegaly. Radiologist report label: Negative. Image ground truth: Positive.
Clinical explanation: Although cardiomegaly was mentioned in the radiology report "Findings" section, cardiomegaly was not mentioned in the report "Impression". Since the radiologist looking at the report only had access to the "Impression" section, they labeled Cardiomegaly as negative when it was actually present in the image.

Example 3. Report impression: "1. decreased pulmonary edema. stable bilateral pleural effusions and bibasilar atelectasis."
Condition: Edema. Radiologist report label: Positive. Image ground truth: Negative.
Clinical explanation: The phrase "decreased pulmonary edema" shows that the radiologist writing the report had relevant clinical context, as the edema has "decreased" compared to a previous report or image. However, the radiologist looking at the image does not have this clinical context, resulting in a discrepancy.

Example 4. Report impression: "1. single frontal radiograph of the chest is limited secondary to poor inspiration and rotation. 2. cardiac silhouette is partially obscured secondary to rotation. lungs demonstrate bibasilar opacities, likely reflecting atelectasis. possible small right pleural effusion. no pneumothorax. 3. visualized osseous structures and soft tissues unremarkable."
Condition: Pleural Effusion. Radiologist report label: Positive. Image ground truth: Negative.
Clinical explanation: The phrase "possible small right pleural effusion" indicates the uncertainty regarding the presence of pleural effusion. This natural uncertainty may explain the disagreement between radiologists looking at the image and radiologists looking at the report. On review, it was noted that pleural effusion was borderline in this example.

Example 5. Report impression: "1. crowding of the pulmonary vasculature. cannot exclude mild interstitial pulmonary edema. 2. no focal air space consolidation. the cardiomediastinal silhouette appears grossly within normal limits."
Condition: Pleural Effusion. Radiologist report label: Negative. Image ground truth: Positive.
Clinical explanation: Upon review by a board-certified radiologist, there was an error in the radiology report, which did not mention the presence of pleural effusion. The error in the report itself may explain the disagreement between the image and report labels.
Table 3 shows the counts of disagreements between radiologists labeling reports and radiologists labeling images by condition.
To determine whether there are significant relationships between conditions labeled from reports and conditions labeled from images, we learn a mapping from the output of radiologists labeling reports to the output of radiologists labeling images. We then analyze the significant relationships implied by this mapping from a clinical perspective.
Table 3: Counts of disagreements by condition between radiologists labeling reports and radiologists labeling the corresponding X-ray images on the CheXpert test set. The first column reports the number of times the image ground truth was positive, while the radiologist report label was negative. The second column reports the number of times the image ground truth was negative, while the radiologist report label was positive.

Condition | Positive on image, negative on report | Negative on image, positive on report
No Finding | 38 | 40
Enlarged Cardiom. | 223 | 5
Cardiomegaly | 101 | 15
Lung Opacity | 100 | 50
Lung Lesion | 2 | 12
Edema | 26 | 49
Consolidation | 16 | 17
Pneumonia | 6 | 5
Atelectasis | 75 | 31
Pneumothorax | 1 | 13
Pleural Effusion | 11 | 65
Pleural Other | 3 | 15
Fracture | 3 | 21
Support Devices | 53 | 13
We train logistic regression models to map the radiologist report labels for all conditions to the image ground truth for each of the evaluation conditions. We quantitatively measure the relationship between the radiologist report labels and the image ground truth by obtaining odds ratios from the coefficients of these logistic regression models. We review the odds ratios from these models with a board-certified radiologist to understand how particular radiologist report labels might clinically change the odds of image labels.
We one-hot encode the radiologist report labels and provide these binary variables as inputs to a logistic regression model. For example, the "Atelectasis Positive" variable is 1 if the radiologist labels Atelectasis as positive on the report and 0 otherwise. Similarly, the "Atelectasis Negative" variable is 1 if the radiologist labels Atelectasis as negative on the report and 0 otherwise. The same logic applies to the "Atelectasis Uncertain" variable as well as the other variables for each condition. We then train the logistic regression model with L1 regularization (α = 0.5) on the CheXpert test set using the one-hot encoded radiologist report labels (for all conditions) as input and the image ground truth for a condition as output. In total, we train a different logistic regression model to map the radiologist report labels to binary image labels for each of the 8 evaluation conditions. We compute odds ratios by exponentiating the coefficients of the logistic regression models.
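A minimal sketch of this procedure for one target condition is shown below (our assumed implementation, using scikit-learn as a stand-in; the C value is an assumption, since scikit-learn parameterizes regularization strength through C rather than the α reported above).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_odds_ratios(report_df, image_labels):
    """report_df: one column per condition with values in
    {"positive", "negative", "uncertain", "blank"}; image_labels: binary array."""
    X = pd.get_dummies(report_df)  # e.g. an "Atelectasis_positive" 0/1 column
    model = LogisticRegression(penalty="l1", solver="liblinear", C=2.0)  # C is assumed
    model.fit(X, image_labels)
    # Exponentiating a coefficient gives the odds ratio for the corresponding report label.
    return pd.Series(np.exp(model.coef_[0]), index=X.columns)
```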
After training the logistic regression models, we find that particular radiology report labels increased (or decreased) the odds of particular conditions in an image with statistical significance (P < 0.05). As expected, we find that radiology report labels associated with a condition increase the odds of that same condition in the image; for example, a Cardiomegaly positive report label increases the odds of Cardiomegaly in the image. We also find that the regression model corrects for label hierarchy. A Cardiomegaly positive report label increases the odds of Enlarged Cardiomediastinum (the parent of Cardiomegaly) on the image by 9.6 times. We similarly observe the model correcting for the label hierarchy of Lung Opacity. Radiology report labels of Edema positive, Consolidation positive, and Atelectasis positive, which all correspond to child conditions of Lung Opacity, increase the odds of Lung Opacity. We also find that the model maps particular uncertainties in report labels to the presence of a condition in the image. For example, Atelectasis uncertain report labels and Edema uncertain report labels increase the odds of Lung Opacity by 2.9 and 7.9 times respectively.

Next, we find that the model maps positive report labels to the presence of other conditions in the image. A Pleural Effusion positive report label increases the odds of Lung Opacity by 4.4 times. We hypothesize that this results from co-occurrence between Pleural Effusion and child conditions of Lung Opacity such as Atelectasis and Edema. Pleural effusion physiologically often leads to adjacent lung collapse, atelectasis, and is often seen in physiologic fluid overload conditions, edema. We find that an Atelectasis positive report label decreases the odds of Support Devices in the image by 0.28 times. On the patient population who have support devices, many of whom are in an Intensive Care Unit (ICU) setting, it is not clinically useful for radiologists to comment on the presence of atelectasis on reports, as they would rather focus on more clinically relevant changes. This may explain the mechanism by which the presence of atelectasis in a report signals that there are no support devices in the image. We find that a Fracture positive report label decreases the odds of Support Devices by 0.17 times. We hypothesize that this results from a negative co-occurrence between Fractures and Support Devices, as the two observations select for different patient populations: X-rays for fractures are often done in the Emergency Department (ED) or other outpatient settings rather than the ICU setting. We find that an Edema positive report label increases the odds of Enlarged Cardiomediastinum on the image by 2.1 times. This may be explained by the fact that Edema and Enlarged Cardiomediastinum often co-occur in a clinical setting, as they can both be caused by congestive heart failure. Lastly, we find that a Support Devices positive report label decreases the odds of No Finding in the image by 0.03 times. This may be explained by the fact that patients with support devices are usually in the ICU setting and sick with other pathologies. We visualize these statistically significant odds ratios for each type of radiologist report label (such as "Atelectasis Negative") as a factor for the presence of an evaluation condition in the X-ray image in Figure 2.
We map the output of an automated radiology report labeler to X-ray image labels using simple uncertainty handling strategies.
Table 4: F1 scores obtained by the Zero-One and LogReg baselines, evaluated on the CheXpert test set. The weighted average is weighted by prevalence.

Condition | Zero-One Baseline | LogReg Baseline
Atelectasis (n=153) | 0.52 | 0.63
Cardiomegaly (n=151) | 0.46 | 0.56
Edema (n=78) | 0.53 | 0.47
Pleural Effusion (n=104) | 0.65 | 0.65
Enlarged Cardiom. (n=253) | 0.20 | 0.67
Lung Opacity (n=264) | 0.69 | 0.81
Support Devices (n=261) | 0.85 | 0.84
No Finding (n=62) | 0.39 | 0.55
Average | 0.54 | 0.65
Weighted Average | 0.56 | 0.70

For a baseline approach, we naively map labels obtained from running the CheXpert labeler on the radiology report impressions to X-ray image labels. The CheXpert labeler is an automatic, rules-based radiology report labeler [13]. The labels produced by the CheXpert labeler include 4 classes per medical condition (positive, negative, uncertain, and blank). Since the image ground truth only has positive or negative labels per condition, we must map the labels produced by the CheXpert labeler to binary labels. We map the blank labels produced by the CheXpert labeler to negative labels. We do not change the positive and negative labels produced by the CheXpert labeler. To handle the uncertain labels, we use the two common uncertainty handling strategies in Irvin et al. [13]: we map the uncertain labels to either all negative labels (zeros-uncertainty handling strategy) or all positive labels (ones-uncertainty handling strategy). We record the F1 score from the better performing strategy on the CheXpert test set, using as ground truth the labels provided by radiologists labeling X-ray images (image ground truth). We refer to this method as the Zero-One Baseline. Since we only report the maximum of the zeros-uncertainty handling strategy and the ones-uncertainty handling strategy, the F1 scores for the Zero-One Baseline represent the most optimistic global mapping of the uncertainty labels for this method.
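A minimal sketch of the Zero-One Baseline for a single condition (our assumed implementation; the integer label encoding is an assumption):

```python
import numpy as np
from sklearn.metrics import f1_score

POS, NEG, UNC, BLANK = 1, 0, -1, -2  # assumed integer encoding of the CheXpert labeler output

def zero_one_baseline_f1(labeler_output, image_labels):
    """Blanks -> negative; uncertains -> all 0 or all 1; keep the better-scoring mapping."""
    base = np.where(labeler_output == BLANK, NEG, labeler_output)
    zeros = np.where(base == UNC, 0, base)  # zeros-uncertainty handling strategy
    ones = np.where(base == UNC, 1, base)   # ones-uncertainty handling strategy
    return max(f1_score(image_labels, zeros), f1_score(image_labels, ones))
```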
We find that the average and weighted average F1 scores across the evaluation conditions for the Zero-One Baseline are 0.54 and 0.56 respectively, which are in between the average / weighted average "Low F1" and "High F1" scores for radiologists labeling reports (see Table 1). This indicates that the Zero-One Baseline is not strictly better or worse than radiologists labeling reports, who we previously show to have poor agreement with radiologists labeling images. The Zero-One Baseline F1 scores for Atelectasis, Cardiomegaly, Edema, Pleural Effusion, and Enlarged Cardiomediastinum are 0.52, 0.46, 0.53, 0.65, and 0.20 respectively, which are all between the respective "Low F1" and "High F1" scores for radiologists labeling reports. The Zero-One Baseline F1 scores for Lung Opacity and No Finding are 0.69 and 0.39 respectively, which are slightly higher (∼0.01 difference) than the respective "High F1" scores for radiologists labeling reports. Similarly, the Zero-One Baseline F1 score for Support Devices is 0.85, which is slightly lower (∼0.01 difference) than the "Low F1" Support Devices score for radiologists labeling reports. The F1 scores for the Zero-One Baseline across the evaluation conditions are shown in Table 4.
Figure 2: Odds ratios for radiologist report labels as factors for the presence of a condition in the X-ray image. We map the radiologist report labels across all conditions to the image ground truth using a logistic regression model. We obtain odds ratios for the input variables, which are the one-hot encoded radiologist report labels, and only display odds ratios for which the corresponding P value (two-sided t test) is less than 0.05.

Table 5: F1 scores for BERT+Thresholding and BERT+LogReg trained on the MIMIC-CXR and CheXpert datasets. We refer to the BERT+LogReg method on the MIMIC-CXR dataset as VisualCheXbert. The models here are evaluated on the CheXpert test set. Rows: Atelectasis (n=153), Cardiomegaly (n=151), Edema (n=78), Pleural Effusion (n=104), Enlarged Cardiom. (n=253), Lung Opacity (n=264), Support Devices (n=261), No Finding (n=62), Average, and Weighted Average.
We map the output of an automated radiology report labeler to X-ray image labels, similarly to how we previously map the output of radiologists labeling reports to the output of radiologists labeling images. Previous work by Dunnmon et al. [8] showed that labels obtained from noisy labeling functions on radiology reports can be mapped to labels that are of similar quality to image labels produced by radiologists for the simpler task of classifying X-rays as normal or abnormal.
This approach, motivated by a prior experiment in which we map radiologist report labels to image labels, improves upon the naive uncertainty mapping strategy used in the Zero-One Baseline. As before, we obtain report labels by running the CheXpert labeler on radiology report impressions. For each of the evaluation conditions, we train a logistic regression model that maps the CheXpert labeler's output on a radiology report impression to a positive or negative label for the target condition. This approach makes use of the automated report labels for all 14 conditions to predict the label for each target condition. We refer to this approach as the LogReg Baseline.

We one-hot encode the report labels outputted by the CheXpert labeler and provide these binary variables as inputs to a logistic regression model. We train a logistic regression model with L2 regularization (C = 1.0) and a maximum of 500 iterations using the one-hot encoded report labels (for all conditions) as input and the image ground truth for a condition as output. The class weights are the inverse prevalence of the respective class in the training set. We use a leave-one-out cross-validation strategy to train and validate the logistic regression model on the CheXpert test dataset. We train a different logistic regression model for each of the 8 evaluation conditions to map the labels produced by the CheXpert labeler to binary image labels.
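A minimal sketch (our assumed scikit-learn stand-in, not the released code) of the LogReg Baseline for one target condition; class_weight="balanced" stands in for the inverse-prevalence class weights described above.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def logreg_baseline_predictions(labeler_df, image_labels):
    """labeler_df: per-condition CheXpert labeler output in
    {"positive", "negative", "uncertain", "blank"}; image_labels: binary array."""
    X = pd.get_dummies(labeler_df)
    clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=500)
    # Leave-one-out cross-validation: each test study is predicted by a model
    # fit on the remaining studies.
    return cross_val_predict(clf, X, image_labels, cv=LeaveOneOut())
```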
We find that the LogReg Baseline approach improves upon the Zero-One Baseline for most conditions. Compared to the Zero-One Baseline, the LogReg Baseline increases the average F1 score from 0.54 to 0.65 and the weighted average F1 score from 0.56 to 0.70. The LogReg Baseline increases the F1 score compared to the Zero-One Baseline from 0.52 to 0.63 for Atelectasis, 0.46 to 0.56 for Cardiomegaly, 0.20 to 0.67 for Enlarged Cardiomediastinum, 0.69 to 0.81 for Lung Opacity, and 0.39 to 0.55 for No Finding. However, the LogReg Baseline decreases the F1 scores compared to the Zero-One Baseline from 0.53 to 0.47 for Edema and 0.85 to 0.84 for Support Devices. For Pleural Effusion, both the LogReg Baseline and the Zero-One Baseline have an F1 score of 0.65. Although the LogReg Baseline is not better than the Zero-One Baseline for all conditions, these results suggest that a learned mapping from report labels to X-ray image labels can outperform naively mapping all uncertain labels to positive or negative for most conditions. The F1 scores obtained by the LogReg Baseline, along with head-to-head comparisons to the Zero-One Baseline, are shown in Table 4.
Previously, we mapped the output of an existing automated report labeler, which takes text reports as input, to X-ray image labels. We now map the textual radiology report directly to the X-ray image labels.
We develop a deep learning model that maps a radiology report directly to the corresponding X-ray image labels. Since it is too expensive to obtain labels from radiologists for hundreds of thousands of X-ray images to supervise our model, we instead train a single DenseNet model [12] to detect medical conditions from chest X-ray images, as is described by Irvin et al. [13], and we use this computer vision model as a proxy for a radiologist labeling an X-ray image. We use the DenseNet model to output probabilities for each of the 14 conditions for all X-rays in the MIMIC-CXR dataset and the CheXpert training dataset. To obtain the output of the vision model on the MIMIC-CXR dataset, we train the DenseNet on the CheXpert training dataset. Similarly, to obtain the output of the vision model on the CheXpert training dataset, we train the DenseNet on the MIMIC-CXR dataset. We find that the DenseNet trained on the CheXpert training set has an AUROC of 0.875 on the CheXpert test set across all conditions, and the DenseNet trained on the MIMIC-CXR dataset has an AUROC of 0.883 on the CheXpert test set across all conditions.

We then use the probabilities outputted from these computer vision models as ground truth to fine-tune a BERT-base model. We train one BERT model using the MIMIC-CXR dataset and one using the CheXpert training dataset. The BERT model takes a tokenized radiology report impression from the MIMIC-CXR or CheXpert dataset as input and is trained to output the labels produced by the DenseNet model. We feed the BERT model's output corresponding to the [CLS] token into linear heads (one head for each medical condition) to produce scores for each medical condition. We use the cross-entropy loss to fine-tune BERT. The BERT model is initialized with biomedically pretrained weights produced by Peng et al. [21]. This model training process is shown in Figure 1.

After training the BERT model, we map the outputs of BERT, which are probabilities, to positive or negative labels for each condition. To do so, we try two different methods. Our first method uses optimal probability thresholds to convert the BERT outputs to binary labels. We calculate optimal thresholds by finding the threshold for each condition that maximizes Youden's index [34] (the sum of sensitivity and specificity minus one) on the CheXpert validation dataset. We refer to this approach as BERT+Thresholding. Our second method trains a logistic regression model to map the output of BERT across all 14 conditions to a positive or negative label for the target condition. We refer to this approach as BERT+LogReg. Ultimately, we develop four different models by using both methods on outputs from a BERT model trained on the MIMIC-CXR dataset and a BERT model trained on the CheXpert training dataset. The four resulting models are called BERT+Thresholding on MIMIC-CXR, BERT+LogReg on MIMIC-CXR, BERT+Thresholding on CheXpert, and BERT+LogReg on CheXpert. We refer to the BERT+LogReg model trained on the MIMIC-CXR dataset with labels provided by the DenseNet model, which is our best performing approach, as VisualCheXbert.

We train the BERT model on 3 TITAN-XP GPUs using the Adam optimizer [15] with a learning rate of 2 × 10⁻⁵, following Devlin et al. [7] for fine-tuning tasks. We use a random 85%-15% training-validation split, as in Smit et al. [28]. The BERT model is trained until convergence. We use a batch size of 18 radiology report impressions. For the BERT+LogReg approach, the logistic regression model uses L2 regularization (C = 1.0) and a maximum of 500 iterations. Similar to the LogReg Baseline, the class weights are the inverse prevalence of the respective class in the training set, and we use a leave-one-out cross-validation strategy to train and test the logistic regression model on the CheXpert test dataset. We train a different logistic regression model for each of the 8 evaluation conditions to map the probabilities outputted by the BERT model to the binary image labels.
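The following is a minimal PyTorch sketch (our assumed implementation, not the released VisualCheXbert code) of the two pieces described above: fine-tuning BERT so that per-condition linear heads on the [CLS] embedding match the DenseNet's probabilities, and picking per-condition thresholds by Youden's index. The checkpoint name, the two-way head layout, and the soft cross-entropy formulation are assumptions.

```python
import torch
import torch.nn as nn
from sklearn.metrics import roc_curve
from transformers import AutoModel, AutoTokenizer

NUM_CONDITIONS = 14
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in for the biomedically pretrained checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")
heads = nn.ModuleList([nn.Linear(encoder.config.hidden_size, 2) for _ in range(NUM_CONDITIONS)])

def loss_on_batch(impressions, densenet_probs):
    """impressions: list of report impression strings; densenet_probs: (batch, 14) tensor
    of probabilities from the chest X-ray vision model, used as soft targets."""
    enc = tokenizer(impressions, padding=True, truncation=True, max_length=512, return_tensors="pt")
    cls = encoder(**enc).last_hidden_state[:, 0]  # [CLS] embedding for each report
    loss = 0.0
    for i, head in enumerate(heads):
        logits = head(cls)  # (batch, 2) scores for condition i
        soft_targets = torch.stack([1 - densenet_probs[:, i], densenet_probs[:, i]], dim=1)
        loss = loss + (-soft_targets * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()
    return loss

def youden_threshold(val_probs, val_labels):
    """Threshold maximizing sensitivity + specificity - 1 on the validation set."""
    fpr, tpr, thresholds = roc_curve(val_labels, val_probs)
    return thresholds[(tpr - fpr).argmax()]
```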
Table 6: Improvement in F1 score obtained by VisualCheXbert, evaluated on the CheXpert test set and reported with 95% confidence intervals. The left-most column shows the improvement over the Zero-One Baseline. The middle column shows the improvement over the radiologist report labels with uncertains mapped to the image ground truth label. The right-most column shows the improvement over the radiologist report labels with uncertains mapped to the opposite of the image ground truth label.

Condition | vs. Zero-One Baseline | vs. report labels (uncertains to image GT) | vs. report labels (uncertains to opposite of image GT)
Atelectasis (n=153) | 0.12 (0.04, 0.20) | 0.04 (-0.04, 0.12) | 0.41 (0.32, 0.49)
Cardiomegaly (n=151) | 0.16 (0.07, 0.25) | 0.15 (0.07, 0.25) | 0.20 (0.11, 0.28)
Edema (n=78) | 0.01 (-0.05, 0.07) | -0.04 (-0.09, 0.02) | 0.09 (0.02, 0.17)
Pleural Effusion (n=104) | -0.01 (-0.04, 0.03) | -0.06 (-0.10, -0.02) | 0.01 (-0.03, 0.05)
Enlarged Cardiom. (n=253) | 0.53 (0.46, 0.60) | 0.52 (0.44, 0.60) | 0.64 (0.57, 0.71)
Lung Opacity (n=264) | 0.14 (0.09, 0.20) | 0.15 (0.09, 0.20) | 0.15 (0.10, 0.20)
Support Devices (n=261) | 0.02 (-0.01, 0.06) | 0.01 (-0.02, 0.04) | 0.01 (-0.02, 0.04)
No Finding (n=62) | 0.15 (0.05, 0.26) | 0.16 (0.05, 0.28) | 0.16 (0.05, 0.28)
Average | 0.14 (0.12, 0.17) | 0.12 (0.09, 0.15) | 0.21 (0.18, 0.24)
Weighted Average | 0.17 (0.15, 0.20) | 0.15 (0.13, 0.18) | 0.24 (0.21, 0.26)

We compare the performance of the different BERT approaches on the CheXpert test set. First, we find that on most conditions, BERT+LogReg outperforms BERT+Thresholding. This finding holds true on both the CheXpert and MIMIC-CXR datasets. Second, we find that despite being trained on datasets from different institutions, the models trained on MIMIC-CXR and CheXpert datasets perform similarly. This indicates that the BERT model trained on radiology report impressions from the MIMIC-CXR distribution (Beth Israel Deaconess Medical Center Emergency Department between 2011–2016) [14] can perform as well as a model trained on radiology report impressions from the CheXpert distribution (Stanford Hospital between 2002-2017) [13], even when both models are evaluated on a test set from the CheXpert distribution. Since we obtain a slightly higher average and weighted average F1 using the MIMIC-CXR dataset, we use BERT trained on MIMIC-CXR in our final approach called VisualCheXbert. The performance of the BERT approaches is shown in Table 5.

Next, we compare VisualCheXbert to the Zero-One Baseline. When comparing VisualCheXbert to the Zero-One Baseline as well as the higher and lower scores of radiologists labeling reports described below, we report the improvements by computing the paired differences in F1 scores on 1000 bootstrap replicates and providing the mean difference along with a 95% two-sided confidence interval [9]. Overall, VisualCheXbert improves the average F1 and weighted average F1 over the Zero-One Baseline with statistical significance, increasing the average F1 score by 0.14 (95% CI 0.12, 0.17) and the weighted average F1 score by 0.17 (95% CI 0.15, 0.20). We find that VisualCheXbert obtains a statistically significant improvement over the Zero-One Baseline on most conditions. VisualCheXbert increases the F1 score on Enlarged Cardiomediastinum, Cardiomegaly, No Finding, Lung Opacity, and Atelectasis compared to the Zero-One Baseline by 0.53 (95% CI 0.46, 0.60), 0.16 (95% CI 0.07, 0.25), 0.15 (95% CI 0.05, 0.26), 0.14 (95% CI 0.09, 0.20), and 0.12 (95% CI 0.04, 0.20), respectively. VisualCheXbert obtains similar performance (no statistically significant difference) to the Zero-One Baseline on the rest of the conditions, which are Edema, Pleural Effusion, and Support Devices, with improvements of 0.01 (95% CI -0.05, 0.07), -0.01 (95% CI -0.04, 0.03), and 0.02 (95% CI -0.01, 0.06), respectively.

Lastly, we compare the F1 scores for VisualCheXbert to the higher and lower scores of radiologists labeling reports. The higher scores for radiologists labeling reports are obtained by mapping the uncertain radiologist report labels to the image ground truth label, while the lower scores for radiologists labeling reports are obtained by mapping the uncertain radiologist report labels to the opposite of the ground truth.
Overall, VisualCheXbert obtains a statistically significant improvement over both the higher and lower radiologist scores, increasing the average F1 score by 0.12 (95% CI 0.09, 0.15) over the higher radiologist score and 0.21 (95% CI 0.18, 0.24) over the lower radiologist score and increasing the weighted average F1 score by 0.15 (95% CI 0.13, 0.18) over the higher radiologist score and 0.24 (95% CI 0.21, 0.26) over the lower radiologist score. Statistically significant improvements over the higher radiologist score are observed for Cardiomegaly (0.15 [95% CI 0.07, 0.25]), Enlarged Cardiomediastinum (0.52 [95% CI 0.44, 0.60]), Lung Opacity (0.15 [95% CI 0.09, 0.20]), and No Finding (0.16 [95% CI 0.05, 0.28]). VisualCheXbert performs similarly (no statistically significant difference) to the higher radiologist score on Atelectasis (0.04 [95% CI -0.04, 0.12]), Edema (-0.04 [95% CI -0.09, 0.02]), and Support Devices (0.01 [95% CI -0.02, 0.04]). VisualCheXbert performs slightly worse than the higher radiologist score on one condition, which is Pleural Effusion (-0.06 [95% CI -0.10, -0.02]). VisualCheXbert observes considerable, statistically significant improvements compared to the lower radiologist score on all but two conditions. There is no statistically significant difference between VisualCheXbert and the lower radiologist score on these two conditions, which are Pleural Effusion (0.01 [95% CI -0.03, 0.05]) and Support Devices (0.01 [95% CI -0.02, 0.04]). We show the improvements obtained by VisualCheXbert over the Zero-One Baseline and the improvements over radiologists labeling reports in Table 6.
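A minimal sketch (our assumed implementation) of the paired bootstrap comparison described above, for a single condition; the percentile form of the 95% confidence interval is an assumption, and the inputs are assumed to be NumPy arrays over the 500 test studies.

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_difference(y_true, pred_a, pred_b, n_replicates=1000, seed=0):
    """Mean paired difference F1(pred_a) - F1(pred_b) with a 95% two-sided CI."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_replicates):
        idx = rng.integers(0, n, size=n)  # resample studies with replacement
        diffs.append(f1_score(y_true[idx], pred_a[idx]) - f1_score(y_true[idx], pred_b[idx]))
    diffs = np.asarray(diffs)
    return diffs.mean(), (np.percentile(diffs, 2.5), np.percentile(diffs, 97.5))
```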
Our work has the following limitations. First, our study only made use of the Impression section of the radiology reports, which is a summary of the radiology report. Prior work regarding automated chest X-ray labeling has also extensively used the impression section in radiology reports [13, 14, 32]. However, conditions are sometimes mentioned in the Findings section of the report but not in the Impression section. As a result, negative and blank labels are more frequent when using the Impression section, and this could increase the disparity between labels extracted from the impression and the corresponding chest X-ray image labels. Second, the VisualCheXbert model has a maximum input size of 512 tokens. In practice, only 3 of the report impressions in the entire CheXpert dataset were longer than this limit. Third, the CheXpert test set, on which we evaluated our models, consists of 500 radiology studies and is therefore limited in size. As a result, some of the medical conditions contained very few positive examples; we only evaluated our models on conditions for which at least 10% of the examples in the CheXpert test set were positive. Using a larger test set would allow evaluation on rarer conditions. Fourth, our models are evaluated on chest X-rays from a single institution. Further evaluation on data from other institutions could be used to evaluate the generalizability of our models.
We investigate the discrepancy between labels extracted from radiology reports and the X-ray image ground truth labels. We then develop and evaluate methods to address this discrepancy. In our work, we aim to answer the following questions.
Do radiologists labeling reports agree with radiologists labeling X-ray images?
We find that there is significant disagreement between radiologists labeling reports and radiologists labeling images. On the CheXpert test set, we observe low Kappa scores for almost all conditions evaluated. The average Kappa across the evaluation conditions is between 0.312 and 0.430. These bounds are based on the most pessimistic mapping and most optimistic mapping of uncertain radiology report labels.
Why do radiologists labeling reports disagree with radiologists labeling X-ray images?
Upon a board-certified radiologist review of examples of disagreements between radiologists labeling reports and radiologists labeling images, we find four main reasons for disagreement. First, on the CheXpert test set, radiologists labeling reports typically do not mark a parent condition as positive if a child condition is positive. An example of a parent and child condition would be Lung Opacity and Edema, respectively. Second, radiologists labeling reports have access to clinical report history, which biases their diagnoses compared to radiologists labeling images who do not have access to this information. Third, conditions are sometimes reported in the Findings section of radiology reports but not the Impression section of radiology reports. However, the Impression section of radiology reports is commonly used to label reports. This discrepancy can cause radiologists labeling reports to miss pathologies present on the X-ray image. Fourth, labeling images and reports is noisy to a certain extent due to factors such as human mistakes, uncertainty in both reports and images, and different thresholds for diagnosing conditions as positive among radiologists.
Are there significant relationships between conditions labeled from reports and conditions labeled from images?
We find many significant relationships between conditions labeled from reports and conditions labeled from images. We report and clinically interpret various radiology report labels that increase (or decrease) the odds of particular conditions in an image with statistical significance (P < 0.05). As expected, we find that positive report labels for a condition increase the odds of that condition in an image. We find that positive report labels for children of a condition increase the odds of the parent condition in an image, a phenomenon that is correcting for the label hierarchy. We find that particular uncertain report labels for a condition increase the odds of the condition (and/or its parent condition). We also find that positive report labels for certain conditions increase (or decrease) the odds of other conditions in the image. One example is that a positive Atelectasis report label decreases the odds of Support Devices in the X-ray image by 0.28 times. We explain potential mechanisms by which the presence of a condition in a report signals the presence (or absence) of another condition in the image.
Can we learn to map radiology reports directly to the X-ray image labels?
We learn to map a textual radiology report directly to the X-ray image labels. We use a computer vision model trained to detect diseases from chest X-rays as a proxy for a radiologist labeling an X-ray image. Our final model, VisualCheXbert, uses a biomedically-pretrained BERT model that is supervised by the computer vision model. When evaluated on radiologist image labels on the CheXpert test set, VisualCheXbert increases the average F1 score across the evaluation conditions by between 0.12 (95% CI 0.09, 0.15) and 0.21 (95% CI 0.18, 0.24) compared to radiologists labeling reports. VisualCheXbert also increases the average F1 score by 0.14 (95% CI 0.12, 0.17) compared to a common approach that uses a previous rules-based radiology report labeler.

Given the considerable, statistically significant improvement obtained by VisualCheXbert over the approach using an existing radiology report labeler [13] when evaluated on the image ground truth, we hypothesize that VisualCheXbert's labels could be used to train better computer vision models for automated chest X-ray diagnosis.
ACKNOWLEDGMENTS
We would like to acknowledge the Stanford Machine Learning Group (stanfordmlgroup.github.io) and the Stanford Center for Artificial Intelligence in Medicine and Imaging (AIMI.stanford.edu) for infrastructure support.
REFERENCES
[1] Michael David Abràmoff, Yiyue Lou, Ali Erginay, Warren Clarida, Ryan Amelon, James C Folk, and Meindert Niemeijer. 2016. Improved Automated Detection of Diabetic Retinopathy on a Publicly Available Dataset Through Integration of Deep Learning. Invest Ophthalmol Vis Sci 57, 13 (Oct 2016), 5200–5206. https://doi.org/10.1167/iovs.16-19964
[2] Adrian Brady, Risteárd Ó Laoide, Peter McCarthy, and Ronan McDermott. 2012. Discrepancy and error in radiology: concepts, causes and consequences. The Ulster Medical Journal 81, 1 (2012), 3.
[3] Lindsay P Busby, Jesse L Courtier, and Christine M Glastonbury. 2018. Bias in radiology: the how and why of misses and misinterpretations. Radiographics.
[4] Medical Image Analysis 66 (Dec 2020), 101797. https://doi.org/10.1016/j.media.2020.101797
[5] Jacob Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20, 1 (1960), 37–46. https://doi.org/10.1177/001316446002000104
[6] Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Sonya E Shooshan, Laritza Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. 2016. Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inform Assoc 23, 2 (Mar 2016), 304–310. https://doi.org/10.1093/jamia/ocv080
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
[8] Jared Dunnmon, Alexander Ratner, Nishith Khandwala, Khaled Saab, Matthew Markert, Hersh Sagreiya, Roger Goldman, Christopher Lee-Messer, Matthew Lungren, Daniel Rubin, and Christopher Ré. 2019. Cross-Modal Data Programming Enables Rapid Medical Machine Learning. arXiv:1903.11101 [cs.LG]
[9] Bradley Efron and Robert Tibshirani. 1986. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science (1986), 54–75.
[10] Esteban F Gershanik, Ronilda Lacson, and Ramin Khorasani. 2011. Critical finding capture in the impression section of radiology reports. In AMIA Annual Symposium Proceedings, Vol. 2011. American Medical Informatics Association, 465.
[11] Varun Gulshan, Lily Peng, Marc Coram, Martin C. Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, Ramasamy Kim, Rajiv Raman, Philip C. Nelson, Jessica L. Mega, and Dale R. Webster. 2016. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA.
[27] George Shih, Carol C. Wu, Safwan S. Halabi, Marc D. Kohli, Luciano M. Prevedello, Tessa S. Cook, Arjun Sharma, Judith K. Amorosa, Veronica Arteaga, Maya Galperin-Aizenberg, Ritu R. Gill, Myrna C.B. Godoy, Stephen Hobbs, Jean Jeudy, Archana Laroia, Palmi N. Shah, Dharshan Vummidi, Kavitha Yaddanapudi, and Anouk Stein. 2019. Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia. Radiology: Artificial Intelligence 1, 1 (2019), e180041. https://doi.org/10.1148/ryai.2019180041
[28] Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, and Matthew P. Lungren. 2020. CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. arXiv:2004.09167 [cs.CL]
[29] Siyi Tang, Amirata Ghorbani, Rikiya Yamashita, Sameer Rehman, Jared A. Dunnmon, James Zou, and Daniel L. Rubin. 2020. Data Valuation for Medical Imaging Using Shapley Value: Application on A Large-scale Chest X-ray Dataset. arXiv:2010.08006 [cs.LG]
[30] Yu-Xing Tang, You-Bao Tang, Yifan Peng, Ke Yan, Mohammadhadi Bagheri, Bernadette A. Redd, Catherine J. Brandon, Zhiyong Lu, Mei Han, Jing Xiao, and Ronald M. Summers. 2020. Automated abnormality classification of chest radiographs using deep convolutional neural networks. npj Digital Medicine 3, 1 (2020), 70. https://doi.org/10.1038/s41746-020-0273-z
[31] Linda Wang, Zhong Qiu Lin, and Alexander Wong. 2020. COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Scientific Reports 10, 1 (2020), 19549.
[32] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. 2017. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2097–2106.
[33] Wenwu Ye, Jin Yao, Hui Xue, and Yi Li. 2020. Weakly Supervised Lesion Localization With Probabilistic-CAM Pooling. arXiv:2005.14480 [cs.CV]
[34] William J Youden. 1950. Index for rating diagnostic tests.