Creation and Validation of a Chest X-Ray Dataset with Eye-tracking and Report Dictation for AI Development
Alexandros Karargyris, Satyananda Kashyap, Ismini Lourentzou, Joy Wu, Arjun Sharma, Matthew Tong, Shafiq Abedin, David Beymer, Vandana Mukherjee, Elizabeth A Krupinski, Mehdi Moradi
1,* IBM Research, Almaden Research Center, San Jose, CA, 95120, USA
Department of Radiology and Imaging Sciences, Emory University, Atlanta, GA, 30322, USA
*corresponding author(s): Alexandros Karargyris ([email protected]), Satyananda Kashyap ([email protected]), Mehdi Moradi ([email protected])
†these authors contributed equally to this work

ABSTRACT
We developed a rich dataset of Chest X-Ray (CXR) images to assist investigators in artificial intelligence. The data were collected using an eye tracking system while a radiologist reviewed and reported on 1,083 CXR images. The dataset contains the following aligned data: CXR image, transcribed radiology report text, radiologist's dictation audio, and eye gaze coordinate data. We hope this dataset can contribute to various areas of research, particularly towards explainable and multimodal deep learning / machine learning methods. Furthermore, investigators in disease classification and localization, automated radiology report generation, and human-machine interaction can benefit from these data. We report deep learning experiments that utilize the attention maps produced by the eye gaze dataset to show the potential utility of these data.
Background & Summary
In recent years, artificial intelligence (AI) has been extensively explored for enhancing the efficacy and efficiency of the radiology interpretation and reporting process. As the current prevalent paradigm of AI is deep learning, many of the works in AI for radiology use large datasets of labelled radiology images to train deep neural networks to classify images according to disease classes. Given the high labor cost of annotating images with the areas depicting the disease, large public training datasets often come with global labels describing the whole image without localized annotation of the disease areas. The deep neural network model is trusted with discovering the relevant part of the image and learning the features characterizing the disease. This limits the performance of the resulting network. Furthermore, the black-box nature of deep neural networks and the lack of local annotations mean that the process of developing disease classifiers does not take advantage of experts' knowledge of disease appearance and location in medical images. The result is a multi-layer and nonlinear model with serious concerns with respect to the explainability of its output. In addition, degradation of generalization capability (i.e., when the model is deployed to infer the class labels for images from other sources or distributions) caused by scanner differences and/or demographic changes is a well-studied concern.

In the past five decades, eye tracking has been extensively used in radiology for education, perception understanding, and fatigue measurement (see example reviews). More recently, efforts have used eye tracking data to improve segmentation and disease classification in Computed Tomography (CT) radiography by integrating them into deep learning techniques.
With such evidence, and given the lack of public datasets that capture eye gaze data in the chest X-Ray (CXR) space, we present a new dataset that can help improve the way machine learning models are developed for radiology applications, and we demonstrate its use in some popular deep learning architectures. This dataset consists of eye gaze information recorded from a single radiologist interpreting frontal chest radiographs. Dictation data (audio and timestamped text) of the radiology report reading is also provided. We also generated bounding boxes containing anatomical structures on every image and share them as part of this dataset. These bounding boxes can be used in conjunction with eye gaze information to produce more meaningful analyses.

We present evidence that this dataset can help with two important tasks for AI practitioners in radiology:

• The coordinates marking the areas of the image that a radiologist looks at while reporting a finding provide an approximate region of interest/attention for that finding. Without altering a radiologist's routine, this approach presents an inexpensive and efficient method for generating a locally annotated collection of images for training machine learning algorithms (e.g., disease classifiers). Since we also share the ground truth bounding boxes, the validity of the eye tracking in marking the location of the finding can be further studied using this dataset. We demonstrate utilization of eye gaze in deep neural network training and show that an improvement in performance can be obtained.

• Tracking of the eyes can characterize how radiologists approach the task of reading radiographs. Study of the eye gaze of radiologists while reading normal and disease radiographs, presented as attention maps, reveals a cognitive workflow pattern that AI developers can use when building their models.

We invite researchers in the radiology community who wish to contribute to the further development of the dataset to contact us.

Methods
Figure 1 provides an overview of the study and data generation process. For this study we used the publicly available MIMIC-CXR Database in conjunction with the publicly available Emergency Department (ED) subset of the MIMIC-IV Clinical Database. The MIMIC-IV-ED subset contains clinical observations/data and outcomes related to some of the CXR exams in the MIMIC-CXR database. Inclusion and exclusion criteria were applied to the patient attributes and clinical outcomes (via the discharge diagnosis, a.k.a. the ICD-9 code) recorded in the MIMIC-IV Clinical Database, resulting in a subset of 1,083 cases that equally cover 3 conditions: Normal, Pneumonia and Congestive Heart Failure (CHF). The corresponding CXR images of these cases were extracted from the MIMIC-CXR database. A radiologist (American Board of Radiology certified, with over 5 years of experience) performed routine radiology reading of the images using the Gazepoint GP3 Eye Tracker (i.e., eye tracking device), Gazepoint Analysis UX Edition software (i.e., software for performing eye gaze experiments), a headset microphone, a desktop computer and a monitor (Dell S2719DGF) set at 1920x1080 resolution. Radiology reading took place in multiple sessions (i.e., 30 cases per session) over a period of 2 months (i.e., March - May 2020). The Gazepoint Analysis UX Edition exported video files (.avi format) containing eye fixations and voice dictation of the radiologist's reading, along with spreadsheets (.csv format) containing the eye tracker's recorded eye gaze data. The audio was extracted from the video files and saved in wav and mp3 formats. Subsequently, these audio files were processed with speech-to-text software (i.e., Google Speech-to-Text) to extract text transcripts along with dictation word time-related information (.json format). Furthermore, these transcripts were manually corrected. The final dataset contains the raw eye gaze signal information (.csv), audio files (.wav, .mp3) and transcript files (.json).

Ethical Statement
The source data from the MIMIC-CXR and MIMIC-IV databases have been previously de-identified, and the institutional review boards of the Massachusetts Institute of Technology (No. 0403000206) and Beth Israel Deaconess Medical Center (2001-P-001699/14) both approved the use of the databases for research. We have also complied with all relevant ethical regulations in the use of the data for our study.
Data Preparation
Inclusion and Exclusion Criteria
Figure 2 describes the inclusion/exclusion criteria used to generate this dataset. These criteria were applied on the MIMIC-IV Clinical Database to identify the CXR studies of interest. The studies were used to extract their corresponding CXR images from the MIMIC-CXR Database.

We selected two clinically prevalent and high impact diseases, pneumonia and congestive heart failure (CHF), in the Emergency Department (ED) setting. We also picked normal cases as a comparison class. Unlike related CXR labeling efforts, where the same labels are derived from radiology reports using natural language processing (NLP) alone, the ground truth for our pneumonia and CHF class labels was derived from unique discharge ICD-9 codes (verified by our clinicians) from the MIMIC-IV-ED tables. This ensures the ground truth is based on a formal clinical diagnosis and is likely to be more reliable, given that ICD-9 discharge diagnoses are typically derived from a multi-disciplinary team of treating providers after having considered all clinically relevant information (e.g., bedside observations, labs) in addition to the CXR images. This is particularly important since CXR observations alone may not always be specific enough to reach a pneumonia or CHF clinical diagnosis. The normal class was determined by excluding any ICD-9 codes that may result in abnormalities visible on CXRs and also having no abnormal labels extracted from the relevant CXR reports using a CXR report labeler. The code to run the inclusion and exclusion criteria is available on our GitHub repository.

In addition, our sampling criteria prioritized the strategy of getting a good number of examples of disease features across a range of ages and sexes from the source ED population. The goal is to support building and evaluation of computer vision algorithms that do not overly rely on age and sex biases, which may depict prominent visual features, to predict disease classes.

Preparation of Images
The 1,083 CXR images (see the Inclusion and Exclusion Criteria section) were converted from DICOM (Digital Imaging and Communications in Medicine) format to .png format: normalized (0-255), then resized and padded to 1920x1080 (i.e., keeping the same aspect ratio) to fit the radiologist's monitor resolution and to enable loading into Gazepoint Analysis UX Edition.

A calibration image (resolution: 1920x1080 pixels) consisting of a white dot (30 pixels in radius) was generated (see Figure 3 - left). The calibration image was presented to the radiologist randomly during data collection to measure eye gaze offset (see Figure 3 - right).

The 1,083 images and calibration images were split into 38 numbered folders (i.e., '01', '02', '03', ..., '38') with no more than 30 images per folder. These folders were then uploaded to IBM's internal BOX™ and shared with the radiologist, who downloaded and loaded them into the Gazepoint Analysis UX Edition software to perform the reading (i.e., data collection).

Data Collection
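The normalize-and-letterbox conversion from the Preparation of Images step above can be sketched as follows. This is a dependency-free illustration using nearest-neighbour resampling; the released scripts presumably read the DICOMs with a dedicated library (e.g., pydicom) and applied the stored windowing, so treat the details here as assumptions:

```python
import numpy as np

def prepare_for_display(pixels, target_w=1920, target_h=1080):
    """Normalize a raw pixel array to 0-255 and letterbox it into a
    target_h x target_w canvas, preserving the aspect ratio."""
    lo, hi = int(pixels.min()), int(pixels.max())
    img = ((pixels - lo) / max(hi - lo, 1) * 255).astype(np.uint8)
    h, w = img.shape
    scale = min(target_w / w, target_h / h)
    new_w, new_h = int(w * scale), int(h * scale)
    # Nearest-neighbour resize via index sampling (keeps the sketch dependency-free).
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    # Centre the resized image on a black canvas; the remainder is padding.
    canvas = np.zeros((target_h, target_w), dtype=np.uint8)
    y0, x0 = (target_h - new_h) // 2, (target_w - new_w) // 2
    canvas[y0:y0 + new_h, x0:x0 + new_w] = resized
    return canvas
```

A portrait-oriented radiograph ends up centred horizontally with black bars on the left and right, which is exactly the letterbox geometry the coordinate mapping in the Data Post-Processing section has to undo.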
Software and Hardware Setup
The Gazepoint GP3 Eye Tracker is an optical eye tracker that uses infrared light to detect eye fixation. The Gazepoint Analysis UX Edition software is a software suite that comes with the Gazepoint GP3 Eye Tracker and allows performing eye gaze experiments on image series.

The Gazepoint GP3 Eye Tracker was set up in the radiologist's routine working environment on a Windows desktop PC (connected to a USB 3 port). The Gazepoint Analysis UX Edition software was installed on the same computer. Each session was a standalone experiment that contained up to 30 images for reading by the radiologist. The radiologist's eyes were 28 inches away from the monitor. The choice of this number of images was intentional, to avoid fatigue and interruptions and to allow for timely offline review and quality assurance of each session's recordings by the rest of the team. The Gazepoint Analysis UX Edition software provides 9-point calibration, which occurred at the beginning of each session. In addition, Gazepoint Analysis UX Edition allows the user to move to the next image either by pressing the spacebar when done with a case or by waiting for a fixed time. In this way the radiologist was able to move to the next CXR image when he was done with a given image, making the experiment easier.

Radiology Reading
The radiologist read the 1,083 CXR images, reporting in unstructured prose, the same as he would in his routine working environment. The goal was to simulate a typical radiology read with minimal disruption from the eye gaze data collection process. The order of the CXR images was randomized to allow a blinded radiology read. In addition, we intentionally withheld the reason-for-exam information from our radiologist in order to collect an objective CXR exam interpretation based only on the available imaging features.

The original source MIMIC-CXR Database has the original de-identified free-text reports for the same images, which were collected in real clinical scenarios where the reading radiologists had access to some patient clinical information outside the CXR image. The radiologists may even have had discussions about the patients with the bedside treating physician. Interpreting CXRs with additional patient clinical information (e.g., age, sex, other signs or symptoms) has the benefit of allowing radiologists to provide a narrower list of disease differential diagnoses by reasoning with their extra medical knowledge. However, it may also have the unintended effect of narrowing the radiology finding descriptions or subconsciously biasing what the radiologists look for in the image. In contrast, our radiologist only had the clinical information that all the CXRs came from an ED clinical setting.

By collecting a more objective read, we ensured that the CXR images used in this dataset have associated reports from both kinds of reading scenarios (read with and without patient clinical information). The goal is to broaden the range of possible technical and clinical research questions that future researchers working with the dataset may ask and explore.

Data Post-Processing
At the end of each session the radiologist exported the following information from the Gazepoint Analysis UX Edition software: 1) a fixation spreadsheet (.csv) containing fixation information for each case in the session, 2) an eye gaze spreadsheet (.csv) containing raw eye gaze information for each case in the session, and 3) video files (.avi) containing audio (i.e., the radiologist's dictation) along with his eye gaze fixation heatmaps for each case in the session (see Figure 4). These files were uploaded and shared over IBM's internal BOX™ subscribed service. A team member reviewed each video for any technical quality issues (e.g., corrupted file, video playback stopped abruptly, bad audio quality).

Once data collection (i.e., 38 sessions) finished, the following post-processing tasks were performed.

Spreadsheet Merging
From all sessions (i.e., folders), the fixation spreadsheets were concatenated into a single spreadsheet file, fixations.csv, and the raw eye gaze spreadsheets were concatenated into a single spreadsheet file, eye_gaze.csv. Mapping of eye gaze and fixation coordinates from the screen coordinate system to the original MIMIC image coordinate system was also performed at this stage. Detailed descriptions of these tables are provided in the Data Records section.
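In code, the merging and coordinate-mapping steps amount to concatenating the per-session CSV exports and inverting the display transform. A sketch with pandas; the scale and offset arguments are the per-image letterbox parameters from the Preparation of Images step, and the exact column names in the release may differ:

```python
import pandas as pd

def merge_sessions(session_csvs):
    """Concatenate per-session spreadsheet exports into one table."""
    return pd.concat((pd.read_csv(p) for p in session_csvs), ignore_index=True)

def screen_to_image(x_px, y_px, scale, x_offset, y_offset):
    """Map a screen-pixel gaze coordinate back to the original MIMIC image
    coordinate system by undoing the letterbox offset and display scale."""
    return (x_px - x_offset) / scale, (y_px - y_offset) / scale
```

Because every image was padded to the same 1920x1080 canvas, the inverse mapping only needs the per-image scale and padding offsets recorded during image preparation.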
Audio Extraction and Transcript Generation
For each session video file (i.e., containing the radiologist's eye gaze fixations and dictation in .avi format, Figure 4), the dictation audio was extracted and saved in audio.wav and audio.mp3 files. We used Google Speech-to-Text to transcribe the audio (i.e., the wav file) into text. Transcribed text was saved in transcript.json, containing timestamps and corresponding words, based on the API example found in the documentation. Furthermore, the transcripts were corrected manually by three (3) team members (all verified by the radiologist) using the original audio. An example of a transcript json is given in the Data Records section.
Segmentation Maps and Bounding Boxes for Anatomies
Two supplemental datasets are also provided to enrich this dataset:

• Segmentation maps: Four (4) key anatomical structures per image were generated: i) left_lung.png, ii) right_lung.png, iii) mediastinum.png and iv) aortic_knob.png. These anatomical structures were automatically segmented by an internal segmentation model and then manually reviewed and corrected by the radiologist. Each image has pixel values 255 for anatomy and 0 for background. Figure 5 presents a sample case with its corresponding segmentation maps.

• Bounding boxes: An extension of a bounding box extraction pipeline was used to extract 17 anatomical bounding boxes for each CXR image, which include: 'right lung', 'right upper lung zone', 'right mid lung zone', 'right lower lung zone', 'left lung', 'left upper lung zone', 'left mid lung zone', 'left lower lung zone', 'right hilar structures', 'left hilar structures', 'upper mediastinum', 'cardiac silhouette', 'trachea', 'right costophrenic angle', 'left costophrenic angle', 'right clavicle', 'left clavicle'. These zones cover the clinically most important anatomies on a Posterior Anterior (PA) CXR image. These automatically produced bounding boxes were manually corrected (when required). Each bounding box is described by the top left corner point (x1, y1) and the bottom right corner point (x2, y2) in the original CXR image coordinate system. Figure 6 shows an example of anatomical bounding boxes. The bounding box information for the 1,083 images is contained in bounding_boxes.csv.

Researchers can utilize these two (2) supplemental datasets to improve segmentation and disease localization algorithms by combining them with the eye gaze data. In the Statistical Analysis on Fixations subsection we utilize bounding_boxes.csv to perform statistical analysis between fixations and condition pairs.
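The fixation-versus-anatomy analysis referenced above reduces to point-in-box counting plus a two-sample t-test. A sketch with SciPy on synthetic counts; the (x1, y1, x2, y2) box convention follows bounding_boxes.csv, everything else is illustrative:

```python
from scipy.stats import ttest_ind

def in_box(x, y, box):
    """True if point (x, y) lies inside box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def fixations_in_zone(points, box):
    """Count the fixations that fall inside one anatomical bounding box."""
    return sum(in_box(x, y, box) for x, y in points)

# Per-image fixation counts for one zone, split by diagnosis (synthetic numbers),
# compared with Welch's t-test between a condition pair.
normal_counts = [1, 2, 1, 2, 1, 2]
chf_counts = [9, 10, 11, 10, 9, 11]
t_stat, p_value = ttest_ind(normal_counts, chf_counts, equal_var=False)
```

Repeating this for each of the 17 zones and each condition pair yields the grid of p-values reported in the Statistical Analysis on Fixations subsection.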
Data Records
An overview of the released dataset and the relationships among its files is provided in Figure 7. Specifically, four (4) data documents and one (1) folder are provided:

1. master_sheet.csv: Spreadsheet containing MIMIC DICOM ids along with the study clinical indication sentence, report-derived finding labels, and ICD-9-derived outcome disease labels.
2. eye_gaze.csv: Spreadsheet containing raw eye gaze data as exported by the Gazepoint Analysis UX Edition software.
3. fixations.csv: Spreadsheet containing fixation data as exported by the Gazepoint Analysis UX Edition software.
4. bounding_boxes.csv: Spreadsheet containing bounding box coordinates for key frontal CXR anatomical structures.
5. audio_segmentation_transcripts: Folder containing dictation audio files (i.e., mp3, wav), a transcript file (i.e., json), and anatomy segmentation mask files (i.e., png) for each dicom id.

The dataset is hosted at PhysioNet. To utilize the dataset, the only requirement for the user is to obtain PhysioNet access to the MIMIC-CXR Database in order to download the original MIMIC CXR images in DICOM format. The dicom-id tag found throughout all the dataset documents maps records to the MIMIC CXR images. A detailed description of each data document is provided in the following subsections.

Master Spreadsheet

The master spreadsheet (master_sheet.csv) provides the following key information:

• The dicom-id column maps each row to the original MIMIC CXR image as well as to the rest of the documents in this dataset.
• The study-id column maps the CXR image/dicom to the associated CXR report, which can be found in the source MIMIC-CXR dataset.
• For each CXR study (study-id), granular radiology 'finding' labels have been extracted from the associated original MIMIC reports by two different NLP pipelines: first, the CheXpert NLP pipeline, and second, an NLP pipeline developed internally.
• Additionally, for each CXR study (study-id), the reason-for-exam indication has been sectioned out from the original MIMIC CXR reports.
The indication sentence(s) tend to contain patient clinical information that may not otherwise be visible from the CXR image alone.

Table 1 describes in detail each column found in the master spreadsheet.

Fixations and Eye Gaze Spreadsheets
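Since every table shares the dicom id key, joining case-level labels onto the other spreadsheets is a one-liner with pandas. The toy rows and column names below are illustrative stand-ins, not the exact release schema:

```python
import pandas as pd

# Toy rows standing in for master_sheet.csv and fixations.csv
# (hypothetical column names for illustration only).
master = pd.DataFrame({"dicom_id": ["d1", "d2"],
                       "condition": ["Normal", "CHF"]})
fixations = pd.DataFrame({"dicom_id": ["d1", "d1", "d2"],
                          "x": [100, 220, 310],
                          "y": [400, 410, 90]})

# One row per fixation, annotated with its case-level label.
joined = fixations.merge(master, on="dicom_id", how="left")
fixations_per_condition = joined.groupby("condition").size()
```

The same pattern links the transcripts, segmentation maps and bounding boxes back to each case.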
The eye gaze information is stored in two (2) files: a) fixations.csv, and b) eye_gaze.csv. Both files were exported by the Gazepoint Analysis UX Edition software. Specifically, the eye_gaze.csv file contains one row for every data sample collected from the eye tracker, while the fixations.csv file contains a single data entry per fixation. The Gazepoint Analysis UX Edition software generates the fixations.csv file from the eye_gaze.csv file by averaging all data within a fixation to estimate the point of fixation based on the eye gaze samples, stopping when a saccade is detected. Table 2 describes in detail each column found in the fixations and eye gaze spreadsheets.

Bounding Boxes Spreadsheet
The bounding boxes spreadsheet contains the following information:

• dicom_id: DICOM ID as provided in the MIMIC-CXR Database for each image.
• bbox_name: The names of the 17 rectangular anatomical zones that bound the key anatomical organs on a frontal CXR image. Each lung (right and left) is bounded by its own bounding box, as well as subdivided into common radiological zones (upper, mid and lower lung zones) on each side. The upper mediastinum and the cardiac silhouette (heart) bounding boxes make up the mediastinum anatomy. The trachea bounding box includes the visible tracheal air column on a frontal CXR, as well as the beginnings of the right and left main stem bronchi. The left and right hilar structures contain the left or right main stem bronchus as well as the lymph nodes and blood vessels that enter and leave the lungs in the hilar region. The left and right costophrenic angles are key regions to assess for abnormalities on a frontal CXR. The left and right clavicles can have potential fractures to rule out, but are also important landmarks for assessing whether the patient (and hence the anatomies on the CXR) is rotated, which affects the appearance of potential abnormalities. Some of the bounding boxes (e.g., clavicles) may be missing for an image if the target anatomical structure is cut off from the field of view of the CXR image.
• x1: x coordinate of the starting point of the bounding box (upper left).
• y1: y coordinate of the starting point of the bounding box (upper left).
• x2: x coordinate of the ending point of the bounding box (lower right).
• y2: y coordinate of the ending point of the bounding box (lower right).

Please see Figure 6 for an example of all the anatomical bounding boxes.

Audio, Segmentation Maps and Transcripts
The audio_segmentation_transcripts folder contains subfolders for all the cases in the study, each named with the case dicom_id. Each subfolder contains: a) the dictation audio file (mp3, wav), b) the segmentation maps of anatomies (png), as described in the Segmentation Maps and Bounding Boxes for Anatomies subsection above, and c) the dictation transcript (json). The transcript.json contains the following tags:

• full_text: The full text of the transcript.
• time_stamped_text: The full text broken into timestamped phrases:
  • phrase: Phrase text in the transcript.
  • begin_time: The starting time (in seconds) of dictation for a particular phrase.
  • end_time: The end time (in seconds) of dictation of a particular phrase.

Figure 8 shows the structure of the audio_segmentation_transcripts folder, while Figure 18 shows a transcript json example.
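A transcript.json with this structure can be consumed with the standard library alone; for example, pulling out the phrases dictated inside a given time window (the toy record below mirrors the tags listed above):

```python
import json

# A toy transcript.json record mirroring the tag structure described above.
transcript = json.loads("""
{
  "full_text": "heart size is normal",
  "time_stamped_text": [
    {"phrase": "heart size", "begin_time": 0.4, "end_time": 1.1},
    {"phrase": "is normal", "begin_time": 1.2, "end_time": 1.9}
  ]
}
""")

def phrases_between(t, start, end):
    """Return the phrases dictated entirely within [start, end] seconds."""
    return [p["phrase"] for p in t["time_stamped_text"]
            if p["begin_time"] >= start and p["end_time"] <= end]
```

Because the timestamps share a clock with the eye gaze recordings, this kind of window query is what aligns a dictated phrase with the fixations made while it was spoken.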
Technical Validation
We subjected two aspects of the released data to reliability and quality validation: eye gaze and transcripts. The code for the validation tasks below can be found on our GitHub repository.
Validation of Eye Gaze Data
As mentioned in the Preparation of Images subsection, a calibration image was interjected randomly within the eye gaze sessions to measure the error of the eye gaze on the X and Y axes (Figure 3). A total of 59 calibration images were presented throughout the data collection. We calculated the error by using the fixation coordinates of the last entry for each calibration image (i.e., the final resting fixation by the radiologist on the calibration mark). The overall average percentage errors on the X and Y axes were (error_X, error_Y) = (0.0089, 0.0504), with std (0.0065, 0.0347), respectively. In pixels, the same errors were (error_X, error_Y) = (17.0356, 54.3943), with std (12.5529, 37.4257), respectively.
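The error computation can be sketched as follows, assuming Gazepoint-style normalized fixation columns (FPOGX/FPOGY) and a calibration dot at the screen centre; both are assumptions, and the validation scripts on the GitHub repository are authoritative:

```python
import pandas as pd

def calibration_error(fixations, dot_x=0.5, dot_y=0.5):
    """For each calibration image, keep only the final (resting) fixation and
    report the mean absolute X/Y error in normalized screen coordinates."""
    last = fixations.groupby("image_id").last()
    err_x = (last["FPOGX"] - dot_x).abs()
    err_y = (last["FPOGY"] - dot_y).abs()
    return err_x.mean(), err_y.mean()
```

Multiplying the normalized errors by the 1920x1080 screen resolution gives the pixel-level figures quoted above.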
Validation of Transcripts
As mentioned in the Audio Extraction and Transcript Generation subsection, transcripts were generated by running Google Speech-to-Text on the dictation audio, with timestamps per dictated word. The software produced two (2) types of errors:

• Type A: Incorrect identification of a word at a particular time stamp (see the example in Figure 9).
• Type B: Missed transcribed phrases of the dictation (see the example in Figure 10).

The transcripts were manually corrected by three (3) experts and verified by the radiologist. Both types of errors were completely corrected. For Type B errors, the missing text (i.e., more than one (1) word) was added, with the begin_time and end_time estimated manually. To measure the potential error in the transcripts, the number of phrases with multiple words in a single time stamp (i.e., Type B errors) was calculated:

• Total number of phrases: 19,499
• Number of phrases with single words: 18,434
• Number of phrases with multiple words: 1,065

Type B error rate = (19,499 − 18,434) / 19,499 ≈ 5.5%.

Statistical Analysis on Fixations
We performed t-test analyses to measure any significant differences between fixations for each condition within anatomical structures. More specifically, we performed the following steps:

1. We examined the average number of fixations made in each disease condition, and found that the expert made significantly more overall fixations in the two diseased conditions than in the normal condition.
2. We calculated how many fixations (fixations.csv) fall into each anatomical zone (bounding box) found in bounding_boxes.csv.
3. We performed a t-test for each anatomical structure between condition pairs: i) Normal vs. Pneumonia, ii) Normal vs. CHF, iii) Pneumonia vs. CHF.

Figure 11 shows the duration of fixations per image for each disease condition and anatomical area, while Figure 12 shows the p-values from each t-test. Fixations on Normal images are significantly different from those on Pneumonia and CHF images. More fixations are made for images associated with either the Pneumonia or CHF final diagnoses. Moreover, fixations for the abnormal cases are mainly concentrated in anatomical regions (i.e., lungs and heart) that are relevant to the diagnosis, rather than distributed at random. Overall, the fixations on Pneumonia and CHF are comparatively similar, although still statistically different in some regions (e.g., Left Hilar Structures, Left Lung, Cardiac Silhouette, Upper Mediastinum). These statistical differences demonstrate that the radiologist's eye tracking information provides insight into the condition of the patient, and show how a human expert pays attention to the relevant portions of the image when interpreting a CXR exam. The code to replicate the t-test analysis can be found on the GitHub repository.

Usage Notes
The dataset is hosted on PhysioNet. The user is also required to apply for access to the MIMIC-CXR Database to download the images used in this study. Our GitHub repository provides a detailed description and source code (Python scripts) showing how to use this dataset (e.g., post-processing, machine learning experiments) and reproduce the published validation results. The data in the MIMIC dataset have been previously de-identified, and the institutional review boards of the Massachusetts Institute of Technology (No. 0403000206) and Beth Israel Deaconess Medical Center (2001-P-001699/14) both approved the use of the database for research.

Use of the Dataset in Machine Learning
To demonstrate the effectiveness and richness of the information provided in this dataset, we performed two sets of machine learning multi-class classification experiments leveraging the eye gaze data. These experiments are provided as dataset applications in simple and popular network architectures, and they can function as a starting point for researchers. Both experiments used the eye gaze heatmap data to predict the multi-class classification of the aforementioned classes (i.e., Normal, CHF, Pneumonia in Table 1) and compared the performance with and without the eye gaze information. Our evaluation metric was AUC (Area Under the ROC Curve). The first experiment deals with leveraging information from the temporal eye gaze fixation heatmaps, and the second uses static eye gaze fixation heatmaps. In contrast to temporal fixation heatmaps, a static fixation heatmap is the aggregation of all the temporal fixations into a single image.
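The aggregation of temporal heatmaps into a static one can be made concrete with a few lines of NumPy (a sketch; the normalization choice here is an assumption):

```python
import numpy as np

def static_heatmap(temporal_heatmaps):
    """Collapse a stack of m temporal fixation heatmaps, shape (m, H, W),
    into a single static heatmap scaled to [0, 1]."""
    agg = np.asarray(temporal_heatmaps, dtype=float).sum(axis=0)
    peak = agg.max()
    return agg / peak if peak > 0 else agg
```

Summing over the time axis discards the order of fixations, which is exactly the information the temporal experiment below retains and the static experiment gives up.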
Temporal Heatmaps Experiment
The first model consists of a neural architecture where the image and temporal fixation heatmap representations are concatenated before the final prediction layer. We denote an instance of this dataset as X^(i), which includes the image X^(i)_CXR and the sequence of m temporal fixation heatmaps X^(i)_eyegaze = {X^(i)_k}, where k ∈ {1, ..., m} is the temporal heatmap index. To acquire a fixed-vector CXR representation v^(i)_CXR, the image is passed through a convolutional layer with 64 filters of kernel size 7 and stride 2, followed by max-pooling, batch normalization and a dense layer of 128 units. The baseline model consists of the aforementioned image representation layer, combined with a final linear output layer that produces the classification prediction. Additionally, for the eye gaze, each heatmap is passed through a similar convolutional encoder and then the sequence of heatmaps is summarized with a 1-layer bidirectional LSTM with self-attention. We denote the heatmap representation as u^(i)_eyegaze. Here, the image and heatmap representations are concatenated before being passed through the final classification layer. Figure 13 shows the full architecture. We train with Adam, a 0.001 initial learning rate with a triangular schedule and fixed decay, and a batch size of 16. The experimental results in Figure 14 show that incorporating eye gaze temporal information, without any preprocessing, filtering or feature engineering, results in a 4% AUC improvement for this prediction task, when compared to the baseline model with just the CXR image data as input.

Static Heatmaps Experiment
The previous section showed the use of temporal fixation heatmaps, with improvements demonstrated on a simple network architecture over the baseline. In this experiment, we pose the classification problem in the U-Net architecture framework with an additional multi-class classification block at the bottleneck layer (see Figure 15). The encoding and bottleneck arm of the U-Net can be any standard pre-trained classifier without the fully connected layer. The two combined act as a feature encoder for the classifier, and the CNN decoder part of the network runs deconvolution layers to predict the static eye gaze fixation heatmaps. The advantage is that we can jointly train to output the eye gaze static fixation heatmap as well as predict the multi-class classification. Then, during testing on unseen CXR images, the network can predict the disease class and produce a probability heatmap of the most important locations pertaining to the condition.

We used a pretrained EfficientNet-b0 as the encoder and bottleneck layers. The classification head was an adaptive average pooling layer followed by flatten, dropout and linear output layers. The decoder CNN consisted of three convolutions, each followed by an upsampling layer. The loss function was a weighted combination (γ) of the classification and the segmentation losses, both of which used a binary cross entropy loss function. The baseline network consisted of just the encoder and the bottleneck arm followed by the classification head.

The hyper-parameter tuning for both the U-Net and the baseline classifier was performed using the Tune library, and the resulting best performing configuration is shown in Table 3. Figure 16 shows the U-Net and baseline AUCs. Both had similar performance. However, for this experiment, we are interested in seeing how network interpretability improved with the use of static eye gaze heatmaps. Figure 17 shows a qualitative comparison of the Grad-CAM.
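The joint objective described above can be sketched in a few lines of NumPy; here gamma weights the classification term against the heatmap term (the exact weighting form and gamma value are not specified here, so treat this as an illustrative assumption):

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean binary cross entropy between predictions in (0, 1) and 0/1 targets."""
    p = np.clip(np.asarray(pred, dtype=float), eps, 1 - eps)
    y = np.asarray(target, dtype=float)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def joint_loss(cls_pred, cls_true, map_pred, map_true, gamma=0.5):
    """Weighted combination (gamma) of the classification BCE and the static
    eye-gaze heatmap BCE used for joint training."""
    return gamma * bce(cls_pred, cls_true) + (1 - gamma) * bce(map_pred, map_true)
```

With gamma = 1 the objective reduces to the baseline classifier's loss, which makes the contribution of the heatmap term easy to ablate.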
The Grad-CAM approach is one of the common methods to visualize activation maps of convolutional networks. While the Grad-CAM based heatmaps do not clearly highlight the disease locations, the heatmap probability outputs of the U-Net clearly highlight regions similar to those in the static eye gaze heatmap.

With both experiments we aimed to demonstrate different use cases of the eye gaze data in machine learning. With the first experiment we showed how eye gaze data can be utilized in a human-machine setting where the radiologist's eye gaze information is fed into the algorithm. The second experiment shows how eye gaze information can be used for explainability purposes by generating verified activation maps. We intentionally did not include other modalities (audio, text) because of the complexity of such experiments and the scope of this paper (i.e., dataset description). We hope that these experiments can serve as a starting point for researchers to explore novel ways to utilize this multi-modal dataset.

Limitations of study
Although this study provides a unique large research dataset, we acknowledge the following limitations:

1. The study was performed with a single radiologist. This can certainly bias the dataset (it lacks inter-observer variability), and we aim to expand the data collection with multiple radiologists in the future. However, given the relatively large size and richness of data from various sources (i.e., multi-modal), we believe that the current dataset already holds great value for the research community. In addition, we have shown with preliminary machine learning experiments that a model trained to optimize on a radiologist's eye tracking pattern has improved diagnostic performance compared to a baseline model trained with weak image-level labels.

2. The images used during the radiology reading were in PNG format and not in DICOM, because the Gazepoint Analysis UX Edition does not support the DICOM format. This had the shortcoming that the radiologist could not utilize windowing techniques. However, the PNG images were prepared using the windowing information in the original DICOM images.

3. This dataset includes only Posterior Anterior (PA) CXR images, as selected by the inclusion/exclusion criteria (Figure 2). This view position criterion was chosen clinically because PA images have higher quality than Anterior Posterior (AP) CXR images. Therefore, any analysis (e.g., machine learning models trained only on this dataset) may not generalize to AP CXR images.

Code availability
Our GitHub repository contains code (Python 3) for:

1. Data Preparation
• Inclusion and exclusion criteria on the MIMIC dataset (see details in the Inclusion and Exclusion Criteria section)
• Case sampling and image preparation for the eye gaze experiment (see details in the Preparation of Images section)

2. Data Post-Processing
• Speech-to-text on dictation audio (see details in the Audio Extraction and Transcript Generation section)
• Mapping of eye gaze coordinates to original image coordinates (see details in the Fixations and Eye Gaze Spreadsheets section)
• Generation of heatmap images (i.e., temporal or static) and videos given eye gaze coordinates. The temporal and static heatmap images were used in our demonstrations of machine learning methods in the Use of the Dataset in Machine Learning section.

3. Technical Validation
• Validation of eye gaze fixation quality using calibration images (see details in the Validation of Eye Gaze Data section)
• Validation of quality in transcribed dictations (see details in the Validation of Transcripts section)
• The t-test for eye gaze fixations for each anatomical structure and condition pair (see details in the Statistical Analysis on Fixations section)

4. Machine Learning Experiments, as described in the Use of the Dataset in Machine Learning section

Software requirements are listed in the GitHub repository.
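As an illustration of the heatmap-generation step listed above, the sketch below renders a static fixation heatmap by placing duration-weighted Gaussian kernels at the fixation locations. The kernel width `sigma` is a hypothetical choice for illustration; the released repository code is the authoritative implementation.

```python
import numpy as np

def static_heatmap(fixations, height, width, sigma=30.0):
    """Accumulate duration-weighted Gaussians at fixation points.

    fixations: iterable of (x_frac, y_frac, duration) tuples, with
    screen-fraction coordinates like FPOGX/FPOGY and durations like
    FPOGD (seconds). Returns a (height, width) array normalized to [0, 1].
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float64)
    for x_frac, y_frac, dur in fixations:
        cx, cy = x_frac * width, y_frac * height
        heatmap += dur * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    if heatmap.max() > 0:
        heatmap /= heatmap.max()  # normalize to [0, 1]
    return heatmap
```

A temporal heatmap can be produced the same way by rendering one such frame per fixation (or per time window) instead of accumulating all fixations at once.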
References

1. Irvin, J. et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 590–597 (2019).
2. Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data, https://doi.org/10.13026/C2JT1Q (2019).
3. Bluemke, D. A. et al. Assessing Radiology Research on Artificial Intelligence: A Brief Guide for Authors, Reviewers, and Readers, from the Radiology Editorial Board. Radiology, 487–489 (2020).
4. Waite, S. A. et al. Analysis of perceptual expertise in radiology: current knowledge and a new perspective. Front. Hum. Neurosci., 213 (2019).
5. Van der Gijp, A. et al. How visual search relates to visual diagnostic performance: a narrative systematic review of eye-tracking research in radiology. Adv. Health Sci. Educ., 765–787 (2017).
6. Krupinski, E. A. Current perspectives in medical image perception. Attention, Perception, & Psychophysics, 1205–1217 (2010).
7. Tourassi, G., Voisin, S., Paquit, V. & Krupinski, E. Investigating the link between radiologists' gaze, diagnostic decision, and image content. J. Am. Med. Inform. Assoc., 1067–1075 (2013).
8. Khosravan, N. et al. A collaborative computer aided diagnosis (C-CAD) system with eye-tracking, sparse attentional model, and deep learning. Med. Image Anal., 101–115 (2019).
9. Stember, J. N. et al. Eye tracking for deep learning segmentation using convolutional neural networks. J. Digit. Imaging, 597–604 (2019).
10. Aresta, G. et al. Automatic lung nodule detection combined with gaze information improves radiologists' screening performance. IEEE J. Biomed. Health Inform. (2020).
11. Mall, S., Brennan, P. C. & Mello-Thoms, C. Modeling visual search behavior of breast radiologists using a deep convolution neural network. J. Med. Imaging, 035502 (2018).
12. Goldberger, A. L. et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, e215–e220 (2000).
13. Johnson, A. et al. MIMIC-IV, https://doi.org/10.13026/A3WN-HQ05 (2020).
14. Gazepoint. GP3 Eye Tracker.
15. Gazepoint. Gazepoint Analysis UX Edition.
16. Wu, J. T. et al. AI Accelerated Human-in-the-loop Structuring of Radiology Reports. In AMIA (2020).
17. Karargyris, A. et al. Age prediction using a large chest x-ray dataset. In Mori, K. & Hahn, H. K. (eds.) Medical Imaging 2019: Computer-Aided Diagnosis, vol. 10950, 468–476, https://doi.org/10.1117/12.2512922 (SPIE, 2019).
18. Wu, J. et al. Automatic bounding box annotation of chest x-ray data for localization of abnormalities, 799–803 (2020).
19. Karargyris, A. et al. Eye gaze data for chest x-rays, https://doi.org/10.13026/QFDZ-ZR67 (2020).
20. Cheng, J., Dong, L. & Lapata, M. Long Short-Term Memory-Networks for Machine Reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 551–561 (2016).
21. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008 (2017).
22. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization.
23. Smith, L. N. Cyclical learning rates for training neural networks, 464–472 (IEEE, 2017).
24. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 1929–1958 (2014).
25. Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 234–241 (Springer, 2015).
26. Tan, M. & Le, Q. V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946 (2019).
27. Liaw, R. et al. Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118 (2018).
28. Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626 (2017).
Author contributions statement
A.K. was responsible for conceptualization, execution and management of this project.
S.K. contributed to the conceptualization and was responsible for the design and execution of the use case experiments, the data and code open sourcing, and preparation of the manuscript.
I.L. designed the neural architecture that utilizes the temporal fixation heat maps, conducted the respective machine learning experiments, created the hyper-parameter tuning pipeline and edited the manuscript.
J.W. was responsible for the clinical design of the dataset, data sampling and documentation for the project, development of the bounding box pipeline and organizing the validation of results, as well as writing and editing the manuscript.
A.S. was responsible for interpretation of images with eye tracking, experimental design, and editing of the manuscript.
M.T. contributed prior experience with eye tracking studies, equipment, and methodologies, and assisted with the analysis.
S.A. contributed to the conceptualization and was responsible for data collection and processing.
D.B. contributed to the original team formation and consulted on eye gaze tracking technology and data collection.
V.M. contributed to the initial concept review and securing funding for equipment/capital and execution of the project.
E.K. contributed to the clinical design of the study.
M.M. contributed to the design of experiments, supervision of the work and writing of the manuscript.
All authors reviewed the manuscript.
Competing interests
The authors declare no conflicts of interest.
Figures & Tables
Table 1. Online-only Master Spreadsheet

dicom-id: DICOM ID in the original MIMIC dataset.
path: Path of the DICOM image in the original MIMIC dataset.
study-id: Study ID in the original MIMIC dataset.
patient-id: Patient ID in the original MIMIC dataset.
stay-id: Stay ID in the original MIMIC dataset.
gender: Gender of the patient in the original MIMIC dataset.
anchor-age: Age range in years of the patient in the original MIMIC dataset.
image-top-pad, image-bottom-pad, image-left-pad, image-right-pad: Padding (top, bottom, left, right respectively) in pixels applied after re-scaling the MIMIC image to 1920x1080.
normal-reports: No affirmed abnormal finding labels or descriptors documented in the original MIMIC-CXR reports, extracted using an internal CXR labeling pipeline.
Normal: No abnormal chest-related final diagnosis from the Emergency Department (ED) discharge ICD-9 records AND normal-reports as defined above.
CHF: A clinical diagnosis of heart failure (includes ICD-9 for congestive heart failure, chronic or acute on chronic heart failure) from the ED visit, as determined from the associated ICD-9 discharge diagnostic code.
Pneumonia: A clinical diagnosis of any lung infection (pneumonia), including bacterial and viral, as determined from the ICD-9 discharge diagnosis code of the ED visit.
dx1, dx2, dx3, dx4, dx5, dx6, dx7, dx8, dx9: The descriptive ICD-9 diagnosis name associated with the Emergency Room admission for which the CXR study was ordered. ICD-9 final diagnoses are used to identify the 3 classes in the eye-gaze analysis and experiments.
dx1-icd, dx2-icd, dx3-icd, dx4-icd, dx5-icd, dx6-icd, dx7-icd, dx8-icd, dx9-icd: ICD-9 code for the corresponding dx.
consolidation, enlarged-cardiac-silhouette, linear-patchy-atelectasis, lobar-segmental-collapse, not-otherwise-specified-opacity, pleural-parenchymal-opacity, pleural-effusion-or-thickening, pulmonary-edema-hazy-opacity, normal-anatomically, elevated-hemidiaphragm, hyperaeration, vascular-redistribution: Abnormal finding labels derived from the original MIMIC-CXR reports by an internal IBM CXR report labeler. 0: negative, 1: positive.
atelectasis-chx, cardiomegaly-chx, consolidation-chx, edema-chx, enlarged-cardiomediastinum-chx, fracture-chx, lung-lesion-chx, lung-opacity-chx, no-finding-chx, pleural-effusion-chx, pleural-other-chx, pneumonia-chx, pneumothorax-chx, support-devices-chx: CheXpert report-derived abnormal finding labels for MIMIC-CXR. 0: negative, 1: positive, -1: uncertain.
cxr_exam_indication: The reason-for-exam sentences sectioned out from the Indication section of the original MIMIC-CXR reports. They briefly summarize the patient's immediate clinical symptoms, prior medical conditions and/or recent procedures that are relevant for interpreting the CXR study within the clinical context.

Table 2. Online-only Fixations and Eye Gaze Spreadsheets

DICOM-ID: DICOM ID from the original MIMIC dataset.
CNT: Counter variable, incremented by 1 for each data record sent by the server. Useful to determine whether any data packets are missed by the client.
TIME (in secs): Time elapsed in seconds since the last system initialization or calibration. The time stamp is recorded at the end of the transmission of the image from camera to computer. Useful for synchronization and to determine whether the server computer is processing the images at the full frame rate. For a 60 Hz camera, the TIME value should increment by 1/60 seconds.
TIMETICK (f=10000000): A signed 64-bit integer indicating the number of CPU time ticks, for high-precision synchronization with other data collected on the same CPU.
FPOGX: X-coordinate of the fixation POG, as a fraction of the screen size. (0,0) is top left, (0.5,0.5) is the screen center, and (1.0,1.0) is bottom right.
FPOGY: Y-coordinate of the fixation POG, as a fraction of the screen size. (0,0) is top left, (0.5,0.5) is the screen center, and (1.0,1.0) is bottom right.
FPOGS: Starting time of the fixation POG, in seconds since the system initialization or calibration.
FPOGD: Duration of the fixation POG in seconds.
FPOGID: Fixation POG ID number.
FPOGV: Valid flag, 1 (TRUE) if the fixation POG data is valid and 0 (FALSE) if it is not. FPOGV is TRUE only when one or both eyes are detected AND a fixation is detected; it is FALSE at all other times, for example when the subject blinks, when there is no face in the field of view, or when the eyes move to the next fixation (i.e., a saccade).
BPOGX: X-coordinate of the best eye POG, as a fraction of the screen size.
BPOGY: Y-coordinate of the best eye POG, as a fraction of the screen size.
BPOGV: Valid flag, 1 if the data is valid and 0 if it is not.
LPCX: X-coordinate of the left eye pupil in the camera image, as a fraction of the camera image size.
LPCY: Y-coordinate of the left eye pupil in the camera image, as a fraction of the camera image size.
LPD: Diameter of the left eye pupil in pixels.
LPS: Scale factor of the left eye pupil (unitless). Equals 1 at calibration depth, is less than 1 when the user is closer to the eye tracker, and greater than 1 when the user is further away.
LPV: Valid flag, 1 if the data is valid and 0 if it is not.
RPCX: X-coordinate of the right eye pupil in the camera image, as a fraction of the camera image size.
RPCY: Y-coordinate of the right eye pupil in the camera image, as a fraction of the camera image size.
RPD: Diameter of the right eye pupil in pixels.
RPS: Scale factor of the right eye pupil (unitless). Equals 1 at calibration depth, is less than 1 when the user is closer to the eye tracker, and greater than 1 when the user is further away.
RPV: Valid flag, 1 if the data is valid and 0 if it is not.
BKID: Each blink is assigned an ID value, incremented by one. BKID equals 0 for every record where no blink has been detected.
BKDUR: Duration of the preceding blink in seconds.
BKPMIN: Number of blinks in the previous 60-second period.
LPMM: Diameter of the left eye pupil in millimeters.
LPMMV: Valid flag, 1 if the data is valid and 0 if it is not.
RPMM: Diameter of the right eye pupil in millimeters.
RPMMV: Valid flag, 1 if the data is valid and 0 if it is not.
SACCADE-MAG: Magnitude of the saccade, calculated as the distance between each fixation (in pixels).
SACCADE-DIR: Direction or angle between each fixation (in degrees from horizontal).
X_ORIGINAL: X coordinate of the fixation in the original DICOM image.
Y_ORIGINAL: Y coordinate of the fixation in the original DICOM image.
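The X_ORIGINAL and Y_ORIGINAL columns are derived from FPOGX/FPOGY by undoing the 1920x1080 display scaling and the padding recorded in the master spreadsheet. A plausible sketch of that mapping is shown below; the exact transform used to produce the released spreadsheets is in the accompanying code, and this version simply assumes the image was rescaled to fill the screen minus the padding.

```python
def gaze_to_original(fpogx, fpogy, pads, orig_w, orig_h,
                     screen_w=1920, screen_h=1080):
    """Map screen-fraction gaze coordinates to original image pixels.

    pads: (top, bottom, left, right) padding in pixels, as given by the
    image-*-pad columns of the master spreadsheet. The displayed image is
    assumed to occupy the screen area that remains after the padding.
    """
    top, bottom, left, right = pads
    disp_w = screen_w - left - right   # displayed image width in pixels
    disp_h = screen_h - top - bottom   # displayed image height in pixels
    x = (fpogx * screen_w - left) * orig_w / disp_w
    y = (fpogy * screen_h - top) * orig_h / disp_h
    return x, y
```

Records with FPOGV equal to 0 should be discarded before applying the mapping.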
Figure 1. Flowchart of the study.
Table 3. Best performing hyper-parameters used for the static heatmap experiments, found using the Tune library.

Experiment Name | Optimizer | Initial Learning Rate | Scheduler | Step Size | Epochs | Dropout | γ
UNet | Adam |

Figure 2. Sampling flowchart for selecting images for this study from the MIMIC-IV (ED subset) and MIMIC-CXR datasets.

Figure 3. Left: calibration image presented to the radiologist during data collection. Right: the radiologist's fixations superimposed in red.
Figure 4. Sample video exported from Gazepoint Analysis UX Edition showing a CXR case image with overlaid fixations.

Figure 5. From left to right: CXR image, left lung, right lung, aortic knob and mediastinum.
Figure 6. Sample CXR case with 17 overlapping anatomical bounding boxes. The anatomies in the chest overlay one another on CXRs since the image is the 2D X-ray shadow capture of a 3D object.

Figure 7. Overview of the dataset.
Figure 8. Structure of the audio_segmentation_transcripts folder.
Figure 9. Left: example of incorrect detection. Right: manual correction.

Figure 10. Left: missed and incorrect transcript phrase. Right: manually corrected phrase.
Figure 11. Fixations vs. anatomical structures vs. conditions.

Figure 12. p-values (to 3 decimal places) for each pair of condition and anatomy. p-values highlighted in red indicate statistically significant differences.

Figure 13. Model architecture for leveraging temporal eye gaze information.
Figure 14. Experimental results with (right) and without (left) temporal eye gaze information.

Figure 15. Block diagram of the U-Net utilizing the static heatmap combined with a classification head.
Figure 16. AUC results for the U-Net with static eye gaze information (right) and the baseline classifier (left).
(a) CHF. The physician's eye gaze tends to fall on the enlarged heart and hila, as well as generally on the lungs. (b) Pneumonia. The physician's eye gaze predictably focuses on the focal lung opacity. (c) Normal. Because the lungs are clear, the physician's eye gaze skips around the image without focus.
Figure 17. Qualitative comparison of the interpretability of U-Net based probability maps with Grad-CAM. From left to right: CXR image, Grad-CAM from the baseline model, Grad-CAM from the U-Net encoder, static eye gaze heatmap, and U-Net probability map.

Figure 18.

{
  "full_text": "prominent heart. cardiomegaly in fact. suspect small left effusion with atelectasis. right lung appears clear.",
  "time_stamped_text": [
    { "begin_tim": 1.0, "end_time": 1.7000000000000002, "phrase": "prominent" },
    { "begin_tim": 1.7000000000000002, "end_time": 2.0, "phrase": "heart." },
    { "begin_tim": 3.0, "end_time": 3.8, "phrase": "cardiomegaly" },
    { "begin_tim": 3.8, "end_time": 4.2, "phrase": "in fact." },
    { "begin_tim": 6.3, "end_time": 7.1, "phrase": "suspect" },
    { "begin_tim": 8.2, "end_time": 8.9, "phrase": "small" },
    { "begin_tim": 8.9, "end_time": 9.0, "phrase": "left" },
    { "begin_tim": 9.0, "end_time": 9.4, "phrase": "effusion" },
    { "begin_tim": 10.7, "end_time": 11.2, "phrase": "with" },
    { "begin_tim": 11.2, "end_time": 11.6, "phrase": "atelectasis." },
    { "begin_tim": 17.9, "end_time": 18.3, "phrase": "right" },
    { "begin_tim": 18.3, "end_time": 18.5, "phrase": "lung" },
    { "begin_tim": 18.5, "end_time": 18.6, "phrase": "appears" },
    { "begin_tim": 18.6, "end_time": 19.0, "phrase": "clear." }
  ]
}
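A minimal sketch of consuming such a transcript JSON, for example to look up which phrase was being dictated at a given moment so that it can be aligned with eye gaze fixations by timestamp. The key for a phrase's start time appears as `begin_tim` in the example above, so the sketch accepts either that spelling or `begin_time`.

```python
import json

def phrases_active_at(transcript_json, t):
    """Return the phrases being dictated at time t (in seconds)."""
    data = json.loads(transcript_json)
    active = []
    for p in data["time_stamped_text"]:
        # Accept either spelling of the start-time key.
        begin = p.get("begin_time", p.get("begin_tim"))
        if begin is not None and begin <= t < p["end_time"]:
            active.append(p["phrase"])
    return active
```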