RTEX: A novel methodology for Ranking, Tagging, and Explanatory diagnostic captioning of radiography exams
Vasiliki Kougia, John Pavlopoulos*, Panagiotis Papapetrou
Stockholm University, Sweden
{kougia.vasiliki, ioannis, panagiotis}@dsv.su.se

Max Gordon
Karolinska Institutet
[email protected]
ABSTRACT
This paper introduces RTEX, a novel methodology for (a) ranking radiography exams based on their probability of containing an abnormality, (b) generating abnormality tags for abnormal exams, and (c) providing a diagnostic explanation in natural language for each abnormal exam. Ranking radiography exams is an important first step for practitioners who want to identify and prioritize the exams that are more likely to contain abnormalities, for example, to avoid mistakes due to tiredness or to manage a heavy workload (e.g., during a pandemic). We used two publicly available datasets to assess our methodology and demonstrate that, for the task of ranking, it outperforms its competitors in terms of nDCG@k. For each abnormal radiography exam, RTEX generates a set of abnormality tags alongside an explanatory diagnostic text that explains the tags and guides the medical expert. Our tagging component outperforms two strong competitor methods in terms of F1. Moreover, the diagnostic captioning component of RTEX, which exploits the already extracted tags to constrain the captioning process, outperforms all competitors with respect to clinical precision and recall.

KEYWORDS
Medical Imaging, Diagnostic Image Captioning, Captioning, Explainability.
ACM Reference Format:
Vasiliki Kougia, John Pavlopoulos, Panagiotis Papapetrou, and Max Gordon. 2020. RTEX: A novel methodology for Ranking, Tagging, and Explanatory diagnostic captioning of radiography exams. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Medical imaging is the method of forming visual representations of the anatomy or a function of the human body using a variety of imaging modalities (e.g., CR, CT, MRI) [1, 41]. In this paper, we focus in particular on chest radiography exams, which contain medical images produced by X-rays. It is estimated that over three billion radiography exams are performed annually worldwide [22],
making the daily need for processing and interpretation of the produced radiographs paramount. An example of a radiography exam is provided in Fig. 1; it consists of two chest radiographs, together with the diagnostic text describing the medical observations on the radiographs, and a list of abnormality tags indicating the most critical observations in the exam. In the diagnostic text, we observe that two findings are normal (i.e., cardiac contours and lungs), while three are abnormal, i.e., thoracic spondylosis, lower cervical arthritis, and basilar atelectasis. These abnormal findings are also consistent with the abnormality tags.

Figure 1: A PA/lateral chest radiography exam along with the corresponding human-authored diagnostic text from IU X-ray, and the abnormality tags. The 'XXXX' is due to the de-identification process.

Our main objective in this paper is to introduce a novel methodology for automated and explainable diagnostic tagging of a collection of radiography exams, each comprising several radiographs, that can (1) accurately rank exams with abnormalities included in the radiographs, (2) automatically provide tags corresponding to the medical findings of the abnormal exams, and (3) produce a diagnostic text describing the abnormality findings by exploiting both the radiographs and the generated tags. Despite the importance of this problem, existing solutions are hindered by three major challenges.
Challenge I - Screening and prioritization.
The daily routine of diagnostic radiologists includes the examination of radiographs, i.e., medical images produced by X-rays, for abnormalities or other findings, and an explanation of these findings in the form of a medical report per radiography exam [30]. This is a challenging and time-consuming task, imposing a high burden on both radiologists and patients. For example, approximately 230,000 patients in England wait for over a month for their imaging test results [32], while 71% of the clinics in the U.K. report a lack of clinical radiologists [33]. While several methods have emerged that automatically detect abnormalities in radiographs [36] or generate a diagnostic text [15, 24, 26], little emphasis has been given to case prioritization and screening. There is hence a need for a new diagnostic approach that can automatically screen radiography exams with abnormalities and prioritize those with a higher probability of containing an abnormality.

Challenge II - Clinically correct diagnostic captioning.
Methods that can automatically generate (or retrieve) diagnostic text can be used to assist inexperienced physicians, and they can also yield a draft to speed up the authoring process [21, 24, 26]. However, to the best of our knowledge, none of the diagnostic captioning models suggested in the literature is optimized in terms of clinical correctness, mainly because they are trained on both normal and abnormal radiography exams. This makes them less effective than models trained only on abnormal exams, as we also demonstrate in Sec. 4.3.3. There is hence a need for a diagnostic captioning approach that is optimized for captioning abnormal radiographs.

Challenge III - Explainability.
Explanations and clinical relevance are often provided in the form of visual highlights (e.g., heatmaps) alongside diagnostic tags [17]. Nonetheless, system-generated visual explanations only function as a means of highlighting the image parts relevant to the diagnostic tags, without any textual explanation. On the other hand, diagnostic captioning methods can provide both a diagnosis and an explanation for the problem at hand, since they produce a whole text instead of a tag or a label. The produced reports, however, are typically of low clinical correctness, as they are not particularly optimized in terms of clinical relevance [4]. The above deficiencies could be addressed by a diagnostic tagging approach that first produces tags for abnormal radiographs, and then employs the generated tags to provide clinically relevant explanations in the form of diagnostic text.

Contributions.
This paper addresses the aforementioned challenges, with the main contributions summarized as follows:
• Novelty. We introduce RTEX, a novel methodology for explainable diagnostic tagging of radiography exams, that addresses the aforementioned challenges with the help of three key functionalities: (1) Ranking of abnormal radiography exams: a ranking approach is employed to prioritize, within a large collection of normal and abnormal radiography exams, the exams that are likelier to include an abnormality; (2) Diagnostic tagging: a tag generator, trained on an independent set of abnormal radiographs, is employed to generate a set of abnormality tags for the highly ranked radiography exams; (3) Diagnostic captioning: the extracted tags are finally used by RTEX to generate (or retrieve) a diagnostic text, in natural language, that provides a clinically relevant explanation of the detected abnormal findings. (Such an explanation is explicitly required by the EU's General Data Protection Regulation; GDPR, Art. 13 §2.f: gdpr.eu/article-13-personal-data-collected/.)
• Applicability and efficiency. We provide an empirical evaluation of the proposed methodology, using two publicly available datasets of radiography exams [6, 16]. Our experimental benchmarks assess the ability of RTEX to (a) rank abnormal radiography exams higher than normal ones, (b) produce the correct medical abnormality tags for abnormal radiography exams, and (c) explain the reasoning behind the selection of the detected tags in the form of diagnostic text. Moreover, a runtime experiment demonstrates the time efficiency of RTEX, showing that it requires only 19.78 seconds to rank 500 radiography exams, and 19.43 seconds for tagging and diagnostic captioning of the top-100 ranked exams.
• Effectiveness and clinical accuracy. Our experiments demonstrate the effectiveness of RTEX against state-of-the-art competitors for the tasks of ranking and tagging. Our findings additionally suggest that diagnostic captioning using the tags produced by RTEX can provide more clinically accurate diagnostic text compared to not using the generated tags.

The remainder of this paper is organized as follows: in Sec. 2 we outline the related work, and in Sec. 3 we describe RTEX. In Sec. 4 we introduce the datasets used for our empirical evaluation, present the experimental setup, and report our results. Finally, Sec. 5 concludes the paper and provides directions for future work.

2 RELATED WORK
In this section, we outline the main body of related work on medical image ranking, medical image tagging, and diagnostic captioning. To the best of our knowledge, while many earlier works have targeted these problems individually, there is as yet no comprehensive methodology that combines the three tasks with a focus on radiography exams that contain abnormalities.

Automated screening of radiography exams is not a novel idea [14, 34, 42]. When the number of exams is overwhelming, as for example during a pandemic, the employment of an automated system to exclude normal cases can lead to faster treatment of abnormal cases. Recently, pre-trained deep learning models, such as DenseNet-121 [12] and VGG-19 [39], were found to discriminate well between normal cases and ones with pneumonia or COVID-19 (90% Precision and 83% Recall) [17]. The authors noted that their approach aims to ease the work of radiologists, and a similar assistance scenario is suggested in this work. In our solution, we also employ the DenseNet-121 CNN for multi-label classification, which is considered to be the state of the art [3].

Researchers have focused on labeling radiography exams that are associated with a single abnormality finding, e.g., a lymph node [38] or end-diastole/systole frames in cine-MRI of the heart [19]. This means that an assumption is made that the problem is a priori known
(e.g., an abnormality related to the lymph node). This is not always the case, for example when the radiographs of a new patient arrive at the clinic for the first time. Another line of research, that of exploring multiple abnormality types, has been focusing on associating medical tags (a.k.a. concepts) with radiographs, which is related to content-based image retrieval (CBIR). Liu et al. [27] trained a custom CNN to classify radiographs into 193 classes and obtained a descriptive feature vector to be further processed and used for image retrieval. Their approach was found to be more accurate than many submissions to an earlier CLEF medical image annotation challenge, but it was also inferior to the state of the art. A similar medical image annotation challenge still exists today (ICLEF CAPTION), with tens of submissions each year [36]. Participating systems were asked to tag medical images extracted from open-access biomedical journal articles of PubMed Central, where the tags were automatically extracted from each figure caption using QuickUMLS [40]. Systems that used engineered visual features (Scale-Invariant Feature Transform) to encode the images were not ranked very highly (26th/49), while systems using CNNs to encode the images were placed better. The 4th best system was a ResNet-101 CNN followed by an attentional RNN multi-label image classifier. The 3rd best system was a DenseNet-121 CNN encoder followed by a k-NN image retrieval system, while the 1st place was awarded to a DenseNet-121 CNN followed by a Feed Forward Neural Network classifier. This work builds on top of the two best performing systems. (The 2nd place was awarded to an ensemble of the two best performing systems.)

Diagnostic captioning has not yet been investigated in the literature as an explainability step of diagnostic tagging. While Gale et al. [8] suggested the use of captioning as an explanation step, they manually assembled sentence templates for systems to learn to fill. A dataset comprising medical images and texts was introduced for a challenge [5, 7], but it was very noisy (the images were figures extracted from scientific articles and the gold reports were their captions) [21].
Diagnostic captioning methods are usually encoder-decoders [2, 11, 28], which often originate from generic image captioning. Although different variations have been suggested in the literature [15, 23, 26, 46], most of these methods extend the very well-known Show & Tell (S&T) model [43] with hierarchical decoding [15], elaborate visual attention mechanisms [44], or reinforcement learning [23]. S&T comprises a CNN that encodes the image, and the visual representation is used to initialize a decoding LSTM. We employed this model to generate diagnostic text, having extended it to also encode the tags along with the image. Li et al. [23] employ an encoder-decoder approach to either generate or retrieve the diagnostic text from a medical image. Their hybrid approach initially uses a DenseNet [12] or a VGG-19 [39] CNN to encode the image. The encoded image is used, through an attention mechanism [29, 45], in a stacked RNN that generates sentence embeddings, each of which is used along with the encoded image by another word-decoding RNN to generate the words of the sentence. Each sentence embedding is provided as input to a Feed Forward Neural Network (FFNN), which outputs a probability distribution over a number of fixed sentences and a word decoder. If a fixed sentence has the highest probability, then this sentence is retrieved as the next sentence instead of using the word-decoding RNN. For the explanation stage of our methodology, we also experiment with CNN-RNN encoder-decoder methods, but mainly to explain the extracted diagnostic tags; moreover, our encoder-decoders are trained only on abnormal studies, which makes sentence retrieval redundant.
3 THE RTEX METHODOLOGY
We present RTEX, a novel three-stage methodology for ranking and explainable diagnostic tagging of radiography exams; an overview of the whole pipeline is depicted in Fig. 2. First, we provide the problem formulation, presenting the three sub-problems addressed by the respective stages of RTEX.

Figure 2: A depiction of our RTEX methodology. First, it ranks the radiography exams based on their probability (i.e., using the radiographs of each exam) to include an abnormality. The highest ranked are tagged with abnormality terms, and an explanatory diagnostic text is automatically provided to assist the expert.

3.1 Problem Formulation
Let S = {S_1, ..., S_n} be a set of n radiography exams, where each exam S_i ∈ S is a set of radiographs, i.e., S_i = {M_{i1}, ..., M_{im}}. In our target application we have m = 2, that is, each S_i is a pair of radiographs (one frontal and one lateral). Our formulation and approach can, however, be generalized to an arbitrary number of radiographs m.

Assume an alphabet of abnormality tags A. Each radiography exam S_i is assigned a set of labels L_i ⊆ A, either listing the abnormalities that are detected in the images or being empty, indicating that the exam contains no abnormalities.

Based on the above, the first objective of RTEX can be formulated as follows:

PROBLEM 1 (radiography exam ranking). Given a set of radiography exams S, a ranking function r(·), and an integer k, identify the set H_k of the top-k abnormal exams in S such that r(·) is maximized.

Next, given the retrieved set H_k of the top-k exams, our goal is to produce a set of abnormality tags. This brings us to the second objective of RTEX, which can be formulated as follows:

PROBLEM 2 (abnormal radiography exam diagnostic tagging). Given a set of abnormal radiography exams H_k, produce a set of abnormality tags T, with each tag originating from the alphabet A. Each set of tags T_j ∈ T describes one exam S_j ∈ H_k.

In other words, all images contained in a single radiography exam (S_j) are described by a common set of abnormality tags (T_j). Eventually, given the set of produced tags, our final goal is to obtain a diagnostic caption explaining the abnormalities shown in the images contained in the radiography exam and referenced by the extracted tags. More formally, RTEX's third objective is formulated as follows:

PROBLEM 3 (abnormal radiography exam diagnostic captioning). Given a set of abnormal radiography exams H_k and a set of tags T describing the abnormalities in each exam, provide a set of captions C, where each caption C_j ∈ C describes radiography exam S_j ∈ H_k.

3.2 The Stages of RTEX
The three stages of RTEX are outlined in Alg. 1; next, we provide more details for each stage.

Algorithm 1: Outline of the RTEX methodology
  Data: a set of radiography exams S and the number k of exams to retrieve.
  Result: a set T of abnormality tags and a set C of captions.
  // define a list to maintain the score of each radiography exam
  scores = {}
  // apply the RTEX ranking function
  for S_i ∈ S do
      scores_i = RTEX@R(S_i)
  // sort S with respect to the scores, in descending order
  S' = sort(S, scores, "descend")
  // keep the top-k abnormal exams
  H_k = filter(S', k)
  C, T ← {}
  for S_j ∈ H_k do
      // apply the RTEX tagging function
      T_j = RTEX@T(S_j)
      // apply the RTEX captioning function
      C_j = RTEX@X(S_j, T_j)
  return {T, C}
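For concreteness, a minimal Python sketch of Alg. 1 follows. It is an illustrative rendering rather than the authors' released implementation, and the stage functions rtex_r, rtex_t, and rtex_x are assumed callables corresponding to the components described in Secs. 3.2.1 to 3.2.3.

def rtex_pipeline(exams, rtex_r, rtex_t, rtex_x, k):
    """Rank all exams, keep the top-k, then tag and caption each of them.

    exams: dict mapping an exam id to its pair of radiographs."""
    # Score every exam with the ranking function (abnormality probability).
    scores = {eid: rtex_r(exam) for eid, exam in exams.items()}
    # Sort by score in descending order and keep the top-k exams (H_k).
    h_k = sorted(scores, key=scores.get, reverse=True)[:k]
    tags, captions = {}, {}
    for eid in h_k:
        tags[eid] = rtex_t(exams[eid])                 # tag set T_j
        captions[eid] = rtex_x(exams[eid], tags[eid])  # caption C_j
    return tags, captions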
3.2.1 RTEX@R: Ranking. For the first stage of our methodology we implement an architecture which we refer to as RTEX@R, shown in Fig. 3. More concretely, we employ the same visual encoder as in [37], i.e., the DenseNet-121 CNN, followed by a Feed Forward Neural Network (FFNN). The input of the network is the pair of images of a radiography exam, while the output is a score representing the probability that the exam in question is abnormal. First, both images of the exam are fed to DenseNet-121 (depicted inside the box in the center of Fig. 3), and an embedding for each image is extracted from its last average pooling layer. These embeddings are concatenated to yield a single embedding for the radiography exam. Then, the exam embedding is passed to an FFNN with a sigmoid activation, returning a score from 0 (normal) to 1 (abnormal).

Figure 3: The architecture of RTEX@R. The input is a radiography exam and the output is the probability of the exam being abnormal.
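As an illustration, the RTEX@R architecture can be sketched as follows with PyTorch and torchvision. The 1024-dimensional DenseNet-121 features and the concatenation of the two image embeddings follow the description above, while the hidden-layer size (256) is an arbitrary choice of ours, since the paper does not specify it.

import torch
import torch.nn as nn
from torchvision import models

class RTExR(nn.Module):
    """Sketch of RTEX@R: a shared DenseNet-121 encoder plus an FFNN scorer."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        densenet = models.densenet121(weights="IMAGENET1K_V1")
        self.features = densenet.features   # CNN trunk, 1024 feature maps
        self.pool = nn.AdaptiveAvgPool2d(1)  # last average pooling layer
        self.ffnn = nn.Sequential(
            nn.Linear(2 * 1024, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # abnormality score in [0, 1]
        )

    def embed(self, image):
        # image: (batch, 3, H, W) -> (batch, 1024) image embedding
        return self.pool(torch.relu(self.features(image))).flatten(1)

    def forward(self, frontal, lateral):
        # Concatenate the two image embeddings into one exam embedding.
        exam = torch.cat([self.embed(frontal), self.embed(lateral)], dim=1)
        return self.ffnn(exam).squeeze(1)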
3.2.2 RTEX@T: Diagnostic tagging. The second stage of our methodology, referred to as RTEX@T, comprises the assignment of a set of tags T_j to each radiography exam S_j ∈ H_k. The architecture, shown in Fig. 4, is similar to RTEX@R in that it uses the DenseNet-121 CNN encoder and an FFNN. It differs in that the FFNN has one output node with a sigmoid activation per abnormality tag in the dataset, leading to |A| output nodes (the rightmost arrows in the figure). In effect, it returns one probability per abnormality tag, and if the probability of a tag (i.e., its respective node) exceeds a learned threshold, then the tag is assigned to the radiography exam.

Figure 4: The architecture of RTEX@T, which is similar to RTEX@R, but the input is an abnormal radiography exam and the output consists of |A| binary nodes, where |A| is the total number of tags in the dataset. The nodes that yield probabilities higher than a defined threshold indicate the presence of the respective medical abnormalities.
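A corresponding sketch of the RTEX@T head is given below. It reuses the concatenated exam embedding from the previous sketch (2 x 1024 dimensions), and the default 0.5 thresholds are placeholders for the learned per-tag thresholds mentioned in the text.

import torch
import torch.nn as nn

class RTExT(nn.Module):
    """Sketch of RTEX@T: one sigmoid output node per abnormality tag."""
    def __init__(self, num_tags: int, embed_dim: int = 2 * 1024):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_tags)
        # One decision threshold per tag; placeholders for learned values.
        self.register_buffer("thresholds", torch.full((num_tags,), 0.5))

    def forward(self, exam_embedding):
        # Independent sigmoid per tag (multi-label, not a softmax).
        return torch.sigmoid(self.classifier(exam_embedding))

    def assign_tags(self, exam_embedding):
        # A tag is assigned when its probability exceeds its threshold.
        return self.forward(exam_embedding) > self.thresholds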
3.2.3 RTEX@X: Diagnostic captioning. For the last stage of our methodology (the rightmost part of Fig. 2), referred to as RTEX@X, we use a method that comprises a DenseNet-121 CNN encoder, calibrated for the task of diagnostic captioning. More specifically, each radiography exam in the database is encoded (offline) by our CNN to an embedding (i.e., the two image embeddings extracted from the last average pooling layer of the encoder, concatenated). Our CNN also encodes any new test exam. Then, the cosine similarity between the test embedding and all the training embeddings in the database is calculated, and the most similar exam is retrieved from the database. Its diagnostic text is then assigned to the test exam. RTEX@X limits its search to training exams that have exactly the same tags as the ones predicted (during the tagging stage) for the test exam; when no such exams exist, the whole database is searched. We note that all the embeddings are first L2-normalized, so that the cosine similarities between a test embedding and all the training embeddings in the database can be computed with a single matrix multiplication. This reduces the search time from minutes to milliseconds, making this method in effect the most efficient compared to its competitors.
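The retrieval step can be sketched as follows with NumPy, under the assumption that all exam embeddings have been precomputed and L2-normalized; all names are illustrative.

import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve_caption(test_emb, train_embs, train_tags, train_captions, pred_tags):
    """test_emb: (d,); train_embs: (n, d); both L2-normalized, so cosine
    similarity reduces to a single matrix-vector product."""
    # Restrict the search to exams whose tags exactly match the predicted
    # ones; fall back to the whole database when no such exam exists.
    candidates = [i for i, tags in enumerate(train_tags) if tags == pred_tags]
    if not candidates:
        candidates = list(range(len(train_captions)))
    sims = train_embs[candidates] @ test_emb  # all cosine similarities at once
    best = candidates[int(np.argmax(sims))]
    return train_captions[best]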
4 EMPIRICAL EVALUATION
In this section, we describe the datasets used for our experiments, provide details on the experimental setup, and present our results.

4.1 Datasets
Datasets that can be used for diagnostic captioning comprise medical images and associated diagnostic reports. We are aware of four such publicly available datasets, namely IU X-RAY, MIMIC-CXR, PEIR GROSS, and ICLEF CAPTION, but we employ only the former two, which are of high quality [21]. (We limit our radiography exams to datasets with reports in English. PEIR GROSS comprises medical images and photographs, mainly for educational purposes, while ICLEF CAPTION comprises images extracted from scientific articles and uses the caption of each such image as the respective report.)

IU X-RAY: IU X-ray [6] is a collection of radiology exams, including chest radiographs, abnormality tags, and radiologist narrative reports, and is publicly available through the Open Access Biomedical Image Search Engine (OpenI, https://openi.nlm.nih.gov/). The dataset consists of 3,995 reports (one report per patient) and 7,470 frontal or lateral radiographs, with each radiology report consisting of an 'Indication' (e.g., symptoms), a 'Comparison' (e.g., previous information about the patient), a 'Findings', and an 'Impression' section. Each report contains two groups of tags. First, there are manual tags assigned by two trained coders, each comprising a heading (disorder, anatomy, object, or sign) and subheadings (e.g., 'Hiatal/large', where 'large' indicates the anatomical site of the disease). Second, the 'Findings' and 'Impression' sections were used to associate each report with a number of automatically extracted tags, produced by the Medical Text Indexer [31] (MTI tags). An example case is shown in Fig. 1, where it can be seen that the MTI tags are simple words or terms (e.g., 'Hiatus').

For the ranking stage of our methodology, each exam was labeled as abnormal if one or more manual abnormality tags were assigned, and as normal otherwise (i.e., the tag 'normal' or 'no indexing' was assigned). For the tagging stage, we employed the MTI codes, because the manual codes do not explicitly describe the abnormality, but most often also include other information (e.g., the anatomical site). For the explanation stage, we employed the 'Findings' section. Also, in our experiments we used only exams with two images, considering this to be the standard (one frontal and one lateral radiograph), and excluded the rest. We also discarded the exams that did not have a 'Findings' section. This resulted in 2,790 exams, of which 1,952 are used for training, 276 for validation, and 562 for testing. The class ratio in the dataset is slightly imbalanced, with 39% normal radiology exams. Abnormal exams are assigned 3 tags on average, while the most frequent tag is 'degenerative change' (216 exams). The length of the diagnostic text in each report is 40 words on average. For normal exams, the diagnostic text can be exactly the same for many different patients; e.g., the finding 'The heart is normal in size. The mediastinum is unremarkable. The lungs are clear.' appears in 29 exams. By contrast, the most frequent abnormal report appeared exactly the same in 7 reports.
MIMIC-CXR:
This dataset comprises 377,110 chest radiographs associated with 227,835 radiography exams, which come from 64,588 patients of the Beth Israel Deaconess Medical Center between 2011 and 2016. As in IU X-RAY, the reports in MIMIC are organized in sections, but some reports include additional sections such as 'History', 'Examination', or 'Technique', not in a consistent manner, because the structure of the reports and the section names were not enforced by the hospital's user interface [16]. The current version of the dataset (MIMIC-CXR v2.0.0, https://mimic-cxr.mit.edu/) does not contain the initial labels, so we re-produced them by applying the CHEXPERT disease mention labeler [13] on the reports, as described in Johnson et al. [16]. CHEXPERT classifies texts into 14 labels (13 diagnoses and 'No Finding'), each marked as 'negative', 'positive', or 'uncertain' for a specific text. We treated the labels marked uncertain as positive. For the ranking step, we labeled exams as normal when the 'No Finding' label was assigned. In total, there are 40,306 exams with two images, which correspond to 29,482 patients. After removing 11 exams that did not have a 'Findings' section, which we use for the explanation stage of RTEX, we split the dataset into 70% (training), 10% (validation), and 20% (test) with respect to patients, using the same split as in Li et al. [23, 24]. For our experiments we randomly kept one exam per patient and sampled 2,300 patients from the training set, 300 from the validation set, and 650 from the test set; 68% of this final dataset consists of normal exams. Each abnormal exam has 2 labels on average, while the most common label is 'Pneumonia'. The average diagnostic text length is 55 words. In this dataset, many normal cases have the same diagnostic text; e.g., the most common normal caption appears in 53 exams. Considering only the abnormal exams, the most frequent caption appears 4 times.
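The label post-processing just described can be sketched as follows. We assume the usual CheXpert labeler output convention of 1.0 (positive), 0.0 (negative), -1.0 (uncertain), and a missing value for labels that are not mentioned.

def is_positive(value):
    # Uncertain mentions (-1.0) are treated as positive, as described above.
    return value in (1.0, -1.0)

def exam_is_normal(chexpert_labels):
    """chexpert_labels: dict mapping the 14 CheXpert labels to 1.0 / 0.0 /
    -1.0 / None. For the ranking step, an exam is normal when the
    'No Finding' label was assigned."""
    return is_positive(chexpert_labels.get("No Finding"))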
4.2 Experimental Setup
For each of the three stages of RTEX we benchmark the respective technique against competitors. Next, we outline the competitor methods and the performance metrics used for each benchmark.

We investigated one baseline method, referred to as RANDOM, and two competitor methods, referred to as CNN+NN and CNN+kNN, for both the ranking and the tagging stage. These methods were benchmarked against RTEX@R and RTEX@T, respectively. For the two competitors, the ranking is determined based on the tags produced by these methods when trained on both normal and abnormal exams. At the tagging stage, the tags are obtained by retraining the same methods only on abnormal exams. Next, we describe the baseline as well as the two tagging methods.

RANDOM. This is a baseline method used both for ranking and tagging, simulating the case where no screening is performed. For the ranking task it randomly returns a number serving as the abnormality probability. For tagging, it simply assigns a set of random tags from the training set; the number of tags assigned is the average number of tags per training exam.
CNN+NN. This method employs a DenseNet-121 CNN [12] encoder, pre-trained on ImageNet and fine-tuned on our datasets (IU X-RAY or MIMIC-CXR). CNN+NN encodes all images (from the training and test sets) and concatenates the obtained representations of the radiographs in an exam (S_j) to yield a single representation per exam (V_j). Then, for each test representation, the cosine similarity against all the training representations is computed and the nearest exam is returned. When generating tags, the abnormality tags of the nearest exam are returned and assigned to the test exam.
CNN+kNN. This method is an extension of CNN+NN that uses the k most similar training exams to compute the tags T_j for exam S_j. To constrain the number of returned tags (|T_j|), only the r most frequent tags of the k exams are kept, where we set r to be the average number of tags per exam among the particular k retrieved exams. We note that CNN+kNN is a very strong baseline for tagging: it was ranked third in a recent medical tagging competition [20], where the first two places were taken by RTEX@T (see Section 3.2.2) and an ensemble of CNN+kNN and RTEX@T, respectively.

To solve the ranking problem, we adapted CNN+NN and CNN+kNN as follows. The abnormality tags of the most similar radiography exam(s) in the training set are returned, and a probability score P is computed using the following formula:

P = (Σ_{t ∈ T_j} rel(t)) / |G|,   (1)

where G is the set of all ground-truth tags of the dataset, T_j is the set of generated tags for radiography exam S_j, and rel(t) = 1 when t ∈ G and zero otherwise. P will usually be close to zero. The main intuition is that the more tags are assigned, the higher P is, and the likelier it is that the exam is abnormal.
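Both the tag aggregation of CNN+kNN and the abnormality score of Eq. (1) are simple to state in code; the sketch below is illustrative.

from collections import Counter

def knn_tags(neighbor_tag_sets):
    # Keep the r most frequent tags among the k retrieved exams, where r is
    # the average number of tags per retrieved exam.
    r = round(sum(len(t) for t in neighbor_tag_sets) / len(neighbor_tag_sets))
    counts = Counter(tag for tags in neighbor_tag_sets for tag in tags)
    return {tag for tag, _ in counts.most_common(r)}

def abnormality_score(assigned_tags, dataset_tags):
    # Eq. (1): rel(t) = 1 when t is in G, the set of all ground-truth tags.
    rel = sum(1 for t in assigned_tags if t in dataset_tags)
    return rel / len(dataset_tags)  # usually close to zero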
Evaluation metrics. Ranking methods were evaluated in terms of nDCG@k, with varying k. We also used Precision@k, but preliminary experiments showed that this measure correlates highly with nDCG@k. Tagging methods were evaluated in terms of F1: we used the top-k abnormal cases (ranked by RTEX@R) to compute the F1 score between their predicted and their gold tags.
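For reference, the two measures can be sketched as follows, with binary relevance (1 for abnormal, 0 for normal) for the ranking task; the standard log2-discounted form of DCG is assumed, since the paper does not spell out the variant.

import math

def dcg_at_k(relevances, k):
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    # Normalize by the DCG of the ideal (descending) ordering.
    idcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0

def f1(pred_tags, gold_tags):
    # Per-exam F1 between the predicted and the gold tag sets.
    if not pred_tags or not gold_tags:
        return 0.0
    tp = len(set(pred_tags) & set(gold_tags))
    p, r = tp / len(pred_tags), tp / len(gold_tags)
    return 2 * p * r / (p + r) if p + r else 0.0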
For the task of diagnostic captioning, we benchmarked three competitors, showing the benefits in terms of clinical correctness when using the generated tags.

S&T. This method was introduced by Vinyals et al. [43] for image captioning, and it is only applicable to the stage of diagnostic captioning. As the encoder of the S&T architecture we employ the DenseNet-121 [12] CNN, which is used to initialize an LSTM-RNN decoder [10]. A dense layer on top outputs a probability distribution over the words of the vocabulary, so that the decoder generates one word at a time. The word generation process continues until a special 'end' token is produced or the maximum caption length is reached.
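The decoding loop amounts to the following sketch; decoder_step (one LSTM step returning a next-word distribution and the updated state) and the special token ids are assumptions for illustration.

import torch

def greedy_decode(decoder_step, state, start_id, end_id, max_len=40):
    words, token = [], start_id
    for _ in range(max_len):
        probs, state = decoder_step(token, state)  # distribution over vocab
        token = int(torch.argmax(probs))           # most probable next word
        if token == end_id:                        # special 'end' token
            break
        words.append(token)
    return words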
S&T+. This method extends S&T (also applicable solely to diagnostic captioning) so that the generated text explains the predicted tags. Hence, after the encoding phase and prior to the decoding phase (before the generation of the first word), the tags are provided to the decoder as if they were words of the diagnostic text, similar to teacher forcing [9]. Since the decoder is an RNN, this acts as a prior during the decoding that follows.

ETD. This method follows a tag- and image-constrained encoder-decoder architecture, which we call ETD. A DenseNet-121 CNN [12] yields one visual embedding per exam. The decoder is an LSTM constrained by the visual embedding and the tags T_j ∈ T that were assigned to exam S_j ∈ H_k during the previous step (see Section 3.2.2). More formally, at each time step s the decoder learns a hidden state h_s as the non-linear combination (the weight matrices W are learned) of the input word x_s and the previous hidden state h_{s-1}:

i_s = σ(W_i · [x_s, V_j, E_j, h_{s-1}] + b_i)
f_s = σ(W_f · [x_s, V_j, E_j, h_{s-1}] + b_f)
o_s = σ(W_o · [x_s, V_j, E_j, h_{s-1}] + b_o)
q_s = tanh(W_q · [x_s, V_j, E_j, h_{s-1}] + b_q)
c_s = f_s · c_{s-1} + i_s · q_s
h_s = o_s · tanh(c_s)

where i_s and f_s are the LSTM input and forget gates, regulating how much information from the current and the previous cell is forgotten, V_j is the visual representation from the last average pooling layer of the DenseNet encoder, and E_j is the centroid of the word embeddings of the tags T_j:

E_j = (1 / |T_j|) Σ_{t ∈ T_j} W_e · t
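A direct PyTorch transcription of these equations might look as follows; the bias terms live inside the nn.Linear layers, and all dimensions are illustrative.

import torch
import torch.nn as nn

class TagConstrainedLSTMCell(nn.Module):
    """One ETD decoder step: the gates see [x_s, V_j, E_j, h_{s-1}]."""
    def __init__(self, word_dim, visual_dim, tag_dim, hidden_dim):
        super().__init__()
        in_dim = word_dim + visual_dim + tag_dim + hidden_dim
        self.W_i = nn.Linear(in_dim, hidden_dim)  # input gate
        self.W_f = nn.Linear(in_dim, hidden_dim)  # forget gate
        self.W_o = nn.Linear(in_dim, hidden_dim)  # output gate
        self.W_q = nn.Linear(in_dim, hidden_dim)  # candidate cell state

    def forward(self, x_s, V_j, E_j, h_prev, c_prev):
        z = torch.cat([x_s, V_j, E_j, h_prev], dim=-1)
        i_s = torch.sigmoid(self.W_i(z))
        f_s = torch.sigmoid(self.W_f(z))
        o_s = torch.sigmoid(self.W_o(z))
        q_s = torch.tanh(self.W_q(z))
        c_s = f_s * c_prev + i_s * q_s
        h_s = o_s * torch.tanh(c_s)
        return h_s, c_s

def tag_centroid(tag_embeddings):
    # E_j: the centroid of the word embeddings of the tags in T_j.
    return tag_embeddings.mean(dim=0)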
For all the text generation methods mentioned above, we preprocessed the text by tokenizing, lower-casing the words, and removing digits and words of length 1. We used the Adam optimizer [18] everywhere, with an initial learning rate of 10e-3. RTEX@T and RTEX@R used a learning-rate reduction mechanism [37].

Evaluation metrics. We employed both word-overlap and clinical correctness measures to evaluate the system-produced diagnostic text. The most common word-overlap measures in diagnostic captioning are BLEU [35] and ROUGE-L [25]. BLEU is precision-based and measures word n-gram overlap between the produced and the ground-truth texts. ROUGE-L measures the ratio of the length of the longest common n-gram shared by the produced text and the ground-truth texts to either the length of the ground-truth text (ROUGE-L Recall) or the length of the generated text (ROUGE-L Precision); we employ the harmonic mean of the two (ROUGE-L F-measure). For the implementations of BLEU and ROUGE-L, we used sacrebleu (https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/sacrebleu.py) and MSCOCO (https://github.com/salaniz/pycocoevalcap/tree/master/rouge), respectively. To evaluate clinical correctness, following the work of [26], we used the CheXpert labeler [13] to extract labels from both the ground-truth and the system-generated diagnostic texts. Clinical precision (CP) is then the average ratio of the number of labels shared between the ground-truth and system-generated texts to the number of labels of the latter. Similarly, clinical recall (CR) is the average ratio of the number of shared labels to the number of labels of the former.
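Given the label sets extracted by the CheXpert labeler from the two texts of each exam, CP and CR can be sketched as follows.

def clinical_precision_recall(gold_label_sets, generated_label_sets):
    """Both arguments: one set of CheXpert labels per exam, in parallel."""
    precisions, recalls = [], []
    for gold, gen in zip(gold_label_sets, generated_label_sets):
        shared = gold & gen
        precisions.append(len(shared) / len(gen) if gen else 0.0)
        recalls.append(len(shared) / len(gold) if gold else 0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n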
Figure 5: nDCG@k of all methods for the task of ranking radiography exams based on the probability of abnormality, for MIMIC-CXR (a) and IU X-ray (b). We used bootstrapping (1,000 samples of 100 exams each) and report the average value. k varies from 10 to 80, and a moving average with a window of 5 was used. For MIMIC-CXR we observe that RTEX@R and RTEX@T consistently outperform the other methods, while for IU X-ray the winners are RTEX@R and CNN+kNN.

Figure 6: F1 of the diagnostic tagging methods on the top-100 ranked radiography exams. The cases were ranked by RTEX@R based on their abnormality probability, for MIMIC-CXR (a) and IU X-ray (b). We observe that RTEX@T is the winner on both datasets, with CNN+kNN being second best, trailing by up to a factor of two for MIMIC-CXR.
4.3 Results
Next, we present our results with regard to ranking, tagging, and diagnostic captioning, and we assess the overall performance of RTEX.

4.3.1 Ranking. Fig. 5 (a) and (b) depict the performance of the methods in terms of nDCG@k. We used bootstrapping, sampling 100 exams at a time and varying k from 10 to 80 radiography exams. RANDOM is outperformed by all competitors, while RTEX@R is the overall winner on both datasets, with the second best being RTEX@T for MIMIC-CXR and CNN+kNN for IU X-ray. (Similar results were obtained in terms of Precision@k.)

4.3.2 Diagnostic tagging. During this step we assume that the radiography exams are already ranked based on an abnormality probability. Thus, we evaluate the methods with respect to their ability to detect the correct abnormality tags. We report macro F1 (macro-averaging across exams), which is also the standard measure of a recent competition on medical term tagging [36]. As can be seen in Fig. 6, RTEX@T outperforms the two competitors on both datasets, with the second best being CNN+kNN, trailing by up to a factor of two for MIMIC-CXR.
Table 1: BLEU, ROUGE-L, clinical precision (CP), and clinical recall (CR) of the captioning methods, including S&T trained on all exams (S&T@ALL). Our RTEX@X outperforms all other methods in clinical precision and recall. (Cells marked '-' did not survive extraction.)

Dataset     Model     BLEU  ROUGE-L  CP     CR
MIMIC-CXR   S&T@ALL   -     -        -      -
MIMIC-CXR   ETD       6.9   25.5     0.171  0.144
MIMIC-CXR   RTEX@X    5.9   20.5     -      -
IU X-ray    S&T@ALL   -     -        -      -
IU X-ray    ETD       -     -        -      -
IU X-ray    RTEX@X    5.5   20.2     -      -
Diagnostic Captioning . We considered as ground truth , i.e., set G , the correct reports and as predicted captions the system-produced diagnostic texts. Our RTE X @X outperformsall methods in terms of clinical precision and recall. Generativemodels achieve higher word-overlap scores, mainly because theylearn to repeat common phrases that exist in the reports. On theother hand, retrieval methods assign texts that are written fromradiologists, so they have a higher clinical value. When trainingS&T on all exams (S&T@ ALL ), using both normal and abnormalcases, clinical precision and recall decrease in both datasets. Bycontrast, the performance in terms of word-overlap measures (BLEUand ROUGE-L) was slightly improved overall, probably because thedecoder is now better in generating text present in normal reports,which however is also present in abnormal reports (see Fig. 1).
4.3.4 Runtime. As a final benchmark, we calculated the runtime of RTEX on ranking, tagging, and captioning, using 500 randomly selected radiography exams from our IU X-ray test set. Ranking lasted 19.78 seconds. Producing tags and diagnostic texts for the top-100 ranked exams lasted 19.43 seconds; notably, all 100 top-ranked exams in this experiment were abnormal. Note that an experienced radiologist needs 2 minutes on average [33] to report a radiography exam, hence 200 minutes for 100 exams. The experiment was performed on a 32-core server with 256GB RAM and 4 GPUs.
Repeatability. For repeatability purposes, the code for the best performing pipeline of RTEX is available on GitHub (https://github.com/ipavlopoulos/rtex.git).

5 CONCLUSIONS
We introduced a new methodology that can be used for (1) ranking radiography exams based on the probability of containing an abnormality, (2) producing diagnostic tags using abnormal exams for training, and (3) providing diagnostic text produced from both the radiographs and the tags, as a means of explaining the predicted tags. This is an important step for practitioners who want to prioritize cases with abnormalities. Our methodology can further be used to predict abnormality tags and complement them with an automatically suggested explanatory diagnostic text that guides the medical expert. We experimented with two publicly available datasets, showing that our ranking and tagging components outperform two strong competitors and a baseline. Our diagnostic captioning component demonstrates the benefit of employing tags for generating text of higher clinical correctness. We also demonstrated that limiting our training data to only abnormal exams improves the clinical correctness of the automatically provided text. Future directions include further experimentation with data of a larger scale and deployment to hospitals.

REFERENCES
[1] H. J. W. L. Aerts, E. R. Velazquez, R. T. H. Leijenaar, et al. 2014. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nature Communications 5 (2014), 4006.
[2] S. Bai and S. An. 2018. A survey on automatic image caption generation. Neurocomputing 311 (2018), 291–304.
[3] I. M. Baltruschat, H. Nickisch, M. Grass, T. Knopp, and A. Saalbach. 2019. Comparison of deep learning approaches for multi-label chest X-ray classification. Scientific Reports 9, 1 (2019), 1–10.
[4] D. Bluemke, L. Moy, M. A. Bredella, B. B. Ertl-Wagner, K. J. Fowler, V. J. Goh, E. F. Halpern, C. P. Hess, M. L. Schiebler, and C. R. Weiss. 2020. Assessing radiology research on artificial intelligence: A brief guide for authors, reviewers, and readers: From the Radiology editorial board.
[5] A. García Seco de Herrera, C. Eickhoff, V. Andrearczyk, and H. Müller. 2018. Overview of the ImageCLEF 2018 Caption Prediction Tasks. In CLEF CEUR Workshop. Avignon, France.
[6] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald. 2015. Preparing a Collection of Radiology Examinations for Distribution and Retrieval. Journal of the American Medical Informatics Association 23, 2 (2015), 304–310.
[7] C. Eickhoff, I. Schwall, A. García Seco de Herrera, and H. Müller. 2017. Overview of ImageCLEFcaption 2017: the Image Caption Prediction and Concept Extraction Tasks to Understand Biomedical Images. In CLEF CEUR Workshop. Dublin, Ireland.
[8] W. Gale, L. Oakden-Rayner, G. Carneiro, A. P. Bradley, and L. J. Palmer. 2018. Producing Radiologist-Quality Reports for Interpretable Artificial Intelligence. CoRR abs/1806.00340 (2018). arXiv:1806.00340
[9] I. Goodfellow, Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press.
[10] S. Hochreiter and J. Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
[11] M. D. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga. 2019. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR) 51, 6 (2019), 118.
[12] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, 4700–4708.
[13] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. 2019. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. CoRR abs/1901.07031 (2019). arXiv:1901.07031
[14] S. Jaeger, A. Karargyris, S. Candemir, J. Siegelman, L. Folio, S. Antani, and G. Thoma. 2013. Automatic screening for tuberculosis in chest radiographs: a survey. Quantitative Imaging in Medicine and Surgery 3, 2 (2013), 89.
[15] B. Jing, P. Xie, and E. Xing. 2018. On the Automatic Generation of Medical Imaging Reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 2577–2586.
[16] A. E. W. Johnson, T. J. Pollard, S. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-Y. Deng, R. G. Mark, and S. Horng. 2019. MIMIC-CXR: A large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019).
[17] Md. Karim, T. Döhmen, D. Rebholz-Schuhmann, S. Decker, M. Cochez, and O. Beyan. 2020. DeepCOVIDExplainer: Explainable COVID-19 predictions based on chest X-ray images. arXiv preprint arXiv:2004.04582 (2020).
[18] D. P. Kingma and J. Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 (2014).
[19] B. Kong, Y. Zhan, M. Shin, T. Denny, and S. Zhang. 2016. Recognizing end-diastole and end-systole frames via deep temporal regression network. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, Athens, Greece, 264–272.
[20] V. Kougia, J. Pavlopoulos, and I. Androutsopoulos. 2019. AUEB NLP Group at ImageCLEFmed Caption 2019. In CLEF 2019 Working Notes, CEUR Workshop Proceedings. Lugano, Switzerland, 09–12.
[21] V. Kougia, J. Pavlopoulos, and I. Androutsopoulos. 2019. A Survey on Biomedical Image Captioning. In Workshop on Shortcomings in Vision and Language of the Annual Conference of the North American Chapter of the ACL. Minneapolis, MN, USA, 26–36.
[22] E. A. Krupinski. 2010. Current perspectives in medical image perception. Attention, Perception, & Psychophysics 72, 5 (2010), 1205–1217.
[23] Y. Li, X. Liang, Z. Hu, and E. Xing. 2018. Hybrid retrieval-generation reinforced agent for medical image report generation. In Advances in Neural Information Processing Systems (NIPS). Montreal, Canada, 1530–1540.
[24] Y. Li, X. Liang, Z. Hu, and E. Xing. 2019. Knowledge-driven encode, retrieve, paraphrase for medical image report generation. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, Honolulu, HI, USA, 6666–6673.
[25] C.-Y. Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Workshop on Text Summarization Branches Out of the Annual Conference of the Association for Computational Linguistics. Barcelona, Spain, 74–81.
[26] G. Liu, T.-M. H. Hsu, M. McDermott, W. Boag, W.-H. Weng, P. Szolovits, and M. Ghassemi. 2019. Clinically Accurate Chest X-Ray Report Generation. CoRR abs/1904.02633 (2019). arXiv:1904.02633
[27] X. Liu, H. R. Tizhoosh, and J. Kofman. 2016. Generating binary tags for fast medical image retrieval based on convolutional nets and Radon transform. In International Joint Conference on Neural Networks (IJCNN). IEEE, Vancouver, British Columbia, Canada, 2872–2878.
[28] X. Liu, Q. Xu, and N. Wang. 2019. A survey on deep neural network-based image captioning. The Visual Computer 35, 3 (2019), 445–470.
[29] J. Lu, C. Xiong, D. Parikh, and R. Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR. Honolulu, HI, USA, 375–383.
[30] M. M. A. Monshi, J. Poon, and V. Chung. 2020. Deep learning in generating radiology reports: A survey. Artificial Intelligence in Medicine (2020), 101878.
[31] J. G. Mork, A. Jimeno-Yepes, and A. R. Aronson. 2013. The NLM Medical Text Indexer System for Indexing Biomedical Literature. In BioASQ Workshop. Valencia, Spain.
[32] Royal College of Radiologists. 2016. Clinical radiology UK workforce census 2015 report.
[33] Royal College of Radiologists. 2019. Clinical radiology UK workforce census 2019 report.
[34] L. G. Oliveira, S. A. e Silva, L. H. V. Ribeiro, M. R. de Oliveira, C. J. Coelho, and A. L. Andrade. 2008. Computer-aided diagnosis in chest radiography for detection of childhood pneumonia. International Journal of Medical Informatics 77, 8 (2008), 555–564.
[35] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL. Philadelphia, PA, USA, 311–318.
[36] O. Pelka, C. M. Friedrich, A. G. S. de Herrera, and H. Müller. 2019. Overview of the ImageCLEFmed 2019 concept detection task. CLEF Working Notes, CEUR (2019).
[37] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, et al. 2017. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-rays with Deep Learning. arXiv (2017).
[38] H. R. Roth, L. Lu, J. Liu, J. Yao, A. Seff, K. Cherry, L. Kim, and R. M. Summers. 2015. Improving Computer-Aided Detection Using Convolutional Neural Networks and Random View Aggregation. IEEE Transactions on Medical Imaging 35, 5 (2015), 1170–1181.
[39] K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[40] L. Soldaini and N. Goharian. 2016. QuickUMLS: a Fast, Unsupervised Approach for Medical Concept Extraction. In Medical Information Retrieval (MedIR) Workshop, ACM SIGIR. Pisa, Italy.
[41] P. Suetens. 2009. Fundamentals of Medical Imaging. Cambridge University Press. https://books.google.gr/books?id=iUHgx5E4zLMC
[42] A. Taguchi. 2010. Triage screening for osteoporosis in dental clinics using panoramic radiographs. Oral Diseases 16, 4 (2010), 316–327.
[43] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015. Show and Tell: A Neural Image Caption Generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA, 3156–3164.
[44] X. Wang, Y. Peng, L. Lu, Z. Lu, and R. M. Summers. 2018. TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays. In CVPR. Salt Lake City, UT, USA, 9049–9058.
[45] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML. Lille, France, 2048–2057.
[46] Z. Zhang, Y. Xie, F. Xing, M. McGough, and L. Yang. 2017. MDNet: A Semantically and Visually Interpretable Medical Image Diagnosis Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA.