Addressing Data Bias Problems for Chest X-ray Image Report Generation
Philipp Harzig, Yan-Ying Chen, Francine Chen, Rainer Lienhart
Philipp Harzig [email protected] Yan-Ying Chen [email protected] Francine Chen [email protected] Rainer Lienhart [email protected] Multimedia Computing and ComputerVision LabUniversity of AugsburgAugsburg, Germany FX Palo Alto Laboratory3174 Porter Drive,Palo Alto, CA, USA
Abstract
Automatic medical report generation from chest X-ray images is one possibility for assisting doctors to reduce their workload. However, the different patterns and data distribution of normal and abnormal cases can bias machine learning models. Previous attempts did not focus on isolating the generation of the abnormal and normal sentences in order to increase the variability of generated paragraphs. To address this, we propose to separate abnormal and normal sentence generation by using a dual word LSTM in a hierarchical LSTM model. In addition, we conduct an analysis on the distinctiveness of generated sentences compared to the BLEU score, which increases when less distinct reports are generated. Together with this analysis, we propose a way of selecting a model that generates more distinctive sentences. We hope our findings will help to encourage the development of new metrics to better verify methods of automatic medical report generation.
Deep Convolutional Neural Networks in combination with Recurrent Neural Networks are a common architecture used to automatically generate descriptions of images. These recent advances have not left other areas such as medical research untouched. Demner-Fushman et al. [1] released an anonymized dataset which contains 7,470 chest X-ray images associated with doctors' reports and tag information specifying medical attributes. However, annotating these domain-specific datasets requires expert knowledge and cannot be achieved as cost-efficiently as for more common datasets. In addition, medical data is connected with high privacy concerns and also regulated, e.g., by the Health Insurance Portability and Accountability Act (HIPAA). Therefore, only a limited amount of data is publicly available. In particular, there is only one public dataset [1] that connects chest X-ray images with medical reports. In this dataset, there are far more sentences describing normalities than abnormalities. Thus, most machine
UID: CXR1001
Impression: Diffuse fibrosis. No visible focal acute disease.
Indication: dyspnea, subjective fevers, arthritis, immigrant from Bangladesh
Findings: Interstitial markings are diffusely prominent throughout both lungs. Heart size is normal. Pulmonary XXXX normal.
Problems: Markings; Fibrosis
MeSH: Markings/lung/bilateral/interstitial/diffuse/prominent; Fibrosis/diffuse
Figure 1: An example from the IU chest X-ray dataset, which shows an abnormal case with findings. We highlight the sentences with our human abnormality annotation, i.e., normal sentences are written in blue and abnormal sentences in green.

learning models are biased to generate normal results with higher probability than abnormal results. However, abnormalities are more important and more difficult to detect given the small number of examples. In this work, we address this issue with a new architecture, which can distinguish between generating abnormal or normal sentences.

Furthermore, common machine translation metrics such as BLEU [14] may not be the best choice, when even one word - such as 'no' - contained in a paragraph can make a huge difference for the indication and findings. Also, calculating these metrics over an imbalanced dataset raises the issue that sentences about normal cases are over-weighted, resulting in less diversity in the generated reports. We examine these issues of common machine translation metrics when used on a dataset of medical reports such as in our work.

Our contributions: (1) We annotate each sentence of a public dataset with abnormal labels and (2) use these labels to train a new hierarchical LSTM with dual word LSTMs combined with an abnormal sentence predictor to reduce the data bias. (3) We analyze the correlation between machine translation metrics and the variability in generated reports and find that a high score calculated over a dataset does not necessarily imply a reliable result.
In the field of combining computer vision and machine learning with medical chest X-ray images, Wang et al. [18] published the large ChestX-ray14 dataset, which includes a collection of over 100,000 chest X-rays annotated with 14 common thorax diseases. This dataset has been widely used [10, 15, 19] for predicting and localizing thorax diseases. The disease labels of this dataset were automatically extracted from the doctors' reports. However, the doctors' reports are not available publicly. Demner-Fushman et al. [1] are the first to release a rather large anonymized dataset consisting of chest X-rays paired with doctors' reports, indications and manually annotated disease labels. We use this dataset in our work.

Automatically generating captions from images is a well-researched topic. Nowadays, most architectures use an encoder-decoder structure, where a Convolutional Neural Network (CNN) is used to encode images into a semantic representation and a Recurrent Neural Network (RNN) decoder generates the most likely sentence given this image representation, e.g., [4, 5, 17]. Krause et al. [7] extended this work by introducing a hierarchical LSTM structure to generate longer sequences for describing an image with a paragraph.

rank  f    sentence
1     947  no acute cardiopulmonary abnormality.
2     698  the lungs are clear.
3     523  no pneumothorax.
4     451  lungs are clear.
5     394  no acute cardiopulmonary findings.
...   ...  ...
8018  1    mild right basilar airspace consolidation may ...
8019  1    calcified granuloma is seen in the left medial ...
8020  1    old rib fractures healed.
8021  1    negative for pneumothorax pneumomediastinum or ...
8022  1    it is unchanged compared to a ... for the abdomen ...

Table 1: Distinct sentences sorted top-down by their number of appearances f.

Jing et al. [3] used a hierarchical LSTM to generate a doctor's report with multiple sentences, and use a co-attention mechanism that attends to visual and semantic features, which are generated by medical tags annotated within the Indiana University chest X-ray collection [1]. Li et al. [8] describe a hybrid reinforced agent that decides, during the process of creating every single sentence, if it should be retrieved from a template library or generated in a hierarchical fashion. Instead of a hierarchical model, Xue et al. [20] use a bidirectional LSTM to encode semantic information of the previously generated sentence as guidance for an attention mechanism to generate an attentive context vector for the current sentence. Wang et al. [19] presented a joint framework, which simultaneously predicts one of 14 diseases and generates a report on the ChestX-ray14 dataset. However, the textual annotations are not available to the public as of yet. They use a single LSTM that produces a report conditioned on the previous hidden state, the previously generated words and image features extracted by a CNN. In a more recent work, Li et al.
[9] use a graph transformer to decompose visual features into an abnormality graph, which is decoded as a template sequence and paraphrased into a generated report.

Our work is based on a hierarchical LSTM structure [3, 7] and introduces an abnormal sentence predictor in combination with a dual word LSTM for separately generating abnormal and normal sentences. In addition, we do not use any templates for sentence generation, in contrast to [8, 9].

For our work, we use the Indiana University chest X-ray Collection [1] (IU chest X-ray dataset), which contains 7,470 chest X-ray images with multiple annotations. These include indication, findings and impressions in textual form, as well as MTI (Medical Text Indexer) encodings. The MTI encodings are automatically extracted keywords from the indication and findings. We identify 121 unique MTI labels in the dataset and use these labels as an additional training signal. Additionally, the authors manually annotated the images with MEDLINE® Medical Subject Headings® (MeSH®). To summarize, this public dataset contains 3,955 narrative reports, each associated with MeSH tags and two views of the chest, i.e., a posteroanterior (PA) and a lateral view. We set the doctor's report to be the concatenation of the impression and findings, similarly to other works [3, 19]. We show one example from this dataset in Figure 1.

It is very difficult for a machine learning model to properly learn the task of generating full paragraphs of doctors' reports from this small number of examples. In particular, we notice that most of the reports consist of repeating and very similar sentences, which are of descriptive nature and do not describe abnormalities and diseases. In Table 1, we list the frequency f of distinct sentences within the doctors' reports, i.e., all sentences that appear at least once in the dataset, sorted top-down with the most frequent sentences listed on top. We notice a long-tail distribution, with abnormal sentences often only occurring with a frequency of f < 3.

Figure 2: Our dual word LSTM model. The green boxes show the CNN image encoder, the image embedding and the MTI tag prediction. The blue part depicts the sentence LSTM, i.e., the topic generator t_m and the stop prediction φ_m. The red part shows an abnormal sentence predictor (τ_m) and two word LSTMs for generating abnormal and normal sentences, respectively.

Machine learning models produce a probability distribution; thus, we always get the most probable doctor's report given the input image. However, most of the images in the dataset depict normal cases, and it is difficult to generate accurate reports for abnormal cases. Considering that identifying abnormalities and diseases is the most crucial part in this problem domain, we want to address the data bias problems for chest X-ray image report generation.
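The long-tail statistics in Table 1 can be reproduced with a few lines of Python. This is a minimal sketch assuming reports are plain period-delimited strings; the paper's actual preprocessing may differ:

```python
from collections import Counter

def sentence_frequencies(reports):
    """Count how often each distinct sentence appears across all reports."""
    counter = Counter()
    for report in reports:
        # Naive sentence split; the dataset's reports are period-delimited.
        for sentence in report.lower().split("."):
            sentence = sentence.strip()
            if sentence:
                counter[sentence] += 1
    return counter

# Toy reports illustrating the frequent-head / long-tail shape of Table 1.
reports = [
    "No acute cardiopulmonary abnormality. The lungs are clear.",
    "The lungs are clear. No pneumothorax.",
    "Old rib fractures healed.",
]
freqs = sentence_frequencies(reports)
top = freqs.most_common(3)
```

Sorting by `most_common` directly yields the rank/frequency view of Table 1, with abnormal sentences accumulating in the f = 1 tail.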
We depict our model architecture in Figure 2. The input to our model are single images, i.e., either the lateral view or the PA view of a chest X-ray image. We use the res4b35 feature map of ResNet-152 [2] as our image features ṽ. An embedding layer maps these image features into a lower-dimensional space v_e for further use. It is reshaped into a feature map of shape R^(196×512), enabling a soft attention mechanism to attend to 196 different spatial locations. Unless otherwise noted, all embedding and hidden dimensions are set to 512 in our model.

Even though LSTMs were designed to combat the issue of forgetting long-term dependencies, they still have problems keeping information for very long time periods, e.g., over multiple sentences. Krause et al. [7] address this problem by splitting the generation into a hierarchical LSTM, which consists of two independent LSTMs. The sentence LSTM's sole purpose is to generate topic vectors, which in turn are used for the initialization of the word LSTM. The word LSTM then generates a single sentence conditioned on the topic vector.
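As a toy illustration of the reshape from a spatial feature map to a list of attendable locations (dimensions shrunk from 14×14×512 to 2×2×3 for readability; this is not the actual implementation):

```python
def flatten_feature_map(feature_map):
    """Reshape an H x W x C feature map (nested lists) into a list of
    H*W spatial locations, each a C-dimensional vector."""
    height, width = len(feature_map), len(feature_map[0])
    return [feature_map[i][j] for i in range(height) for j in range(width)]

# Toy 2x2 map with 3 channels standing in for the 14x14x512 embedded map.
fmap = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
        [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]]
locations = flatten_feature_map(fmap)
```

With the paper's dimensions, the same operation turns the 14×14 grid into the 196 spatial locations the attention mechanism selects over.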
Jing et al. [3] extend the hierarchical LSTM for generating medical reports from chest X-ray images. We also use a hierarchical concept, with an architecture that differs from Jing et al. [3]. For example, we add a multi-task learning objective on MTI tags (see Section 4.3) and do not use the co-attention mechanism in our model.
Sentence LSTM

We initialize the sentence LSTM on image features extracted by the encoder CNN. However, we use a soft attention mechanism v_m = f_att(v_e, h^sent_(m−1)) to attend to different spatial areas within the feature map, conditioned on the sentence LSTM's hidden state h^sent_(m−1) of the preceding sentence. In subsequent sentences, we use the corresponding preceding hidden state, which we depict by the dotted arrow in Figure 2. In order to generate the topic vector for sentence m, we apply the sentence LSTM to the attentive image features v_m to get an intermediate hidden state h^sent_m for the current sentence and feed it through a fully-connected layer

t_m = ReLU(W_sent h^sent_m),    (1)

to generate a topic vector, where W_sent projects the hidden state to the word embedding dimension.
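A minimal sketch of the attention and topic-vector computation on plain Python lists. The dot-product scoring and the identity projection stand in for the learned layers f_att and W_sent, which are assumptions here, not the paper's exact parameterization:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(locations, prev_hidden):
    """Weight each spatial feature by its relevance to the previous
    sentence-LSTM hidden state and return the attentive context v_m."""
    scores = [sum(f * h for f, h in zip(loc, prev_hidden)) for loc in locations]
    weights = softmax(scores)
    dim = len(locations[0])
    context = [sum(w * loc[d] for w, loc in zip(weights, locations))
               for d in range(dim)]
    return context, weights

def topic_vector(hidden):
    """t_m = ReLU(W_sent h_m^sent); W_sent is taken as identity here."""
    return [max(0.0, h) for h in hidden]
```

In the full model the attended context feeds the sentence LSTM, whose hidden state is projected to a topic vector that initializes the word LSTM.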
Stop Prediction

We also use the sentence LSTM's current and previous hidden state to predict whether we should continue generating sentences (z_m = 0) or stop generating them (z_m = 1). The stop prediction φ_m is a fully-connected layer

φ_m = W_stop tanh(W_stop,m−1 h^sent_(m−1) + W_stop,m h^sent_m),    (2)

where W_stop, W_stop,m−1 and W_stop,m are parameter matrices. We train the stop prediction with a sigmoid cross-entropy loss L_stop = −∑_(m=0)^(M−1) [z_m · log(σ(φ_m)) + (1 − z_m) · log(1 − σ(φ_m))], where σ is the sigmoid function and M is the number of sentences in the current paragraph.
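The stop loss L_stop can be written out directly; a small numeric sketch in pure Python (not the actual training code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stop_loss(logits, labels):
    """Sigmoid cross-entropy over the stop logits phi_m;
    z_m = 1 means generation should stop after sentence m."""
    loss = 0.0
    for phi, z in zip(logits, labels):
        p = sigmoid(phi)
        loss -= z * math.log(p) + (1 - z) * math.log(1 - p)
    return loss
```

At inference, one would stop emitting sentences once σ(φ_m) crosses a threshold such as 0.5 (the exact threshold is an assumption here).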
Dual Word LSTMs

A word LSTM is trained to maximize the probability of predicting the ground-truth word w_(m,t) at timestep t of sentence m. The hierarchical LSTM softmax cross-entropy loss is then defined by

L_hierarchical = −∑_(m=0)^(M−1) ∑_(t=0)^(N_m−1) w_(m,t) log(o^word_(m,t)),    (3)

where N_m is the number of words in sentence m and o^word_(m,t) is the output of the word LSTM at timestep t of sentence m. The input to timestep t is the embedded ground-truth word W_e w_(m,t−1), where W_e is the word embedding matrix.

Depending on whether the current sentence is of type abnormal or normal, we train a different set of word LSTM parameters. In other words, we have an abnormal word LSTM and a normal word LSTM, which are trained when the label of the current sentence is abnormal and normal, respectively. In practice, we set the loss weights for the current sentence to 1 in the abnormal word LSTM and to 0 in the normal word LSTM; in the case of a normal sentence, we set the loss weights inversely. During the inference phase, we use the prediction of the abnormal sentence prediction module (see Section 4.2) to decide whether we want to use the generated sentence from the abnormal word LSTM or the normal word LSTM. We then concatenate sentences from both the abnormal word LSTM and the normal word LSTM into our final paragraph.
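The loss-weight routing between the two word LSTMs reduces to masking per-sentence losses. A sketch under the assumption that the per-sentence word-LSTM losses have already been computed:

```python
def dual_word_loss(sentence_losses, is_abnormal):
    """Route each sentence's word-LSTM loss to the abnormal or normal
    branch: loss weight 1 for the matching branch, 0 for the other.
    The hierarchical loss is the sum of both branch losses."""
    abnormal = sum(l for l, a in zip(sentence_losses, is_abnormal) if a)
    normal = sum(l for l, a in zip(sentence_losses, is_abnormal) if not a)
    return abnormal + normal, abnormal, normal
```

Because each sentence contributes to exactly one branch, the two LSTMs specialize on their respective sentence type while the total loss stays comparable to the single-LSTM case.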
As we already argued in Section 3, the dataset consists of many distinct normal sentences, but only few different sentences exist that describe abnormalities. We integrate an abnormality prediction module, which tries to infer whether the semantic meaning of topic vector t_m describes an abnormality or not. We use a fully-connected layer τ_m with one output neuron to predict the probability for a sentence to be abnormal or not. We train the fully-connected layer with a sigmoid cross-entropy loss L_abnormal.

We manually annotated the IU chest X-ray dataset for every sentence within the ground-truth paragraph of every sample in the training dataset. Two annotators labeled whether a sentence is an abnormal case or not with the help of the provided MeSH tags. In addition, we also implemented a method for automatically annotating the sentences by comparing word embedding distances against MeSH tag embeddings, although we use manual annotations for training. We use Word2Vec [11, 12] embeddings trained on PubMed and Wikipedia [13], which can reduce human effort when the dataset is scaled up.

We use the global average pool of the image embedding v̂ = avg_pool(v_e) for predicting the MTI annotations. As is common in multi-label classification, we use the sigmoid cross-entropy loss function L_MTI appended to a fully-connected layer with one output neuron for every distinct MTI label.

For our experiments, we optimize the total loss

L_total = λ_stop · L_stop + λ_hierarchical · L_hierarchical + λ_abnormal · L_abnormal + λ_MTI · L_MTI,    (4)

where λ_(·) are the weighting factors for each loss. λ_MTI is set to 10, and λ_hierarchical, λ_stop and λ_abnormal are set to 1. We set λ_abnormal to 0 for experiments in which we disable the dual LSTM approach, i.e., we only use a single word LSTM similar to Jing et al. [3]. In this case, we also calculate L_hierarchical with only one word LSTM.
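The embedding-based auto-annotation mentioned above could be sketched as follows; the cosine threshold and the any-match rule are assumptions, not the paper's exact procedure:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def auto_label_abnormal(sentence_vecs, mesh_vecs, threshold=0.5):
    """Label a sentence abnormal if any of its word embeddings is close
    to any MeSH-tag embedding (cosine similarity above a threshold)."""
    return any(cosine(w, m) >= threshold
               for w in sentence_vecs for m in mesh_vecs)
```

With pre-trained biomedical Word2Vec vectors, such a rule can pre-label new data and leave only borderline sentences for human review.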
When using the abnormal and normal word LSTMs, L_hierarchical is the sum of the two individual word LSTM losses, i.e., o^word_(m,t) from Equation 3 is the output of the abnormal or normal word LSTM, depending on whether the ground-truth annotation is abnormal or normal.

We train the image embedding layer with both the hierarchical LSTM and the MTI predictor, so the captioning task can benefit from our multi-task loss function. We use the Adam [6] optimizer with a fixed base learning rate η and do not use learning rate decay. We train for up to 250 epochs and use a batch size of 16.

In the following, we present an evaluation study and results generated by our hierarchical models with dual word LSTMs, HLSTM+Dual and HLSTM+att+Dual. We choose to compare against the CNN-RNN [17] baseline, which we trained ourselves on our train split. We also compared our model against the scores reported in CoAtt [3] and KERP [9]. These models were pretrained on a non-public dataset of chest X-ray images with Chinese reports, which was collected by a professional medical institution for health checking [8]. KERP [9] uses templates, which differs from end-to-end generation approaches. However, since these methods were evaluated on a different dataset split, the scores are not directly comparable to ours. We therefore implemented hierarchical LSTM baselines similar
Model            B-1          B-2          B-3          B-4          Cider        Meteor       Rouge-L
CNN-RNN [17]     31.9 (33.3)  19.8 (20.5)  13.3 (13.6)  9.4 (9.4)    29.1 (30.6)  13.5 (14.5)  26.8 (27.2)
CoAtt* [3]       — (45.5)     — (28.8)     — (20.5)     — (15.4)     — (27.7)     — (—)        — (36.9)
KERP* [9]        — (48.2)     — (32.5)     — (22.6)     — (16.2)     — (28.0)     — (—)        — (33.9)
HLSTM            36.4 (—)     23.2 (23.8)  16.1 (16.3)  11.4 (11.4)  29.1 (29.3)  15.5 (15.7)  30.6 (30.2)
HLSTM+att        35.1 (36.6)  22.8 (23.4)  16.1 (16.4)  11.6 (11.7)  34.3 (32.3)  14.9 (15.6)  29.7 (29.9)
HLSTM+Dual       35.2 (35.8)  22.8 (23.1)  15.9 (16.0)  11.3 (11.2)  34.8 (32.2)  14.6 (15.1)  29.5 (29.6)
HLSTM+att+Dual   35.7 (37.3)  23.3 (—)     16.5 (—)     11.8 (—)     34.0 (—)     15.6 (—)     31.3 (—)

Table 2: Results (in %) on the validation and (test) set calculated with common machine translation metrics. B-n stands for BLEU-n, which uses up to n-grams. We selected the model configuration and hyperparameters based on the validation set. HLSTM / HLSTM+att are our hierarchical LSTM implementations similar to [3, 7], and are evaluated on our dataset splits. ...+Dual are our models. *Scores taken from [9], who used a different dataset split.

to [7], which were referred to in CoAtt [3]. These hierarchical LSTM baselines with and without an attention mechanism are named HLSTM and HLSTM+att in our paper, and are evaluated on our dataset split. As we do not have access to the CX-CHR dataset from [8], we did not employ any pretraining of the feature extractor network.
We choose our dataset split by randomly shuffling the dataset and splitting it into a train, validation and test set with a ratio of 0.9, 0.05 and 0.05, respectively. We make sure that images of an individual patient are only present in either one of train, validation or test set. We use the validation set for the selection of hyperparameters and architectural decisions. In practice, we select the best model checkpoint based on two criteria. First, we calculate metrics such as BLEU-n twice per training epoch. Second, we also calculate the number of distinct sentences generated over the whole validation dataset for each sentence index m within a paragraph. We choose our final models by calculating these criteria over the validation set. We show paragraphs generated by HLSTM and HLSTM+Dual in Figure 4.
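The patient-level split described above can be sketched as follows; the field names and the exact shuffling are illustrative assumptions:

```python
import random

def patient_level_split(samples, seed=0):
    """Split 0.9/0.05/0.05 on *patients*, so all images of one patient
    land in exactly one of train/val/test."""
    patients = sorted({s["patient_id"] for s in samples})
    random.Random(seed).shuffle(patients)
    n = len(patients)
    n_train, n_val = int(0.9 * n), int(0.05 * n)
    train_p = set(patients[:n_train])
    val_p = set(patients[n_train:n_train + n_val])
    split = {"train": [], "val": [], "test": []}
    for s in samples:
        if s["patient_id"] in train_p:
            split["train"].append(s)
        elif s["patient_id"] in val_p:
            split["val"].append(s)
        else:
            split["test"].append(s)
    return split
```

Splitting on patients rather than images avoids leakage, since the two views (PA and lateral) of one patient would otherwise end up in different sets.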
We observed a severe disadvantage in solely using scores such as BLEU-n as the evaluation criterion. As mentioned before, we calculated the number of distinct sentences per sentence index m within a generated paragraph for each validation point. We noticed that high scores do not necessarily imply a high variability in generated sentences. Most notably, the highest scores can sometimes be observed when there are only 1 or 2 distinct sentences per sentence index, resulting in very few different paragraphs. In Figure 3, we show the number of distinct sentences for sentence indices m = 0 and m = 1. The number of distinct sentences generated by HLSTM+att stays in the same limited range over the course of training. For example, it has a higher score at training iteration 5838 (visualized by the vertical black line), even though it generates the very same paragraph for every sample in the validation set, than at training iteration 73809 (visualized by
Figure 3: The number of distinct sentences for sentence indices m = 0 and m = 1, together with the BLEU score on the validation set over the course of training, for HLSTM+att and HLSTM+att+Dual. The solid line represents the training iteration with the maximum BLEU-4 score and the dashed line our selected model.
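The selection rule, preferring high BLEU but only among checkpoints that clear a minimum distinct-sentence threshold, might look like this; the threshold value, the fallback, and the checkpoint fields are assumptions:

```python
def select_checkpoint(checkpoints, min_distinct=10):
    """Pick the checkpoint with the best BLEU among those whose generated
    paragraphs contain at least `min_distinct` distinct first sentences;
    fall back to the most distinct checkpoint if none qualifies."""
    eligible = [c for c in checkpoints if c["distinct"] >= min_distinct]
    if not eligible:
        return max(checkpoints, key=lambda c: c["distinct"])
    return max(eligible, key=lambda c: c["bleu"])

# Toy checkpoints mirroring the two iterations discussed in the text.
ckpts = [
    {"step": 5838, "bleu": 0.37, "distinct": 1},    # one paragraph for all images
    {"step": 73809, "bleu": 0.34, "distinct": 42},  # lower score, more variety
]
best = select_checkpoint(ckpts)
```

Under this rule the degenerate high-BLEU checkpoint is rejected in favor of the more varied one.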
Table 3: Number of distinct sentences in the ground-truth (GT) and generated on the validation split of the dataset, per sentence index m ∈ [0, 1] within the generated paragraph.

the dashed line). For the model HLSTM+att+Dual, we see that the score drops as more distinct sentences are generated. For this model, we also see that there is much more variability of sentences from the beginning on, and far more distinct sentences are generated in contrast to only using a single word LSTM. If we look at the number of distinct sentences generated per sentence index m in our chosen models compared to the ground-truth in Table 3, we still see a huge gap. Note that paragraphs with mostly one distinct sentence per sentence index do not have additional benefit, since they are not dependent on the input image. Considering that many sentences within the ground-truth only differ slightly but have a synonymous meaning, we find that results which do not necessarily have the maximum score but a higher variability in generated paragraphs describe the input images in a better way. Thus, we also use a minimum threshold of distinct sentences as one stopping criterion.

The test scores of our models and the baselines are presented in Table 2 (in brackets). Over all evaluation metrics, our HLSTM+att+Dual model has the most improvement on Cider [16], which is designed for evaluating image descriptions, uses human consensus and considers the TF-IDF for weighting n-grams. This implies that our HLSTM+att+Dual model can catch more distinct n-grams in the reference paragraph. In addition, our HLSTM+att+Dual model is consistently better than other baselines in multi-gram BLEU, Meteor and Rouge-L, indicating that relevance is not sacrificed while distinctiveness is increased.
Model            B-1          B-2          B-3          B-4          Cider        Meteor       Rouge-L
HLSTM+att        30.9 (44.4)  19.0 (30.1)  12.9 (21.8)  9.1 (15.8)   25.9 (42.6)  12.8 (22.2)  25.0 (38.6)
HLSTM            32.3 (43.5)  19.4 (29.7)  12.8 (21.3)  8.8 (15.3)   24.6 (37.1)  13.2 (21.7)  25.8 (38.2)
HLSTM+Dual       — (41.2)     — (28.1)     — (20.0)     — (13.9)     — (31.8)     13.2 (19.5)  26.1 (36.0)
HLSTM+att+Dual   31.8 (—)     19.8 (—)     13.5 (—)     9.7 (—)      28.4 (—)     — (—)        — (—)

Table 4: Final results (in %) on the held-out test set of the dataset for ABNORMAL and (NORMAL) images only.
Generated caption (HLSTM): exam quality limited by hypoinflation and rotation. the heart is normal in size. the lungs are clear. no focal consolidation suspicious pulmonary opacity large pleural effusion or pneumothorax is identified. no pneumothorax. no acute bony abnormalities. no pleural effusion

Generated caption (HLSTM+Dual): technically limited exam. basilar probable pulmonary fibrosis and scarring. the heart is mildly enlarged. there are low lung volumes with bronchovascular crowding. there is

Ground truth: Stable enlarged cardiomediastinal silhouette. Tortuous aorta. Low lung volumes and left basilar bandlike opacities suggestive of scarring or atelectasis. No overt edema. Question small right pleural effusion versus pleural thickening. No visible pneumothorax.
Figure 4: Examples of generated paragraphs with our model HLSTM+Dual vs. HLSTM, in comparison with the ground-truth paragraph.

In addition, we also compared our models with the dual word LSTM from Section 4.2 against the vanilla HLSTM model inspired by Jing et al. [3]. As we already mentioned in Section 5.2, the number of distinct sentences per sentence index starts to grow more rapidly when using two word LSTMs, which can be seen in the right part of Figure 3 when comparing it to the HLSTM+att model on the left. We can also see that generating more distinct sentences does not account for better scores. However, when looking at the validation and test scores in Table 2, the dual word LSTM models often have higher scores than the single word LSTM models.

In Table 4, we report scores on the held-out test set for abnormal as well as normal (in brackets) images. The best-performing model for both normal and abnormal images was one of our dual models. The results also indicate that the performance is best on normal images, and so effort should be given to further improve performance on abnormal images.
In our work, we presented a hierarchical LSTM architecture expanded by a dual word LSTM. Paired with an abnormality prediction module, we introduced dual word LSTMs, which are responsible for generating abnormal and normal sentences, respectively.

We then examined the correlation between the BLEU-n metrics and the number of distinct sentences generated by our model and observed that common evaluation metrics such as BLEU-4 are not necessarily a good evaluation criterion for multi-sentence medical reports, i.e., for one of our models the highest score was produced by generating the same paragraph for every input image. In addition, note that the dual word LSTM can help to increase the number of distinct sentences faster when selecting a corresponding stopping criterion.

In the future, we want to focus on working on a metric more suitable for the critical area of medical report generation from images, and on addressing abnormal indications and findings, since their performance is worse than that of normal indications and findings.
This work was done during Philipp Harzig's internship at FX Palo Alto Laboratory. He thanks his colleagues from FXPAL for the collaboration, advice and for providing an open and inspiring research environment. We also thank Eric Rosenberg for helping annotate the ground-truth sentences with abnormal/normal labels.
References

[1] Dina Demner-Fushman, Marc D. Kohli, Marc B. Rosenman, Sonya E. Shooshan, Laritza Rodriguez, Sameer Antani, George R. Thoma, and Clement J. McDonald. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2):304–310, 2015.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[3] Baoyu Jing, Pengtao Xie, and Eric Xing. On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2577–2586. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/P18-1240.

[4] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016.

[5] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

[6] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[7] Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. A hierarchical approach for generating descriptive image paragraphs. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 3337–3345. IEEE, 2017.

[8] Christy Y. Li, Xiaodan Liang, Zhiting Hu, and Eric P. Xing. Hybrid retrieval-generation reinforced agent for medical image report generation. arXiv preprint arXiv:1805.08298, 2018.

[9] Christy Y. Li, Xiaodan Liang, Zhiting Hu, and Eric P. Xing. Knowledge-driven encode, retrieve, paraphrase for medical image report generation. In AAAI Conference on Artificial Intelligence, 2019.

[10] Zhe Li, Chong Wang, Mei Han, Yuan Xue, Wei Wei, Li-Jia Li, and Li Fei-Fei. Thoracic disease identification and localization with limited supervision. arXiv preprint arXiv:1711.06373, 2017.

[11] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR, 2013.

[12] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[13] SPFGH Moen, Tapio Salakoski, and Sophia Ananiadou. Distributional semantics resources for biomedical text processing. In Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan, pages 39–43, 2013.

[14] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.

[15] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.

[16] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.

[17] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

[18] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 3462–3471. IEEE, 2017.

[19] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M. Summers. TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9049–9058, 2018.

[20] Yuan Xue, Tao Xu, L. Rodney Long, Zhiyun Xue, Sameer Antani, George R. Thoma, and Xiaolei Huang. Multimodal recurrent model with attention for automated radiology report generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018.