Leveraging Medical Visual Question Answering with Supporting Facts
Tomasz Kornuta, Deepta Rajan, Chaitanya Shivade, Alexis Asseman, Ahmet S. Ozcan
IBM Research AI, Almaden Research Center, San Jose, USA
{tkornut,drajan,cshivade,asozcan}@us.ibm.com, [email protected]
Abstract. In this working notes paper, we describe IBM Research AI (Almaden) team's participation in the ImageCLEF 2019 VQA-Med competition. The challenge consists of four question-answering tasks based on radiology images. The diversity of imaging modalities, organs and disease types combined with a small imbalanced training set made this a highly complex problem. To overcome these difficulties, we implemented a modular pipeline architecture that utilized transfer learning and multi-task learning. Our findings led to the development of a novel model called Supporting Facts Network (SFN). The main idea behind SFN is to cross-utilize information from upstream tasks to improve the accuracy on harder downstream ones. This approach significantly improved the scores achieved on the validation set (18 point improvement in F-1 score). Finally, we submitted four runs to the competition and were ranked seventh.
Keywords:
ImageCLEF 2019 · VQA-Med · Visual Question Answering · Supporting Facts Network · Multi-Task Learning · Transfer Learning
1 Introduction

In the era of data deluge and powerful computing systems, deriving meaningful insights from heterogeneous information has shown to have tremendous value across industries. In particular, the promise of deep learning-based computational models [15] in accurately predicting diseases has further stirred great interest in adopting automated learning systems in healthcare [3]. A daunting challenge within the realm of healthcare is to efficiently sieve through vast amounts of multi-modal information and reason over it to arrive at a differential diagnosis. Longitudinal patient records including time-series measurements, text reports and imaging volumes form the basis for doctors to draw conclusive insights. In practice, radiologists are tasked with reviewing thousands of imaging studies each day, with an average of about three seconds to mark them as anomalous or not, leading to severe eye fatigue [25]. Moreover, clinical workflows have a sequential nature tending to cause delays in triage situations, where the existence of answers to key questions about a patient's holistic condition can potentially expedite treatment. Thus, building effective question-answering systems for the medical domain by bringing in advancements in machine learning research will be a game changer towards improving patient care.

Visual Question Answering (VQA) [17,2] is an exciting new problem domain, where the system is expected to answer questions expressed in natural language by taking into account the content of the image. In this paper, we present results of our research on the VQA-Med 2019 dataset [1]. The main challenge here, in comparison to other recent VQA datasets such as TextVQA [24] or GQA [10], is to deal with scattered, noisy and heavily biased data. Hence, the dataset serves as a great use-case to study challenges encountered in practical clinical scenarios. In order to address the data issues, we designed a new model called Supporting Facts Network (SFN) that efficiently shares knowledge between upstream and downstream tasks through the use of a pre-trained multi-task solver in combination with task-specific solvers. Note that posing the VQA-Med challenge as a multi-task learning problem [5] allowed the model to effectively leverage and encode relevant domain knowledge. Our multi-task SFN model outperforms the single-task baseline by better adapting to label distribution shifts.
2 The VQA-Med 2019 Dataset

The VQA-Med 2019 [1] is a Visual Question Answering (VQA) dataset embedded in the medical domain, with a focus on radiology images. It consists of:
– a training set of 3,200 images with 12,792 Question-Answer (QA) pairs,
– a validation set of 500 images with 2,000 QA pairs, and
– a test set of 500 images with 500 questions (answers were released after the end of the VQA-Med 2019 challenge).

In all splits the samples were divided into four categories, depending on the main task to be solved:
– C1: determine the modality of the image,
– C2: determine the plane of the image,
– C3: identify the organ/anatomy of interest in the image, and
– C4: identify the abnormality in the image.

Our analysis of the dataset (distribution of questions, answers, word vocabularies, categories and image sizes) led to the following findings and system-design related decisions:
– merge the original training and validation sets, shuffle, and re-sample new training and validation sets with a proportion of 19:1,
– use weighted random sampling during batch preparation (a minimal sketch follows this list),
– add a fifth Binary category for samples with Y/N type questions,
– focus on accuracy-related metrics instead of the BLEU score,
– avoid unification and cleansing of labels (answer classes),
– consider C4 as a downstream task and exclude it from the pre-training of the input fusion modules,
– utilize the image size as an additional input cue to the system.
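The weighted random sampling decision above can be sketched with PyTorch's WeightedRandomSampler. The per-sample weighting scheme (inverse answer-class frequency) and the helper function are our assumptions, not necessarily the exact scheme used in the competition runs.

```python
# Minimal sketch: balance batches by weighting each sample inversely to the
# frequency of its answer class (weighting scheme is an assumption).
from collections import Counter
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, answer_labels, batch_size=256):
    """answer_labels: one answer-class id per sample in `dataset`."""
    counts = Counter(answer_labels)
    weights = torch.tensor([1.0 / counts[a] for a in answer_labels], dtype=torch.double)
    sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```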
3 General Architecture of VQA Systems

Typical VQA systems process two types of input, visual (image) and language (question), that need to undergo various transformations to produce the answer. Fig. 1 presents the general architecture of such systems, indicating four major modules: two encoders responsible for encoding the raw inputs into more useful representations, followed by a reasoning module that combines them and, finally, an answer decoder that produces the answer.

Fig. 1. General architecture of Visual Question Answering systems (image encoder, question encoder, reasoning module, answer decoder).
In the early prototypes of VQA systems, reasoning modules were rather simple and relied mainly on multi-modal fusion mechanisms. These fusion techniques varied from concatenation of image and question representations to more complex pooling mechanisms such as Multi-modal Compact Bilinear pooling (MCB) [7] and Multi-modal Low-rank Bilinear pooling (MLB) [12]. Further, diverse attention mechanisms, such as question-driven attention over image features [11], were also used. More recently, researchers have focused on complex multi-step reasoning mechanisms such as Relational Networks [22,6] and Memory, Attention and Composition (MAC) networks [9,18]. Despite that, certain empirical studies indicate that early fusion of language and vision signals significantly boosts the overall performance of VQA systems [16]. Therefore, we focused on finding an "optimal" module for early fusion of multi-modal inputs.
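To make the fusion mechanisms named above concrete, the following sketch shows plain concatenation fusion and an MLB-style low-rank bilinear (Hadamard product) fusion [12]; all dimensions and class names are illustrative, not taken from our system.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Concatenate image and question representations, then project."""
    def __init__(self, v_dim, q_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(v_dim + q_dim, out_dim)

    def forward(self, v, q):          # v: image features, q: question features
        return torch.relu(self.fc(torch.cat([v, q], dim=-1)))

class MLBFusion(nn.Module):
    """MLB-style fusion: element-wise (Hadamard) product of projected modalities
    approximates bilinear pooling with a low-rank factorization."""
    def __init__(self, v_dim, q_dim, joint_dim):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, joint_dim)
        self.q_proj = nn.Linear(q_dim, joint_dim)

    def forward(self, v, q):
        return torch.tanh(self.v_proj(v)) * torch.tanh(self.q_proj(q))
```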
4 Input Fusion

One of our findings from analyzing the dataset was to use the image size as an additional input cue to the system. This insight triggered an extensive architecture search that included, among others, comparison and training of models with:
– different methods for question encoding, from 1-hot encoding with Bag-of-Words to different word embeddings combined with various types of recurrent neural networks,
– different image encoders, from simple networks containing a few convolutional layers trained from scratch to fine-tuning of selected state-of-the-art models pre-trained on ImageNet,
– various data fusion techniques, as mentioned in the previous section.
Fig. 2. Architectures of two modules used in the final system: (a) Input Fusion, (b) Question Categorizer.
The final architecture of our model is presented in Fig. 2a. We used GloVe word embeddings [20] followed by a Long Short-Term Memory (LSTM) [8]. The LSTM outputs, along with feature maps extracted from images using VGG-16 [23], were passed to the Fusion I module, implementing question-driven attention over image features [11]. Next, the output of that module was concatenated in the Fusion II module with the image size representation created by passing the image width and height through a fully connected (FC) layer. Note that the green-colored modules in Fig. 2a were initially pre-trained on external datasets (ImageNet for VGG-16, and 6B tokens from the Wikipedia 2014 and Gigaword 5 datasets for GloVe) and later fine-tuned during training on the VQA-Med dataset.
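The following PyTorch sketch illustrates the Input Fusion module described above, assuming GloVe-initialized embeddings, an LSTM question encoder, VGG-16 feature maps, a simple one-layer convolutional attention for Fusion I, and an FC encoding of the image size for Fusion II. The hidden sizes, number of attention glimpses and exact attention formulation are our assumptions, not the configuration used in the competition runs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class InputFusion(nn.Module):
    """Sketch of the Input Fusion module (Fig. 2a)."""
    def __init__(self, vocab_size, embed_dim=100, q_dim=256, glimpses=2, size_dim=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)        # initialized from GloVe in practice
        self.lstm = nn.LSTM(embed_dim, q_dim, batch_first=True)     # question encoder
        self.cnn = models.vgg16(pretrained=True).features           # VGG-16 feature maps (pre-trained on ImageNet)
        self.att_conv = nn.Conv2d(512 + q_dim, glimpses, kernel_size=1)  # question-driven attention (Fusion I)
        self.size_fc = nn.Linear(2, size_dim)                       # image size encoder
        self.out_dim = glimpses * 512 + q_dim + size_dim

    def forward(self, image, question_tokens, image_size):
        # Encode the question with the last LSTM hidden state.
        emb = self.embedding(question_tokens)
        _, (h, _) = self.lstm(emb)
        q = h[-1]                                                    # [B, q_dim]

        # Extract image feature maps.
        v = self.cnn(image)                                          # [B, 512, H, W]
        b, c, hgt, wdt = v.shape

        # Fusion I: question-driven attention over image features.
        q_map = q.unsqueeze(-1).unsqueeze(-1).expand(b, q.size(1), hgt, wdt)
        att = self.att_conv(torch.cat([v, q_map], dim=1))            # [B, G, H, W]
        att = F.softmax(att.view(b, -1, hgt * wdt), dim=-1)          # normalize per glimpse
        v_flat = v.view(b, c, hgt * wdt)
        attended = torch.einsum('bgp,bcp->bgc', att, v_flat).reshape(b, -1)

        # Fusion II: concatenate attended features, question encoding, and image size encoding.
        s = torch.relu(self.size_fc(image_size.float()))             # [B, size_dim]
        return torch.cat([attended, q, s], dim=1)
```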
During the architecture search of the Input Fusion module we used the model presented in Fig. 3, with a simple two-FC-layer classifier. These models were trained and validated on the C1, C2 and C3 categories separately, while excluding C4. In fact, to test our hypothesis, we trained some early prototypes only on samples from C4 and those models failed to converge.
Fig. 3. Architecture with a single classifier (IF-1C).
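For completeness, the IF-1C baseline can be sketched by stacking a two-FC-layer classifier on top of the InputFusion sketch above; the hidden size is illustrative.

```python
import torch.nn as nn

class IF1C(nn.Module):
    """IF-1C baseline (Fig. 3): Input Fusion followed by a single two-layer classifier."""
    def __init__(self, fusion, num_answers, hidden=512):   # `fusion`: InputFusion instance from the sketch above
        super().__init__()
        self.fusion = fusion
        self.classifier = nn.Sequential(
            nn.Linear(fusion.out_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, image, question_tokens, image_size):
        return self.classifier(self.fusion(image, question_tokens, image_size))
```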
5 Supporting Facts Network

After establishing the Input Fusion module, we trained it on samples from the C1, C2 and C3 categories. This served as a starting point for training more complex reasoning modules. At first, we worked on a model that exploited information about the 5 categories of questions by employing 5 separate classifiers which used the data produced by the Input Fusion module. Each of these classifiers essentially specialized in one question category and had its own answer label dictionary and associated loss function. The predictions were then fed to the Answer Fusion module, which selected the answer from the right classifier based on the question category predicted by the Question Categorizer module, whose architecture is shown in Fig. 2b. Please note that we pre-trained this module in advance on all samples from all categories and froze its weights during the training of the classifiers.
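The routing through category-specific classifiers and the Answer Fusion selection can be sketched as follows; the category names, their ordering, and the per-sample loop are illustrative assumptions rather than the actual PyTorchPipe implementation.

```python
import torch
import torch.nn as nn

class AnswerFusion(nn.Module):
    """Sketch: route each sample to the classifier matching the category
    predicted by the pre-trained, frozen Question Categorizer."""
    def __init__(self, classifiers: nn.ModuleDict, categorizer: nn.Module):
        super().__init__()
        self.classifiers = classifiers          # e.g. {'C1': ..., 'C2': ..., 'C3': ..., 'C4': ..., 'YN': ...}
        self.categorizer = categorizer          # weights frozen during classifier training
        self.names = list(classifiers.keys())   # index -> category name

    def forward(self, fused, question_tokens):
        with torch.no_grad():
            category = self.categorizer(question_tokens).argmax(dim=-1)   # predicted category per sample
        answers = []
        for i in range(fused.size(0)):
            name = self.names[int(category[i])]
            logits = self.classifiers[name](fused[i:i + 1])
            # Each classifier has its own answer dictionary, so the prediction is
            # a (category, answer-id-within-that-dictionary) pair.
            answers.append((name, logits.argmax(dim=-1).item()))
        return answers
```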
Fig. 4. Final architecture of the Supporting Facts Network (SFN).
The architecture of our final model, the Supporting Facts Network, is presented in Fig. 4. The main idea here resulted from the analysis of questions about the presence of abnormalities: answering them requires knowledge of the image modality and/or organ type. Therefore, we divided the classification modules into two networks: support networks (consisting of two FC layers) and final classifiers (single FC layers). We added the plane (C2) as an additional supporting fact. The supporting facts were then concatenated with the output of the Input Fusion module in Fusion III and passed as input to the classifier specialized on C4 questions. In addition, since Binary Y/N questions were present in both the C1 and C4 categories, we followed a similar approach for that classifier.
6 Experimental Results

All experiments were conducted using PyTorchPipe [14], a framework that facilitates the development of multi-modal pipelines built on top of PyTorch [19]. Our models were trained using relatively large batches (256), dropout (0.5) and the Adam optimizer [13] with a small learning rate.
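A rough sketch of this joint training setup, assuming one cross-entropy loss per category-specific classifier summed into a multi-task loss. The category encoding, batch fields and learning-rate value are hypothetical, and `model` is assumed to bundle the fusion and classifier modules sketched earlier.

```python
import torch
import torch.nn as nn

CAT_IDS = {'C1': 0, 'C2': 1, 'C3': 2, 'C4': 3, 'YN': 4}   # hypothetical category encoding

def training_step(model, optimizer, batch):
    """One joint multi-task update over a batch containing samples of mixed categories."""
    optimizer.zero_grad()
    fused = model.fusion(batch['image'], batch['question'], batch['image_size'])
    criterion = nn.CrossEntropyLoss()
    losses = []
    for name, cat_id in CAT_IDS.items():
        mask = batch['category'] == cat_id                  # samples belonging to this category
        if mask.any():
            logits = model.classifiers[name](fused[mask])
            # Answer ids are indices into this classifier's own answer dictionary.
            losses.append(criterion(logits, batch['answer'][mask]))
    loss = torch.stack(losses).sum()                        # joint multi-task loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (illustrative): optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```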
Table 1 summarizes our results, demonstrating the advantage of the 'supporting facts' approach over the baseline model with a single classifier. The SFN model achieved our team's best score on the test set (0.558 Accuracy, 0.582 BLEU), as indicated by the CrowdAI leaderboard. One of the reasons for such a significant drop in performance is the presence of new answer classes in the test set that were not present in either the original training or validation sets.

Table 1. Summary of experimental results. All columns contain average scores achieved by 5 separately trained models on the resampled training and validation sets. We also present scores achieved by the models on the original sets (in evaluation mode).

Model  | Resampled Valid. Set  | Original Train. Set   | Original Valid. Set
       | Prec.  Recall  F-1    | Prec.  Recall  F-1    | Prec.  Recall  F-1
IF-1C  | 0.630  0.435   0.481  | 0.683  0.497   0.545  | 0.690  0.499   0.548
SFN    | 0.759  0.758   0.758  | 0.753  0.692   0.707  | 0.762  0.704   0.717
7 Conclusions

In this work, we introduced a new model called Supporting Facts Network (SFN), which leverages knowledge learned from combinations of upstream tasks in order to benefit additional downstream tasks. The model incorporates domain knowledge that we gathered from a thorough analysis of the dataset, resulting in specialized input fusion methods and five separate, category-specific classifiers. It comprises two pre-trained shared modules followed by a reasoning module jointly trained with five classifiers using the multi-task learning approach. Our models were found to train faster and to deal much better with label distribution shifts under a small imbalanced data regime.

Among the five categories of samples present in the VQA-Med dataset, C4 and Binary turned out to be extremely difficult to learn, for several reasons. First, there were 1,483 unique answer classes assigned to the 3,082 training samples related to C4. Second, both C4 and Binary required more complex reasoning and, moreover, might be impossible to answer by looking only at the question and the content of the image. However, based on our observation that some of the information from simpler categories might be useful when reasoning about more complex ones, we refined the model by adding supporting networks. Indeed, modality, imaging plane and organ typically help narrow down the scope of disease conditions and/or answer whether or not an abnormality is present. Our empirical studies show that this approach performs significantly better, leading to an 18 point improvement in F-1 score over the baseline model on the original validation set.
References
1. Abacha, A.B., Hasan, S.A., Datla, V.V., Liu, J., Demner-Fushman, D., Müller, H.: VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019. In: CLEF 2019 Working Notes. CEUR Workshop Proceedings (2019)
2. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: VQA: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2425–2433 (2015)
3. Ardila, D., Kiraly, A.P., Bharadwaj, S., Choi, B., Reicher, J.J., Peng, L., Tse, D., Etemadi, M., Ye, W., Corrado, G., Naidich, D.P., Shetty, S.: End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine (2019)
4. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
5. Caruana, R.: Multitask learning. Machine Learning 28(1), 41–75 (1997)
6. Desta, M.T., Chen, L., Kornuta, T.: Object-based reasoning in VQA. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1814–1823. IEEE (2018)
7. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLP (2016)
8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
9. Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning. In: CVPR (2018)
10. Hudson, D.A., Manning, C.D.: GQA: A new dataset for compositional question answering over real-world images. arXiv preprint arXiv:1902.09506 (2019)
11. Kazemi, V., Elqursh, A.: Show, ask, attend, and answer: A strong baseline for visual question answering. arXiv preprint arXiv:1704.03162 (2017)
12. Kim, J.H., On, K.W., Lim, W., Kim, J., Ha, J.W., Zhang, B.T.: Hadamard product for low-rank bilinear pooling. In: ICLR (2017)
13. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
14. Kornuta, T.: PyTorchPipe. https://github.com/ibm/pytorchpipe (2019)
15. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
16. Malinowski, M., Doersch, C.: The Visual QA devil in the details: The impact of early fusion and batch norm on CLEVR. In: ECCV'18 Workshop on Shortcomings in Vision and Language (2018)
17. Malinowski, M., Fritz, M.: A multi-world approach to question answering about real-world scenes based on uncertain input. In: Advances in Neural Information Processing Systems. pp. 1682–1690 (2014)
18. Marois, V., Jayram, T., Albouy, V., Kornuta, T., Bouhadjar, Y., Ozcan, A.S.: On transfer learning using a MAC model variant. In: NeurIPS'18 Visually-Grounded Interaction and Language (ViGIL) Workshop (2018)
19. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch (2017)
20. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
21. Romanov, A., Shivade, C.: Lessons from natural language inference in the clinical domain. http://arxiv.org/abs/1808.06752