Hierarchical Deep Multi-modal Network for Medical Visual Question Answering
Deepak Gupta∗, Swati Suman, Asif Ekbal
Department of Computer Science and Engineering, Indian Institute of Technology Patna, India
Abstract
Visual Question Answering in the Medical domain (VQA-Med) plays an important role in providing medical assistance to end-users. These users are expected to raise either a straightforward question with a Yes/No answer or a challenging question that requires a detailed and descriptive answer. The existing techniques in VQA-Med fail to distinguish between the different question types, and consequently sometimes complicate the simpler problems or over-simplify the complicated ones. It is certainly true that maintaining several distinct systems for different question types can lead to confusion and discomfort for the end-users. To address this issue, we propose a hierarchical deep multi-modal network that analyzes and classifies end-user questions/queries and then incorporates a query-specific approach for answer prediction. We refer to our proposed approach as Hierarchical Question Segregation based Visual Question Answering, in short HQS-VQA. Our contributions are three-fold, viz. firstly, we propose a question segregation (QS) technique for VQA-Med; secondly, we integrate the QS model into the hierarchical deep multi-modal neural network to generate proper answers to the queries related to medical images; and thirdly, we study the impact of QS in Medical-VQA by comparing the performance of the proposed model with QS and a model without QS. We evaluate the performance of our proposed model on two benchmark datasets, viz. RAD and CLEF18. Experimental results show that our proposed HQS-VQA technique outperforms the baseline models with significant margins. We also conduct a detailed quantitative and qualitative analysis of the obtained results and discover potential causes of errors and their solutions.

∗ Corresponding author.
Email addresses: [email protected] (Deepak Gupta), [email protected] (Swati Suman), [email protected] (Asif Ekbal). Both authors contributed equally to this work.
Keywords:
Visual Question Answering, Neural Networks, Medical Domain, Support Vector Machine, Gated Recurrent Units
1. Introduction
The advancements in the fields of Computer Vision (CV) (Arai & Kapoor, 2019; Guo et al., 2020; Li et al., 2020) and Natural Language Processing (NLP) (Ruder, 2019; Ruder et al., 2019; Fu, 2019; Dong et al., 2019) over the last decade have introduced several interesting machine learning techniques. Problems such as object detection (Liu et al., 2020), segmentation (Liu et al., 2019), and image classification (Sun et al., 2020a,b) in CV, and machine translation (Yang et al., 2020), question answering (Gupta et al., 2018b, 2019; Chen et al., 2016; Gupta et al., 2018a,c), biomedical and clinical text mining (Ningthoujam et al., 2019; Yadav et al., 2018, 2019, 2020; Chen et al., 2019), and speech recognition (Magnuson et al., 2020) in NLP, are being solved much more efficiently than ever before. This has encouraged researchers to take up interdisciplinary problems that demand knowledge of both fields.

Visual Question Answering (VQA) (Gao et al., 2019; Kafle et al., 2020) has emerged as one such problem. In VQA, the task is posed as questions being asked with respect to an image, where the machine needs to learn and generate answers to such questions based on the learned features of the input image. In contrast to typical CV tasks, which largely focus on problems such as action identification and image classification, VQA tasks are relatively complex. They demand more intelligence, such as object recognition, semantic feature extraction, external knowledge, and common-sense knowledge. Many domain-specific VQA tasks have surfaced in the last few years, and VQA in the medical domain is one such task that plays a significant role in providing medical assistance. Since this task is related to the medical domain, end-users can be categorized based on the type of queries raised by them. Patients, medical students, and related users are anticipated to ask elementary questions mostly having Yes/No as the answer. On the other hand, clinicians and medical experts are expected to raise more problem-specific queries demanding detailed and descriptive answers. For this reason, different portals would have to be created to satisfy the query-specific needs, but that would lead to confusion and discomfort for the end-users. To tackle this problem, in this paper, we propose a question segregation technique to segregate the user queries. For this module, we use a simple statistical machine learning model based on simple hand-engineered and word frequency-based features.

Towards the solution of the generic VQA-Med problem, we propose a hierarchical deep multi-modal network that analyzes and classifies the queries at the root of the hierarchy and then incorporates the query-specific approach at the leaf nodes. To predict the answers in this model, we generate the question and image representations using Bidirectional Long Short Term Memory (Bi-LSTM) (Graves & Schmidhuber, 2005) and Inception-Resnet-v2 (Szegedy et al., 2017), respectively. We fuse the representations together and pass them to the specific answer prediction model at the leaf node. For the task of question classification at the root node, we propose a question segregation technique. We use a Support Vector Machine (SVM) (Cristianini et al., 2000) as the classifier with hand-engineered and word frequency-based features for QS. We use a machine learning technique for QS, as a rule-based strategy suffers from the problem of defining too many rules that may not extend to other datasets (Clark et al., 2018).
The following examples from the RAdiology Dataset (RAD) (Lau et al., 2018) show the difficulty of the rule-based approach in the medical domain.

• Question: Evidence of hemorrhage in the kidneys?
  Answer: No
  Question-type: ‘Yes/No’

• Question: Is the spleen present?
  Answer: on the patient’s left
  Question-type: ‘Others’
Careful analysis of the questions reveals that the first example expects a descriptive-type answer, that is, to list out the facts that indicate kidney hemorrhage (Question-type: ‘Others’), while the second example expects a confirmation of the presence/absence of the spleen (Question-type: ‘Yes/No’). The presence of such anomalies in the questions acts as a hindrance to the formation of robust rules for classifying questions into their correct type.

We perform all our experiments on the RAD and ImageCLEF 2018 VQA-Med (CLEF18) datasets, as they perfectly capture the problem statement that we intend to solve. A detailed discussion of the datasets can be found in Section 4. Experimental evaluation demonstrates promising results, showing the effectiveness of our proposed approach. Additionally, error analysis of the system's outputs points to future directions in this research area by identifying the different kinds of errors.

The organization of this paper is as follows. We first discuss the related work in VQA. Then we present the details of the methodologies that we implemented to solve our specific problem. In particular, we explain our proposed HQS-VQA model in detail: the technique used for the question segregation module and the VQA components used to generate the query-specific answers. Details of the experiments along with the evaluation results and the necessary analysis are reported next. Finally, we conclude and provide the future directions of our work.
The motivation behind our work stems from the following facts:
• Nowadays, medical and physiological images (“ct-scan”, “x-ray”, etc.) and reports for patients are easily accessible with the increase in the use of medical portals. But the patient still needs to visit a medical expert to fully understand those reports and get answers to their queries. This process is both time-consuming and costly. On the other hand, clinicians also find an efficient VQA system very useful for understanding the results of a complex medical image. They may use such a system as a second opinion just to boost their confidence in the understanding of some specific aspects of such medical images. Although it is possible to issue queries to search engines, the search results may be inaccurate, spurious, vague, or enormous. In the case of medical reports, such inaccurate, spurious, or vague results could lead to serious after-effects. In this context, VQA in the medical domain is getting attention as an important research problem that tries to provide answers to end-user queries related to medical images.

• VQA-Med intends to assist patients and clinicians in general, but it can also be useful in medical education. A clinical apprentice or medical student who has just started learning the basics of handling images of different modalities may learn by asking queries and getting answers. Thus, developing an efficient and automated VQA system for the medical domain comes out as an essential task. Even though many medical datasets have been published publicly, most of them deal with some specific disease in a particular body part with a fixed image modality. The ImageCLEF tuberculosis task (Cid et al., 2018) is one such example, which was published to build models for detection, classification, and severity measurement of TB from the provided chest CT scans. On the contrary, it is a more challenging problem to have a generic strategy that deals with VQA queries regarding multiple image modalities linked to multiple diseases that may appear in any part of the body. The solution to such a generic problem will be considerably less confusing to the patients.
• The difference in the types of end-users results in different query types. The queries from patients are, most of the time, expected to be generic in nature, requiring only ‘Yes’ or ‘No’ as the answer. In contrast, the queries from clinicians and medical experts are expected to be more problem-specific, requiring elaborate answers. Again, a skilled trainee is expected to ask more specific and sophisticated questions, while queries from beginners are likely to be simple and straightforward. For example, a naive trainee may inquire about the presence of any abnormality in the image, whereas a senior trainee may identify the abnormality of ‘intraventricular hemorrhage’ from the image and want to understand more about the grade and effect of the hemorrhage. They can then draw inferences from the acquired data for effective treatment.

• This difference in query types thus needs different problem-specific attention, which needs to be dealt with in isolation. At the same time, multiple end systems for multiple types of queries may create confusion and discomfort for the end-user. There should be a single end-user module to solve both complex and simple queries. Table 1 demonstrates one such system where any clinically relevant question can be asked about an image. Here, the image plays an important role, as the answer to a question may vary based on the provided image.
Question                                            Answer
Is this a cyst in the left lung?                    No
Has the left lung collapsed?                        Yes
Where is the nodule?                                Below the 7th rib in the right lung
What are the densities in both mid-lung fields?     Pleural plaques
Table 1. Sample question-answer pairs formulated from a single image. More than one clinically relevant question can be asked about a given image.

• We identify this need, and propose an SVM-based question segregation technique to segregate the questions. We then use this information to propose a hierarchical deep multi-modal network to generate the answers.

The key contributions of this work can be summarized as follows:

• We propose an SVM-based Question Segregation technique for the task of question classification for VQA in the medical domain.

• We propose a hierarchical deep multi-modal neural model and integrate it with the proposed question segregation module to generate proper answers to queries related to medical images.

• We study the impact of QS in Medical-VQA by comparing the performance of the proposed model with question segregation and a model without such segregation. We also compare it with the baseline models to study its effectiveness.

• We evaluate our model on two different datasets, which demonstrates that our proposed method is generic in nature.
2. Related Work
The major challenges of VQA-Med are closely related to general VQA and to QA in the medical domain, and we see a lot of interesting solutions evolving over time. We present the survey with respect to the related datasets and methods in the following subsections.
A number of research projects have been initiated for the development of benchmark datasets to promote work in the medical domain. The Genomics corpus released as part of the TREC task (Hersh & Bhupatiraju, 2003) is one of the benchmark datasets for the medical QA task. It focuses exclusively on scientific papers. However, the small number of questions in the dataset is not sufficient to evaluate the efficiency of large-scale QA systems. This constraint led to the release of several other datasets, such as Question Answering for Machine Reading Evaluation (QA4MRE) (Morante et al., 2013) and Biomedical Semantic Indexing and Question Answering (BioASQ) (Tsatsaronis et al., 2015). QA4MRE consists of biomedical text on Alzheimer's disease, while BioASQ gathers information from various heterogeneous sources to address real-life questions from biomedical experts. A number of datasets, such as MRI-DIR (Ger et al., 2018), fastMRI (Zbontar et al., 2018), and a few more (Bradley et al., 2017; Vallieres et al., 2017; Shaimaa et al., 2017) focused on different medical tasks, are also available. However, the images in the VQA-Med dataset have different modalities and contain radiological markings such as short information, tags, etc. They may also contain a stack of sub-images, which is not the case with the existing medical datasets. In addition, general VQA datasets (Lin et al., 2014; Mukuze et al., 2018; Gebhardt & Wolf, 2018; Antol et al., 2015) are task-specific, unlike VQA-Med, where a question can be asked about any disease in any part of the body.

In this work, we use the RAD (Lau et al., 2018) and CLEF18 (Ionescu et al., 2018) medical VQA datasets, which are different from the existing VQA datasets. The obvious reason is their focus on the medical domain, which offers distinguishing challenges. The images, questions, and answers must be clinically relevant in order to be a part of these datasets, which is not a constraint in general VQA datasets.
VQA tasks are primarily based on three key components: generating representations of images and questions; passing these inputs through a neural network to produce a co-dependent embedding; and then generating the correct response. Fig 1 illustrates this framework, where the key components can take a wide variety of forms.
Fig 1.
Framework for VQA, where the question and image are taken as input to generate or predict the answer.
VQA systems differ from each other in the way they fuse multi-modal information. Although most open-ended VQA algorithms use a classification mechanism, this strategy can only produce answers seen during training. Multi-word responses can be generated one word at a time using an LSTM (Gao et al., 2015; Malinowski et al., 2015); the generated response, however, is still restricted to words seen in the course of training.

For question encoding, most methods for VQA use a variant of the Recurrent Neural Network (RNN) (Mikolov et al., 2010). RNNs are capable of handling sequence problems, but when an RNN processes lengthy sequences, context information is easily lost. The proposal of the LSTM (Greff et al., 2016) mitigated the long-distance dependency issue. In addition, researchers also discovered that the path from the decoder to the encoder is shortened if the input sequence is inverted, which helps the network memory. The Bi-LSTM (Graves & Schmidhuber, 2005) model combines the above two points and improves the results. The Gated Recurrent Unit (GRU) (Cho et al., 2014) is another notable, and widely used, simplification of the LSTM. As for image feature extraction, Convolutional Neural Networks (CNNs) (Traore et al., 2018) are used, where VGG-Net (Simonyan & Zisserman, 2015) and deep residual networks (ResNet) (He et al., 2016) are the most popular choices.

The application of attention over the image can help to improve the performance of the model by discarding the irrelevant parts of the image. So, attention mechanisms (Xiong et al., 2016; Yang et al., 2016) are usually incorporated in the models so that they may learn to attend to the important regions of the input image. However, attending to the image alone is not enough; question attention is important too, as most of the words in the question may be irrelevant, so the simultaneous integration of both question and image attention is advised (Lu et al., 2016). The fundamental concept behind all these attentive models is that, for answering a specific question, certain visual areas in an image and certain words in a question provide more information than others. The Stacked Attention Network (SAN) (Yang et al., 2016) and the Dynamic Memory Network (DMN) (Xiong et al., 2016) used image features from a CNN feature map's spatial grid. In (Yang et al., 2016), an attention layer specified by a single layer of weights uses the question and image features to calculate an attention distribution across image locations. Using a weighted sum, this distribution is then applied to the CNN feature map to pool across spatial feature locations. It creates a global representation of the image that highlights certain spatial regions.

VQA depends on the image and question being processed together. This was achieved earlier by using simplified methods such as concatenation or the element-wise product, but these methods fail to capture the complex interactions between the two modalities. Later, multi-modal bilinear pooling was proposed, where the idea was to approximate the outer product between the two features, enabling a much deeper interaction between them. Similar concepts have been shown to work well for improving fine-grained image recognition (Lin et al., 2017). Multimodal Compact Bilinear (MCB) pooling (Fukui et al., 2016) is the most significant VQA technique based on bilinear pooling. It calculates the outer product in a reduced dimensional space instead of computing it explicitly, to minimize the number of parameters to be learned. This is then used to predict the relevant spatial features according to the question.
The major change was the use of MCB for feature fusion instead of element-wise multiplication.

Methods for Medical-VQA must be different from those for general VQA, as the sizes of the datasets are incomparable. Another challenge with Medical-VQA is to balance the number of image features (usually thousands) with the number of clinical features (usually just a few) in the deep learning network to avoid drowning out the clinical features. Attention based on bounding boxes also cannot be applied directly, as medical images lack bounding box information. For medical imaging, there are many computer-aided diagnostic systems (Kawahara & Hamarneh, 2016; van Tulder & de Bruijne, 2016; Tarando et al., 2016). Most of them, however, deal with single-disease problems and focus primarily on easily identifiable areas such as the lungs and skin. In contrast to these systems, Medical-VQA deals with multiple diseases at the same time, apart from handling multiple body parts, which is difficult for machines to learn.

Recently, ImageCLEF introduced the challenge of Medical Domain Visual Question Answering, VQA-Med 2018 (Ionescu et al., 2018). The system submitted by Peng et al. (2018) achieved the best performance (in terms of BLEU score) in VQA-Med 2018 for medical visual question answering. They built their best performing systems using ResNet-152 for image feature extraction and Multi-modal Factorized High-order (MFH) pooling (Yu et al., 2018) for language-vision fusion. Zhou et al. (2018) utilized Inception-Resnet-v2 and Bi-LSTM for image and question representation, respectively. They used an inter-attention mechanism to fuse the language and vision features. Their best performing system stood second among all the submitted systems in the challenge. The third best system, submitted by Abacha et al. (2018), uses a pre-trained VGG-16 model for image representation. In the 2019 edition of VQA-Med, Yan et al. (2019) submitted the best system for medical visual question answering. Their approach utilized BERT (Devlin et al., 2018) for question representation and a pre-trained VGG-16 model for image representation. They fused the question and image features using the MFB mechanism.

Inherently, questions follow a temporal sequence and naturally cluster into different types. This question-type information is very important for predicting the response, regardless of the image. The authors in (Kafle & Kanan, 2016) use a similar approach, where they first identify the question type and use this information for answer generation. Our work, however, isolates the learning path based on the question type rather than using this knowledge as a feature. This type of information can also affect model performance, as some VQA models perform better than others for certain types of questions. Therefore, these models can be intelligently combined to leverage their varied strengths. We propose a simple model with a question segregation module which segregates the learning path based on the question types (Yes/No and Others) to reap the benefits of question-type-dedicated models. We use Inception-Resnet to encode the image features and Bi-LSTM for question feature creation.
3. Materials and Methods
Given a pair $(Q, I)$, where $Q$ is the textual question accompanied by a medically relevant image $I$, the Medical-VQA task aims to generate the answer $A$. Mathematically, it can be formulated as

$$ A = f(Q, I, \alpha) \qquad (1) $$

where $f$ is the answer prediction function and $\alpha$ denotes the model parameters. Questions in $Q$ can be categorized into two question types ($q_{type}$). For questions with $q_{type} = Yes/No$, the input $Q$ has a straightforward binary response, while for questions with $q_{type} = Others$, it can have a well-thought-out variable-length response generated from the answer dictionary words. The problem is to develop a hierarchical model with a question segregation module that differentiates the learning path between the two $q_{type}$ values for solving the generic Medical-VQA task.
Our proposed approach towards the solution of the problem is to form a two-level hierarchy, where the top-level task is question segregation and the next-level task is answer prediction. The proposed hierarchical model is depicted in Fig 2. The subsequent sections describe the components of the proposed hierarchy.
Fig 2.
Abstract representation of the proposed hierarchical HQS-VQA model. The first level segregates the questions, while the second level generates the answer using the leaf node. The answer prediction strategy is decided based on the question type.
3.2.1. Question Segregation

Question Segregation, in general, segregates the questions from a question list $Q = [q_1, q_2, \ldots, q_n]$ based on the $q_{type}$, where $n$ denotes the total number of questions. In this list, $q_i = ``w_1 w_2 \ldots w_m"$ denotes the $i$-th question in the list, containing a sequence of $m$ words. The task of question segregation is relatively easier compared to answer prediction. Thus, we find that a simple statistical machine learning model based on simple hand-engineered and word frequency-based features effectively solves the problem posed in the top tier (segregation) of our proposed hierarchical setup. We employ an SVM classifier for this purpose. The SVM is relatively less complex compared to a deep neural network. Moreover, the datasets which we use here are relatively small, and hence deep learning models tend to perform on the lower side compared to the classical supervised SVM-based model. Fig 3 illustrates the entire question segregation process.

Fig 3.
Proposed question segregation module with a linear SVM learner as the base classifier. The extracted feature vectors are fed to the SVM for question segregation.
The classifier and the input to the QS module, i.e., the question feature vectors generated from the questions in the dataset, are explained as follows:
Question Feature Vectors:
From each question, we extract the following two vectors:

• Question Identifier Vector: We form a set of $r$ question identifier words, where each word tries to represent the question motive. We then convert every question into a question vector $V = [v_1, v_2, \ldots, v_r]$, where $v_i \in \{0, 1\}$ indicates the presence or absence of the $i$-th question identifier word in the question.

• Tf-idf Vector:
We use tf-idf (term frequency - inverse document frequency) to assess how significant a word $w_j$ is to a question $q_i$ in $Q$. It can be calculated as:

$$ \mathrm{tf\text{-}idf}(w_j, q_i) = \frac{f(w_j, q_i)}{\sum_{j=1}^{m} f(w_j, q_i)} \ast \log_e \frac{n}{\sum_{i=1}^{n} f(w_j, q_i)} \qquad (2) $$

where $f(w_j, q_i)$ is the frequency of $w_j$ in $q_i$ and $n$ is the total number of questions. From the entire vocabulary, we consider only the top $m'$ words with the highest tf-idf values. We then convert every $q_i$ in the list $Q$ into a tf-idf vector such that the position of every $w_j$ in $q_i$ is represented by $\mathrm{tf\text{-}idf}(w_j, q_i)$. We concatenate both feature vectors to represent a question.
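To make the feature construction concrete, the following is a minimal sketch (not the authors' released code) of how the two question feature vectors could be built with scikit-learn and NumPy. The identifier-word list is the one reported later in the experimental setup; note that scikit-learn's tf-idf weighting differs slightly in normalization from Eq. (2).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# The 10 question-identifier words reported in the experimental setup.
IDENTIFIER_WORDS = ["is", "was", "are", "how", "can",
                    "does", "which", "what", "type", "there"]

def identifier_vector(question):
    """Binary vector marking presence/absence of each identifier word."""
    tokens = set(question.lower().split())
    return np.array([1.0 if w in tokens else 0.0 for w in IDENTIFIER_WORDS])

def build_features(questions, top_m=500):
    """Concatenate the identifier vector with a tf-idf vector over the
    top_m highest-scoring vocabulary words (500 in the paper)."""
    tfidf = TfidfVectorizer(max_features=top_m, lowercase=True)
    tfidf_matrix = tfidf.fit_transform(questions).toarray()
    id_matrix = np.stack([identifier_vector(q) for q in questions])
    return np.hstack([id_matrix, tfidf_matrix]), tfidf

# Toy usage with two questions from the examples above.
questions = ["is the spleen present?", "what does the mri show?"]
X, vectorizer = build_features(questions)
print(X.shape)  # (2, len(IDENTIFIER_WORDS) + tf-idf vocabulary size)
```

The resulting concatenated vectors are the input to the SVM-based question classifier described next.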
Question Classifier: The SVM (Cortes & Vapnik, 1995; Cristianini et al., 2000) is a statistical classification technique, and inspired by its performance in (Wang et al., 2018; Zhi et al., 2018), we use an SVM learner as the base classifier. It takes the question feature vectors as input during the training stage to segregate the questions according to their type. It is a linear function which can be represented as

$$ f(v_i) = \langle v_i, w^T \rangle + b, \quad \text{where} \quad \langle v_i, w^T \rangle = \lVert v_i \rVert \, \lVert w^T \rVert \cos(\theta) \qquad (3) $$

where $v_i$, $w^T$, and $b$ are the feature vector of the $i$-th question, the weight vector, and the bias, respectively. So, $\forall i;\ i \in [1, n]$, either of the following holds, based on $q_{type}$:

$$ \langle v_i, w^T \rangle + b \geq 1, \quad \text{if } (q_{type} == Yes/No) \qquad (4) $$

or

$$ \langle v_i, w^T \rangle + b \leq -1, \quad \text{if } (q_{type} == Others) \qquad (5) $$

We use the SVM with a linear kernel and use the ‘hinge’ loss function. The hinge loss $\ell(y)$ of the prediction $y = f(v_i)$ for a true $q_{type}$ label $t$ and a classifier score $y$ is defined as

$$ \ell(y) = \max(0,\ 1 - t \ast f(v_i)) \qquad (6) $$

3.2.2. Answer Prediction

The answer prediction component is at the second level of our hierarchy, where two separate models $M = \{m_1, m_2\}$ deal with the problem of answer generation at the lowest-level nodes, i.e., the leaf nodes. While $m_1 = Yes/No$ deals with the problem of producing simple
Yes/No answers for simple incoming queries at the first leaf node, $m_2 = Others$, on the other hand, deals with complex queries to produce more expert answers at the second leaf node. We extract the question and image feature vectors, which are then fused together. Based on $q_{type}$, the model passes the fused vector through several layers to finally generate the answer. We outline these tasks in more detail in the following subsections.
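As a rough illustration of this hierarchical routing, the glue code below assumes that the QS classifier, the question feature extractor, and the two leaf models are already trained; none of the names are from the paper, they are placeholders for the corresponding components.

```python
def predict_answer(question, image_path, qs_classifier, vectorize, m1, m2):
    """Route a query through the two-level hierarchy: the QS classifier
    decides the question type, then the type-specific leaf model answers.
    qs_classifier, vectorize, m1 and m2 are assumed trained components."""
    q_type = qs_classifier.predict([vectorize(question)])[0]  # 'yes_no' / 'others'
    leaf_model = m1 if q_type == "yes_no" else m2
    return leaf_model(question, image_path)
```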
Question Feature Extraction:
We first pre-process the questions by converting the words into lowercase and then lemmatizing them to reduce the ambiguity among their different forms. Next, we remove words like ‘the’, ‘and’, ‘with’, etc. to discard useless information. We then map pure numbers to a ‘num’ token and alphanumeric words to a ‘pos’ token to minimize the complexity of information in the questions. We then generate the integer sequence from the pre-processed questions, which is finally fed to the embedding layer together with the word embedding for the extraction of the question feature $F_Q$ of dimension $m \times d$, where $m$ is the total number of words in the question and $d$ denotes the dimension of the word embedding vector. Fig 4 illustrates the entire process.

Fig 4. Flowchart of generating the question embedding. The word embedding is the concatenation of GloVe and custom (sub-word) embeddings, which are used along with the integer sequence representation of the questions by the embedding layer.

The components of the question feature extraction process are described below:

• Word embedding: We use word embedding to vectorize words to capture their meaning. For this, we first generate $d_1$-dimensional vectors $G = [g_1, g_2, \ldots, g_{d_1}]$ using GloVe (Pennington et al., 2014) vectors. We also introduce a sub-word embedding to capture the embedding of unknown words in the medical terminology. For the sub-word embedding, we follow the work of (Bojanowski et al., 2017) on FastText vectors and generate the sub-word embedding vector of dimension $d_2$ as $C = [c_1, c_2, \ldots, c_{d_2}]$. We next concatenate the embeddings to create the final $d$-dimensional word embedding vector $E = [g_1, g_2, \ldots, g_{d_1}, c_1, c_2, \ldots, c_{d_2}]$, where $d = d_1 + d_2$.
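A minimal sketch of this embedding concatenation, assuming 300-dimensional GloVe vectors stored in a plain-text file and some FastText-style sub-word model; the file name and the `subword_model` object are placeholders, not the authors' pipeline.

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a plain-text file: `word v1 v2 ... vd`."""
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def word_embedding(word, glove, subword_model, d1=300, d2=300):
    """Concatenate a GloVe vector with a FastText-style sub-word vector.

    `subword_model` is any object whose __getitem__ returns a d2-dimensional
    vector even for out-of-vocabulary medical terms (e.g., gensim FastText
    word vectors); it is only a placeholder here.
    """
    g = glove.get(word, np.zeros(d1, dtype=np.float32))    # GloVe part (zeros if OOV)
    c = np.asarray(subword_model[word], dtype=np.float32)  # sub-word part
    return np.concatenate([g, c])                          # (d1 + d2)-dimensional E

# Usage (paths and the sub-word model are illustrative):
# glove = load_glove("glove.6B.300d.txt")
# E = word_embedding("hemorrhage", glove, fasttext_vectors)  # 600-dimensional
```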
Image Feature Extraction: For image feature extraction, we use the Inception-Resnet-v2 model. It is a type of advanced CNN that integrates the Inception module (Szegedy et al., 2014) with ResNet. Inception enables one to achieve very good performance at a comparatively low computational cost, while residual connections considerably speed up network training by enabling connection shortcuts. Together they allow the development of deeper and wider networks in the Inception-Resnet-v2 model. Basically, the network utilizes residual links (He et al., 2016) (Fig 5) to combine filters of varying dimensions, which not only prevents the issue of degradation caused by deep structures but also decreases the training cost.
Fig 5.
Canonical form of a 2-layer ResNet block. Layer-2 is skipped over by the activation from layer-1 using a residual link.

The extracted image feature vector is denoted as $F_I$; the features generated are of size 1000.
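As an illustration (a sketch assuming TensorFlow/Keras, not necessarily the authors' exact setup), the 1000-dimensional image feature $F_I$ could be obtained from the pre-trained Inception-Resnet-v2 model as follows:

```python
import numpy as np
from tensorflow.keras.applications.inception_resnet_v2 import (
    InceptionResNetV2, preprocess_input)
from tensorflow.keras.preprocessing import image

# Pre-trained Inception-ResNet-v2; with the ImageNet classification head the
# output is 1000-dimensional, matching the feature size stated above.
model = InceptionResNetV2(weights="imagenet", include_top=True)

def extract_image_feature(img_path):
    """Return the 1000-dimensional image feature F_I for one image."""
    img = image.load_img(img_path, target_size=(299, 299))  # network input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]  # shape: (1000,)

# F_I = extract_image_feature("radiology_image.jpg")  # path is illustrative
```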
Question and Image Feature Fusion: A unidirectional LSTM layer helps to capture the sequence information in the question, but it can only retain prior information, as it has only seen the past inputs. In a bidirectional layer, inputs run in two directions, one from the past to the future and the other vice-versa. Therefore, before bi-modal feature fusion, we feed the extracted question feature vector $F_Q$ to a bidirectional layer with LSTM as the recurrent instance. The process of generating the question representation with the Bi-LSTM is depicted in Fig 6. It helps to preserve information from both the past and the future. We use it with sequences returned, so that the LSTM hidden layer returns a sequence of values, one per time-step, instead of a single value for the entire sequence, such that $\overrightarrow{H} = [\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_m]$ and $\overleftarrow{H} = [\overleftarrow{h}_m, \overleftarrow{h}_{m-1}, \ldots, \overleftarrow{h}_1]$. Here, for the forward and backward directions, $\overrightarrow{H}$ and $\overleftarrow{H}$ are the sequences of hidden state outputs, while $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the hidden state outputs at the $i$-th time-step. The final Bi-LSTM output is then $\overleftrightarrow{H} = [\overleftrightarrow{h}_1, \overleftrightarrow{h}_2, \ldots, \overleftrightarrow{h}_m]$, with $\overleftrightarrow{h}_i = \overrightarrow{h}_i \odot \overleftarrow{h}_{m-i+1},\ \forall i \in [1, m]$, where $\odot$ denotes the concatenation operator. To minimize the problem of overfitting due to the small amount of training data, we also use a dropout value of 50% in this layer.

Fig 6. The processing of a question by the Bi-LSTM to get the question representation. The word embedding of each word in the question is fed to a Bi-LSTM network, and the forward and backward hidden state outputs are concatenated at each time-step to get the final representation of the question.

We feed $F_I$ to a RepeatVector layer (RepeatVector(n) replicates the input feature vector $n$ times) to make its dimension the same as that of $F_Q$ for modeling convenience. We then concatenate the repeated $F_I$ and $\overleftrightarrow{H}$ for fusion. We finally feed the output to a Batch-Normalization (Ioffe & Szegedy, 2015) layer for regularization and to increase the stability of the network. Thus, the normalized fused feature $F$ is:

$$ F = BN\big((Bidirectional(LSTM(F_Q))) \oplus (RepeatVector(m)(F_I))\big) \qquad (7) $$

where $BN$ and $\oplus$ represent Batch-Normalization and concatenation, respectively. The process of feature fusion is illustrated in Fig 7.

Fig 7. The architecture represents the fusion of the two modalities' feature vectors using a concatenate layer, the output of which is normalized to generate the fused feature vector.
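A minimal Keras sketch of the fusion in Eq. (7); the layer sizes follow values reported later in the paper (21-word questions, 600-dimensional embeddings, a question dictionary of 1050 words, 128 Bi-LSTM units per direction, 1000-dimensional image features), but this is an illustrative reconstruction rather than the released implementation.

```python
from tensorflow.keras.layers import (Input, Embedding, Bidirectional, LSTM,
                                     RepeatVector, Concatenate,
                                     BatchNormalization)

MAX_Q_LEN, EMB_DIM, VOCAB_SIZE = 21, 600, 1050  # values reported in the paper
IMG_FEAT_DIM = 1000                              # Inception-ResNet-v2 feature size

# Question branch: integer word indices -> embeddings -> Bi-LSTM sequence output.
q_in = Input(shape=(MAX_Q_LEN,), name="question_ids")
q_emb = Embedding(VOCAB_SIZE, EMB_DIM)(q_in)     # pre-trained weights would be set here
q_seq = Bidirectional(LSTM(128, return_sequences=True, dropout=0.5))(q_emb)

# Image branch: repeat F_I once per question time-step so the shapes align.
i_in = Input(shape=(IMG_FEAT_DIM,), name="image_feature")
i_rep = RepeatVector(MAX_Q_LEN)(i_in)

# Fuse the two modalities and normalize, as in Eq. (7).
fused = BatchNormalization()(Concatenate()([q_seq, i_rep]))
```

The fused tensor `fused` is what the two answer-prediction heads sketched below consume.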
Answer Prediction - Yes/No:
We treat model $m_1$ (c.f. Section 3.2.2) as a two-class classification model. To generate the answer, we flatten the normalized fused feature ($F$) to generate a single long fused feature ($F'$), which we pass through a fully-connected layer with two output neurons. We formulate the prediction procedure using Softmax as the activation function in the fully-connected layer as

$$ \hat{a} = P(a_i \mid F', W, b) = \mathrm{softmax}(F' W_i + b_i) = \frac{e^{F' W_i + b_i}}{e^{F' W_{Yes} + b_{Yes}} + e^{F' W_{No} + b_{No}}}, \quad i \in \{Yes, No\} \qquad (8) $$

where $\hat{a}$ is the prediction probability of selecting the $i$-th answer word ($a_i$) given $F'$, the bias ($b_i$), and the weight matrix $W_i$ ($i \in \{Yes, No\}$). We use categorical cross-entropy as the loss function:

$$ L(a, \hat{a}) = -\big(a_{Yes} \ast \log(\hat{a}_{Yes}) + a_{No} \ast \log(\hat{a}_{No})\big) \qquad (9) $$

where $a_i$ and $\hat{a}_i$ denote the actual and predicted probability, respectively, of selecting ‘Yes/No’ as the answer.
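Continuing the same illustrative Keras sketch (again a reconstruction, not the released implementation), the Yes/No head flattens the fused feature and applies a two-way softmax trained with categorical cross-entropy, as in Eqs. (8) and (9):

```python
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.models import Model

# Flatten the normalized fused feature F into F', then a 2-neuron softmax
# layer scores 'Yes' vs. 'No' (Eq. 8); training uses categorical
# cross-entropy (Eq. 9) with the Adam optimizer, as reported in the paper.
f_prime = Flatten()(fused)
yes_no = Dense(2, activation="softmax", name="yes_no")(f_prime)

m1 = Model(inputs=[q_in, i_in], outputs=yes_no)
m1.compile(optimizer="adam", loss="categorical_crossentropy",
           metrics=["categorical_accuracy"])
```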
Answer Prediction - Others:
We treat model $m_2$ (c.f. Section 3.2.2) as a multi-label classification model, for which we create a separate word-index dictionary for answers $D_a = \{w_1: 1, w_2: 2, \ldots, w_z: z\}$, where $z$ is the total number of unique words in the answer list. We also transform the $x$-th answer list $A(x)$ of answer length $r$ into $A'(x) = [a'_1, a'_2, \ldots, a'_r]$, where $a'_i$ is the $i$-th answer word encoded in the form of a one-hot vector (a binary vector with values 0 and 1). We pass the normalized fused feature ($F'$) to a fully-connected layer with $t$ output neurons to generate $F''$. We formulate the recursive prediction procedure to predict the answer words using a TimeDistributed layer (https://keras.io/api/layers/recurrent_layers/time_distributed/) with Softmax activation as

$$ \hat{a} = P(a_i \mid F'', \hat{a}_{i-1}, b) = \mathrm{softmax}(F'' W_i + b_i) = \frac{e^{F'' W_i + b_i}}{\sum_{j=1}^{z} e^{F'' W_j + b_j}} \qquad (10) $$

where $\hat{a}$ is the probability of selecting the $i$-th answer word ($a_i$) given $F''$, the bias ($b$), and the set of probabilities of previously predicted answer words ($\hat{a}_{i-1}$), and $W_i$ is the weight matrix. $z$ is the number of words in the vocabulary. We use categorical cross-entropy as the loss function:

$$ L(a, \hat{a}) = -\sum_{i=1}^{z} \sum_{j=1}^{r} \big(a_{ij} \ast \log(\hat{a}_{ij})\big) \qquad (11) $$

In Eq (11), for the $i$-th word in the answer sequence of length $r$, $a_{ij}$ and $\hat{a}_{ij}$ denote the actual and predicted probability of selecting the $j$-th word of the answer dictionary having $z$ words.

In the QS module, to create the tf-idf vector, we consider the top 500 words with the highest tf-idf values from the training-set vocabulary of 2000 words. After studying the training set, we select 10 words (‘is’, ‘was’, ‘are’, ‘how’, ‘can’, ‘does’, ‘which’, ‘what’, ‘type’, and ‘there’) to form the set of question identifier words. We use the default values for the remaining parameters (e.g., the C parameter of the SVM). For question embedding, we create a 600-dimensional word embedding by concatenating the two 300-dimensional GloVe and FastText embeddings, following an approach similar to the work proposed in (Ghannay et al., 2016). We create a question dictionary of size 1050 to capture the most frequent words in the questions. For answers, we create a separate answer dictionary of size equal to the count of unique words in the answer list. As a negligible number of questions are of length greater than 21, we fix the maximum question length at 21. For answers of type ‘Others’, we prune the maximum length to 11. However, for Yes/No type answers, the length is 1, as only Yes or No is the probable answer. Consistency in the input length is maintained by appending ‘blank’ at the end of shorter sequences and curbing longer sequences to the required length. For the hidden layer of the Bi-LSTM, we fix 128 neurons in each direction. We use Categorical Accuracy (https://github.com/keras-team/keras/blob/master/keras/metrics.py) as the metric to calculate the mean accuracy rate for multiclass classification problems across all the predictions. We consider a batch of size 256 for training. We set the number of epochs to 251, and to optimize the weights during training we use the Adam optimizer (Kingma & Ba, 2014). We obtained the optimal hyper-parameter values based on the model performance on the validation dataset.
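Under the same illustrative Keras setup, the sketch below combines the ‘Others’ prediction head of Eq. (10) with the training configuration just described. The answer-vocabulary size and the size $t$ of the fully-connected layer are assumptions, and the conditioning on previously predicted words in Eq. (10) is not reproduced in this simplified version.

```python
from tensorflow.keras.layers import Dense, Flatten, RepeatVector, TimeDistributed
from tensorflow.keras.models import Model

MAX_ANS_LEN = 11      # maximum length of an 'Others' answer, as reported above
ANSWER_VOCAB = 2000   # assumed size z of the answer dictionary (dataset dependent)
T_NEURONS = 256       # assumed size t of the fully-connected layer

# F' -> fully-connected layer (F'') -> repeated per answer position ->
# per-position softmax over the answer vocabulary via TimeDistributed (Eq. 10).
f_prime_others = Flatten()(fused)
f_double_prime = Dense(T_NEURONS, activation="relu")(f_prime_others)
per_step = RepeatVector(MAX_ANS_LEN)(f_double_prime)
others_out = TimeDistributed(Dense(ANSWER_VOCAB, activation="softmax"))(per_step)

m2 = Model(inputs=[q_in, i_in], outputs=others_out)
m2.compile(optimizer="adam", loss="categorical_crossentropy",
           metrics=["categorical_accuracy"])
# Training, as reported above: batch_size=256, Adam optimizer, and targets
# given as one-hot encoded answer-word sequences of length MAX_ANS_LEN.
```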
In our work, the datasets we use for evaluating our proposed model have only one reference answer. Thus, reporting high accuracy means generating a high number of answers with exactly the same words as the reference answer. This may not be necessary at all times and is a very complex task, even in the medical domain. Moreover, more than one answer may be correct. This is explained with an example given in Table 2.
Question: Where is the lung lesion located?
Multiple correct answers: • Right lobe  • Lower lobe  • Right lower lobe
Table 2.
Example where the absence of a degree of specification makes all the answers to the question correct.
In this regard, we follow the standard evaluation schemes from the ImageCLEF 2018 VQA-Med challenge. The evaluation metrics are as follows:

• BiLingual Evaluation Understudy (BLEU) (Papineni et al., 2002): It is a popular evaluation metric in machine translation, which compares the generated answer with the reference answer based on the number of n-grams of the generated answer that match the reference answer:

$$ \mathrm{BLEU} = BP \cdot \exp\Big(\sum_{n=1}^{N} w_n \log_e p_n\Big) \qquad (12) $$

where $p_n$ is the modified n-gram precision (to compute it, all candidate n-gram counts and their corresponding maximum reference counts are collected; the candidate counts are clipped by their corresponding reference maximum value, summed, and divided by the total number of candidate n-grams), $BP$ is the brevity penalty to penalize short answers, $w_n$ is a weight between 0 and 1 for $\log_e p_n$ with $\sum_{n=1}^{N} w_n = 1$, and $N$ is the maximum n-gram length. $BP$ can be computed as follows:

$$ BP = \begin{cases} 1 & \text{if } c > r \\ \exp\!\big(1 - \tfrac{r}{c}\big) & \text{if } c \leq r \end{cases} \qquad (13) $$

where $c$ is the number of unigrams in all the candidate answers and $r$ is the best-match length for each candidate answer in the dataset.

The BLEU score serves as a good evaluation metric in this work, but it is not that effective when more than one medical term indicates the same part or symptom (e.g., the words ‘Lung’ and ‘Lobe’ refer to the same organ). As shown in Table 3, all the answers to the question are correct, but the BLEU score decreases, as it fails to consider synonyms during evaluation.
Question: Where is the lesion located?
Multiple correct answers: – Right lobe  – Right lung  – Right lobe of the lung
Table 3.
Example of semantically similar answers. Although not all the answers to the question are the same, they are semantically correct.

• Word-Based Semantic Similarity (WBSS) (Hasan et al., 2018): It is another evaluation metric used to assess the performance of the systems. Specifically, it uses the Wu & Palmer (1994) similarity with the WordNet ontology in the backend.
Y es/N o ’ (c.f. Section 3.2.2): • Precision ( P ): It reflects the fraction of correctly predicted instances ofa class (say c ) from the total number of predicted instances as c .P = |{ instances of c } ∩ { predicted instances as c }||{ predicted instances as c }| (14)We report the macro-averaged precision P m , where C is a set of all thepossible classes, P m = ( (cid:88) c ∈ C P c ) / (cid:107) C (cid:107) (15) • Recall ( R ): It reflects the fraction of correctly predicted instances fromthe total number of actual instances belonging to c .R = |{ instances of c } ∩ { predicted instances as c }||{ instances of c }| (16) https://datasets.d2.mpi-inf.mpg.de/mateusz14visualturing/calculate http://nltk.org/
24e report the macro-averaged recall (R m ) for the set of all classes C similarto Eq (15). • F -score ( F1 ): It is a function of P and R.F1 = 2 ∗ ((P ∗ R) / (P + R)) (17)We report the macro-averaged F1-score similar to Eq (15). • Accuracy ( A ): It reflects the fraction of correctly predicted instancesfrom the total number of instances.A = |{ correctly predicted answers }||{ answers }| (18)
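As an illustration of the word-overlap part of this evaluation (not the official challenge script), sentence-level BLEU can be computed with NLTK after the pre-processing described above; the WBSS computation with WordNet is omitted here.

```python
import string
from nltk.corpus import stopwords
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

STOP = set(stopwords.words("english"))  # requires: nltk.download("stopwords")

def preprocess(answer):
    """Lower-case, strip punctuation, tokenize, and drop English stopwords."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in answer.split() if tok not in STOP]

def bleu(reference, candidate):
    """Sentence-level BLEU between one reference and one generated answer."""
    ref, cand = preprocess(reference), preprocess(candidate)
    return sentence_bleu([ref], cand,
                         smoothing_function=SmoothingFunction().method1)

print(bleu("lower lobe of the right lung", "right lower lobe"))
```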
4. Data Description
Datasets for Medical-VQA consist of natural language questions about the content of radiology images, and the task is to generate the appropriate answer. The questions are framed over the different modalities of medical images, like ‘angiogram’, ‘magnetic resonance imaging’, ‘computed tomography’, ‘ultrasound’, etc., which describe how the image is taken. These images can have different orientations, e.g., ‘sagittal’, ‘axial’, ‘longitudinal’, ‘coronal’, etc. Along with the variety in orientations and modalities, images can be of any body part or organ, such as heart, lung, skull, etc. (Fig 8).

Fig 8. Sample images in the Medical-VQA dataset. The images in this dataset can be of different organs (a to d) and/or modalities (e to h): (a) Brain, (b) Breasts, (c) Chest, (d) CNS, (e) AP, (f) Axial, (g) Coronal, (h) PA.

4.1. RAD Dataset

The RAD dataset is a recently released dataset for VQA in the medical domain. Statistics of the dataset are as follows:
• The training set consists of 3,064 question-answer pairs.
• The test set consists of 451 question-answer pairs.

Some of the images in the dataset are blurred, while others contain markings such as short information, tags, etc. But none of the images in the dataset contains a stack of sub-images. The questions are primarily categorized into 11 categories, viz. abnormality, attribute, color, counting, modality, organ system, other, plane, positional reasoning, and size. The average question length is 5 to 7 words, which is greater than the answer length. 53% of the answers are of Yes/No type, while the rest of them are either one-word or short-phrase answers. The maximum question length in the dataset is about 21 words, with an average of 7 words. It should also be noted that many questions are rephrased versions of each other, which are semantically similar. For example,

• What is the size and density of the lesion?
• Describe the size and density of this lesion?

From the statistical study of the dataset, we find that only 87% of the free-form and 93% of the rephrased questions are unique, while only 32% of the answers are unique. More than half of the answers are of Yes/No type. This is visualized by the peak in Fig 9b for one-word answers.
Fig 9. Word-frequency distribution in the RAD dataset: (a) questions, (b) answers. The graph demonstrates that, for both the questions and the answers, the distribution is almost similar across the train and test data splits.

4.2. CLEF18 Dataset

The CLEF18 task is similar to the RAD task, in that a semi-automatic approach is used to generate the questions and answers from captions. It uses the radiology images and their respective captions extracted from PubMed Central articles (essentially a subset of the ImageCLEF 2017 caption prediction task (Eickhoff et al., 2017)). Due to the way the question-answer (QA) pairs are generated, they are diverse and descriptive. The dataset also contains a lot of artificial questions that are semantically invalid. Table 4 demonstrates the complexity of sample question-answer pairs from the training data.

Question: what reveals prominent bilateral enhancing parietal occipital lesions on flair and t2 sequences and small areas of hyperintensity in the left periventricular white matter on diffusion weighted images?
Answer: mri of the brain

Question: what does mri in sagital plane show?
Answer: the collection was superficial to the muscles of the back and the gluteal region but deep to the posterior layer of the thoraco lumbar fascia
Table 4.
Sample examples from the CLEF18 training data.

Most of the questions and answers in this dataset are descriptive and complicated, as they are generated semi-automatically.

• The training set consists of 5,413 questions along with their respective answers about 2,278 images.
• The validation set consists of 500 questions along with their respective answers about 324 images.
• The test set consists of 500 questions about 264 images.

Some of the images in the dataset are blurred (Fig 10a) and most of the images contain radiology markings (Fig 10b), such as short information, tags, arrows, etc. A few of them even consist of a stack of sub-images (Fig 10c).
Fig 10. Sample images in the CLEF18 dataset: (a) blurred, (b) radiology markings, (c) stacked. Some of the images in this dataset are blurred (hazy/not clear), and/or contain short information in the form of radiology markings, and/or contain a stack of sub-images.
Question categorization is not present in the dataset, and only 0.6% of the answers in training, 6% in validation, and 10% in test data are of Yes/No type. From Fig 11, we can see that the average question length is greater than that of the answers, and the word-frequency distribution is not even across the data splits.
5. Results and Analysis
We compare our proposed methodology with the existing works in VQA-Med.

Fig 11. Word-frequency distribution in the CLEF18 dataset: (a) questions, (b) answers. The graph shows that, across the data splits, the distribution is not comparable for either the questions or the answers.
Towards this, we use the following baseline models.

1. ResNet152 + LSTM + MFH (Peng et al., 2018): We compare our approach with the best system reported in the ImageCLEF 2018 VQA-Med challenge. They used an LSTM to extract the question features, whereas the image features were extracted from the ResNet152 model pre-trained on the ImageNet dataset. For question and image feature fusion, they employed the co-attention mechanism with MFH to generate the question-image representation. They predicted the probable words to form the answers using multi-label classification and then generated the answers using sampling.

2. Inception-Resnet-v2 + Bi-LSTM + Attention (Zhou et al., 2018): Our second baseline model corresponds to the second best system that participated in the ImageCLEF 2018 VQA-Med challenge. Before feeding the question and image to their proposed model, they performed pre-processing steps on both. They employed the Inception-Resnet-v2 model to extract image features and a Bi-LSTM model to encode the questions. They utilized an attention mechanism to fuse the image and question features.

3. VGG-16 + LSTM + SAN (Abacha et al., 2018): This is the third best system that participated in the ImageCLEF 2018 VQA-Med challenge. They used an LSTM to extract the question features, whereas the image features were extracted from the last pooling layer of VGG-16 pre-trained on the ImageNet dataset. The stacked attention network (Yang et al., 2016) is used to fuse the question and image features to obtain a single feature representation. These features are used to predict the answers from the given answer list, which is compiled from the training dataset.

4. VGG16 + BERT + MFB (Yan et al., 2019): We compare the performance of our proposed model with the best performing system that participated in the ImageCLEF 2019 VQA-Med challenge. They utilized the BERT model to extract the question features, whereas the image features were extracted from multiple pooling layers of VGG-16 pre-trained on the ImageNet dataset. Multi-modal Factorized Bilinear (MFB) pooling (Yu et al., 2017) was used to fuse the question and image features.
Table 5. Performance of the Question-Segregation model in terms of Precision, Recall, and F1-score, reported for the Yes/No, Others, and Overall question types on the RAD, CLEF18, and CLEF18+RAD datasets.
Table 5 shows the performance of our QS model on the RAD, CLEF18, and CLEF18+RAD datasets in terms of Precision, Recall, and F1-score. QS using the SVM shows impressive results on the stated datasets. For the CLEF18 dataset, the Recall and F1-score for ‘Yes/No’ type questions are a little lower due to the smaller number of such questions in the training examples.

Table 6 shows the comparison of our proposed approach with the baseline models on the datasets in terms of BLEU and WBSS scores.
Model                     RAD              CLEF18           CLEF18+RAD
                          BLEU    WBSS     BLEU    WBSS     BLEU    WBSS
Peng et al. (2018)        –       –        0.161   0.184    –       –
Peng et al. (2018)∗       0.023   0.104    0.023   0.072    0.027   …
Zhou et al. (2018)        –       –        0.134   0.173    –       –
Zhou et al. (2018)$       0.522   0.532    0.072   0.112    0.277   …
Abacha et al. (2018)      –       –        0.121   0.174    –       –
Abacha et al. (2018)∗     0.035   0.213    0.051   0.170    0.036   …
Yan et al. (2019)∗        0.002   0.011    0.005   0.069    0.002   …
HQS-VQA (proposed)        0.411   0.437    0.132   0.162    0.257   …
Table 6. Comparison between the baseline models and our model in terms of BLEU and WBSS scores. Star (∗) denotes the re-implementation of the proposed work with the authors' reported experimental setups. Dollar ($) denotes the official implementation of the approach provided by the authors.

Our proposed approach for medical visual question answering achieves a BLEU score of 0.132 and a WBSS score of 0.162 on the CLEF dataset. Peng et al. (2018) report a BLEU score of 0.161 on the CLEF dataset. Since there is no official open-source implementation available for their system, we re-implemented the approach with the experimental setup discussed in the paper, but it only achieves a BLEU score of 0.023 and a WBSS score of 0.072 on the CLEF dataset, and a BLEU score of 0.023 and a WBSS score of 0.104 on the RAD dataset. The improvement (in terms of WBSS) on the RAD dataset can be understood by the fact that the RAD test set contains questions which can be answered in a single word. A similar observation is made on the results for the CLEF+RAD dataset.

Zhou et al. (2018) reported the performance in terms of BLEU (0.134) and WBSS (0.173) on the CLEF dataset, and we recorded BLEU and WBSS scores of 0.072 and 0.112 on the CLEF dataset and 0.522 and 0.532 on the RAD dataset with the official implementation of their approach (https://github.com/youngzhou97qz/CLEF2018-VQA-Med). The reason for the significant improvement on the RAD dataset is that Zhou et al. (2018) performed considerable pre-processing on the image, question, and answer, and also post-processed the generated answers. In their pre-processing steps, they adopted image enhancement and reconstructed the images with exceedingly small random rotations, offsets, scaling, and clipping, increasing the data to 20 images per image. For questions, they utilized methods like stemming and lemmatization to alter verbs, nouns, and other words into their original forms. Furthermore, they replaced all the medical terms with their abbreviations; combinations of letters and numbers were replaced with a ‘pos’ token, and pure numbers were mapped to a ‘num’ token. For answers, they used lemmatization and removed all the stop words. They also replaced all the words associated with any number in an answer; these words are generally measures, e.g., ‘cm’. As a post-processing step, they added several simple rules to the generated answers to make them more reasonable. In addition, they also deleted extra prepositions and additional words from the answers to yes/no questions. These additional pre-processing and rule-based post-processing steps make the system highly focused on medical VQA and not adaptable to other domains of VQA. Additionally, the rules favor short answers, which are only useful in the case of answer prediction and may suffer for answer generation. In contrast, our proposed approach achieves better performance for the CLEF dataset and 0.542 and 0.553 BLEU and WBSS scores, respectively, on the RAD dataset.

Abacha et al. (2018) achieved a BLEU score of 0.121 on the CLEF dataset. The official implementation of their system is not available; therefore, we re-implemented the approach with the experimental setup discussed in the paper, but only achieved a BLEU score of 0.051 and a WBSS score of 0.170 on the CLEF dataset, and 0.035 and 0.213, respectively, on the RAD dataset. For the CLEF+RAD dataset, we report the BLEU and WBSS scores in Table 6. We provide the evaluation script and all the re-implementations of the existing works, along with our proposed approach, at https://bit.ly/3aH6EFm.

We also analyze that, in general, a model performs better if more training samples are present; thus, the models should have better scores when trained and tested on the combined dataset. However, further analysis of the results reveals that the models perform better on the individual datasets than on the combined one.
RAD      Question: What solid organ is seen on the right side of this image?
         Answer: The liver
CLEF18   Question: What shows the dilated common bile duct with a filling defect within it, indicating the tumor extending?
         Answer: Magnetic resonance imaging image of the liver
Table 7.
Example question-answer pairs, which show that the question and answer structure in CLEF18 is more complicated than in RAD.
Table 8 demonstrates the impact of the QS module (c.f. Section 3.2.1). It also reveals that QS improves the model's performance by a significant margin, regardless of the question type. The performance difference is clearly visible in Fig 12.
Table 8. Results of our model with and without question-type segregation (without/with QS), reported per question type, on the RAD, CLEF18, and CLEF18+RAD datasets.
Fig 12.
Impact of QS on the model performance, for (a) Yes/No, (b) Others, and (c) Overall questions. It shows that with QS the model performs better, regardless of the type of question or dataset.
Impact of QS on questions of type ‘Yes/No’: For this type of question, the main advantage of QS is that it prevents the model from predicting any answer phrase other than a straightforward ‘Yes’ or ‘No’, while a model without QS can predict any answer word or sequence of answer words, which may turn out to be irrelevant. Table 9 shows the model's predicted responses with and without the QS module.
Question: Is the GI tract highlighted by contrast? Is the surrounding phlegmon normal? Were both sides affected? Does the PET scan show abnormal tracer accumulation?
Ans (w/o.): bilateral bronchiectasis axial internal whorled
Ans (w.): Yes Yes Yes No
Ans (GT): Yes No Yes No
Table 9.
Comparison of the Ground Truth (GT) answers with the predicted answers for questions of type Yes/No, by the model with (w.) and without (w/o.) QS. Without QS, our model predicts answer words that are not expected for questions in this category.
With QS, the search space is reduced to only two words, i.e., ‘Yes’ and ‘No’, while a model without QS will have to unnecessarily predict the answer words from the entire answer dictionary (a dictionary containing all the possible answer words). This reduction in search space leads to a better chance of predicting the right answer. Table 10 demonstrates the effectiveness of QS in our model for ‘Yes/No’ type questions. The scores show that, for the RAD, CLEF18, and CLEF18+RAD datasets, QS improves the precision (Eq. 14), recall (Eq. 16), and F1-score (Eq. 17).
Table 10.
Performance of our model with (w.) and without (w/o.) the QS module for Yes/No type questions (precision, recall, and F1-score on the RAD, CLEF18, and CLEF18+RAD datasets).
Impact of QS on questions of type ‘Others’: A model without QS can predict ‘Yes’ or ‘No’ as an answer, which is completely irrelevant for questions of type ‘Others’. But the quality of prediction increases when QS is integrated with the same model. Table 11 includes several such instances where the model with QS fails to predict the answer correctly but produces an answer that is applicable to the question-image pair and is more acceptable than a straightforward
‘Yes’ or ‘No’.

Question: (1) The image is taken in what plane? (2) Where are the infarcts? (3) Is this patient male or female? (4) What does the mri show?
Ans (w/o.): Yes No No No in
Ans (w.): pa in right chest mass
Ans (GT): axial basal ganglia female tumor

Table 11. Comparison of the Ground Truth (GT) answers (Ans) with the predicted answers for questions of type Others, by the model with (w.) and without (w/o.) QS. Without QS, our model predicts Yes or No, along with other answer words, which are not expected for questions in this category.

On the outputs generated by our model, we conduct a thorough error analysis and classify the main sources of errors into the following types:

1. Semantic Error: This type of error occurs when the system predicts answer words that are semantically comparable to, but not exactly the same as, the words of the ground-truth answer.

2. Modality/Plane Confusion:
This type of error specifically occurs forquestions that require identification of the modality/plane. For suchquestions, the system fails to identify whether only the plane, modality, ora prediction of modality subtype is sufficient, or the question requires apossible combination of these.3.
Specification Error:
When the question itself fails to specify how muchinformation is desired in the answer, this type of error occurs where morethan one correct answer is possible.4.
Boundary Loss:
This type of error occurs when the system predicts thecorrect answer but does not predict the unimportant ground truth answerwords that the question itself can determine.5.
Miscellaneous Error:
This type of error occurs when the system predictsthe information needed to answer the question, but fails to analyze theinformation collected to answer the question.Table 12, Table 13, and Table 14 include a variety of examples along with38 emanticError:Question :
Semantic Error:
Question: (1) Where do you see acute infarcts? (2) How is the patient oriented? (3) How was this image taken? (4) Where is the lesion located?
Ans (GT): (1) R frontal lobe (2) Posterior-Anterior (3) T2-MRI (4) Lower lobe of the right lung
Ans (pred.): (1) right frontal lobe (2) pa (3) mri t2 weighted (4) right lower lobe
Comments: (1) R refers to Right. (2) pa is an acronym for Posterior-Anterior. (3) T2-MRI denotes mri (t2-weighted). (4) Semantically the same answer with a different sentence structure.

Modality/Plane Confusion:
Question: (1) What imaging modality is this? (2) What kind of image is this? (3) What type of image is this? (4) What type of image is this?
Ans (GT): (1) Sagittal view of t2 weighted mri (2) X-ray (3) Plain film x-ray (4) CT with contrast
Ans (pred.): (1) mri t2 weighted axial (2) x ray (3) x ray (4) ct
Comments: (1) Identification of the plane is not required as per the question. (2) Predicted answer is more precise. (3) GT answer is more precise. (4) GT answer consists of the modality and its subtype information.

Table 12. Error analysis (Semantic Error and Modality/Plane Confusion): Comparison of Ground Truth answers with the predicted answers for understanding the error types.

Specification Error:
Question: (1) What does the ct scan of the chest show? (2) What lobe of the brain is the lesion located in? (3) What is the location of the lesion? (4) Where are the lesions formed?
Ans (GT): (1) A large mass (2) Right frontal lobe (3) Right lower lateral lung field (4) Mediastinum and Hilum of the right lung
Ans (pred.): (1) mass (2) right lobe (3) lower lung (4) in right lung
Comments: (1) Predicted answer does not specify the amount of mass. (2) Predicted answer lacks image-plane specification. (3) GT answer is more specific in terms of lesion location. (4) GT answer specifically defines the lesion formation.

Boundary Loss:
Question: (1) In what plane was this image taken? (2) What lobe of the brain is the lesion located in? (3) Is the spleen present? (4) What kind of image is this?
Ans (GT): (1) Axial plane (2) Right frontal lobe (3) On the patient's left (4) T2 weighted mri
Ans (pred.): (1) axial (2) right frontal (3) left (4) t2 weighted
Comments: (1) 'plane' is an uninformative word in the GT answer. (2) Uninformative word 'lobe' in the GT answer. (3) GT answer consists of additional and uninformative words. (4) Predicted answer lacks modality information.

Table 13. Error analysis (Specification Error and Boundary Loss): Comparison of Ground Truth answers with the predicted answers for understanding the error types.

Miscellaneous Error:
Question: (1) Is this an MRI or a CT scan? (2) Is this patient male or female? (3) Where are the lesions found? (4) What does the ct pulmonary angiogram show?
Ans (GT): (1) MRI (2) Female (3) In both lungs (4) Massive filling defect
Ans (pred.): (1) brain (2) chest (3) in right (4) large defect
Comments: (1) Question demands modality identification, but an organ is identified and predicted. (2) Question demands gender identification, for which chest analysis is required. (3) Partial prediction. (4) Type of defect is not identified.

Table 14. Error analysis (Miscellaneous Error): Comparison of Ground Truth answers with the predicted answers for understanding the error types.

To quantify each error type, we randomly choose 100 incorrectly generated samples from the CLEF+RAD dataset and categorize them into the five error types. We quantitatively analyze the errors and find that 20.54% of the errors belong to Modality/Plane Confusion and 14.16% belong to Semantic Error. Similarly, we find that 16.48% of the errors fall into the Specification Error type. Boundary Loss type errors contribute 20.49% of the total errors. The remaining errors (28.33%) belong to the Miscellaneous Error type.
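The remaining share follows from simple arithmetic over the reported categories, as in the short sketch below; the per-category percentages are the ones reported above, and the helper itself is only illustrative.

```python
# Reported error shares on the 100 sampled CLEF+RAD errors (in %).
reported = {
    "Modality/Plane Confusion": 20.54,
    "Semantic Error": 14.16,
    "Specification Error": 16.48,
    "Boundary Loss": 20.49,
}

# Whatever is left over is attributed to the Miscellaneous Error type.
miscellaneous = round(100.0 - sum(reported.values()), 2)
print(miscellaneous)  # 28.33
```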
Conclusion

In this paper, we propose a hierarchical multi-modal approach to tackle the VQA problem in the medical domain. In particular, we use a question segregation module at the top level of our hierarchy to divide the input questions into two different types ('Yes/No' and 'Others'), followed by individual and independent models at the leaf level, each dedicated to the type of question segregated at the previous level. Our proposed approach can be applied to any related problem where such segregation is possible, although it does require non-trivial changes in the architecture. To evaluate the usefulness of our proposed model, we conduct experiments on two different datasets, RAD and CLEF18. We also perform experiments on the combined data of the above two datasets to show the generalisability of our approach. Models trained with the proposed hierarchy with QS score better, outperforming all the stated baseline models. This suggests that questions of different types learn better in isolation, each with its individual learning path. Experimental results indicate the effectiveness of our work, depicting its value for VQA in the medical domain. We also find that even simple versions of our model are competitive.

Further analysis of the obtained results reveals that the evaluation metric needs improvement for evaluating VQA in the medical domain. For future work, we plan to investigate a better evaluation strategy for the task, apart from devising a more detailed scheme for QS. We also plan to introduce better individual models to handle each of the leaf-node tasks.
Acknowledgment
Asif Ekbal gratefully acknowledges the Young Faculty Research Fellowship (YFRF) Award, supported by the Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, and implemented by Digital India Corporation (formerly Media Lab Asia).