Medical Information Retrieval and Interpretation: A Question-Answer based Interaction Model
Nilanjan Sinhababu, Rahul Saxena, Monalisa Sarma, Debasis Samanta
Nilanjan Sinhababu a, Rahul Saxena b, Monalisa Sarma a, Debasis Samanta c,∗

a Subir Chowdhury School of Quality and Reliability, IIT Kharagpur, India
b Department of Chemical Engineering, IIT Kharagpur, India
c Department of Computer Science and Engineering, IIT Kharagpur, India
Abstract
The Internet has become a very powerful platform where diverse medical information is shared daily. Recently, a huge growth has been seen in searches for symptoms, diseases, medicines and many other health-related queries around the globe. Search engines typically populate their results from the single query provided by the user, so reaching the final result may require a lot of manual filtering on the user's end. Current search engines and recommendation systems still lack the real-time interaction that could enable more precise result generation. This paper proposes an intelligent and interactive system tied to the vast medical big-data repository on the web and illustrates its potential in finding medical information.
Keywords:
Information Processing, Computerised Interaction, Question Answering System, BiLSTM, Attention Mechanism

∗ Corresponding author.
Email addresses: [email protected] (Nilanjan Sinhababu), [email protected] (Rahul Saxena), [email protected] (Monalisa Sarma), [email protected] (Debasis Samanta)
Preprint submitted to Expert Systems with Applications, January 26, 2021

1. Introduction

Numerous health-related resources are streamlined by the Internet. The Internet empowers users to gain fast access to data that can help in the analysis of medical issues or the development of appropriate treatments. It enables consumers to gather health-related information themselves, from the comfort of their homes. The Internet can likewise provide various kinds of health-related information beyond the immediate arrangement of care, and it requires no administrative or financial overheads. Given these benefits, many consumers tend to use the Internet as a source of their medical information. But at the end of the day, searching through this huge amount of data and collecting health-related information is a quite difficult, if not impossible, task. Also, there is always a chance of mistakes in manually checking the results provided by search engines. Although, with advancements in search engine technology, searching is far superior to before, search engines are built to provide immediate results based on a single query. Hence, these kinds of systems are only suitable for advanced users, at least in medical domains. The process of finding information on the Internet can be divided into two major steps. The first is the extraction of data present on the web regarding a particular query provided by the user; this is the step that search engines are well optimised to do. The second is narrowing down the huge volume of data collected in the extraction phase by using some interaction mechanism, and finally providing the user with well-filtered data that does not require manual filtering on the user's end.
To perform such intelligent searching, the two most important techniques required are information retrieval and natural-language interactive question generation. Information discovery and interaction systems have been closely connected for many years and are two aspects of computation that have proven superior in various domains. Recent advances in these techniques have provided a way of enabling a truly "intelligent" medical information retrieval system. The presence of a huge volume of data, specifically unstructured data, on the Internet has made it quite a difficult task to discover and find crucial information. Information retrieval is key in many fields, with an intent to collect information or develop knowledge databases. Earlier, these unstructured sources required manual intervention for any kind of knowledge discovery, but with the advancement of machine learning techniques the process of extracting information has become automatic and has hence gained a lot of importance. With considerable research on machine learning for text data, various methods for information discovery have been proposed. Techniques such as embedding and clustering have provided the community with ways of dealing with unlabeled text data.

On the other hand, question-answer modelling systems are becoming one of the most important and emerging areas of interest in the research community. Earlier, interaction with a computer was limited to rule-based methods only, but with improvements in deep learning models, computer interactions are becoming more intelligent and accurate. These systems are able to help teachers and students by generating intelligent questions such as fill-in-the-blanks, MCQs and subjective questions from paragraphs, sentences and words. Furthermore, these systems have already proved beneficial to artificial intelligence assistants and robots, where interaction is required from both parties ? .
A question answering system can be beneficial for casual users who may ask simple factual questions ? , doctors who seek quick answers on medicines or medical equipment, and patients who seek medical treatment information online ? . This paper presents a novel way of extracting unstructured information from the Internet and, through interaction, providing users with medical queries the most useful results.

There are various research works that focus on converting unstructured data into some structured form for information retrieval purposes ??? . They generally use rule-based methods for known data, and for unknown data general machine learning and statistical methods are applied to extract certain information ??? . Regarding question generation models, the existing techniques greatly depend on the context of the input text documents and the required outputs. General RNN models do not capture enough context information to generate grammatically and contextually correct questions ? . To overcome this issue, LSTM models have been proposed, and bi-LSTM in general performs the best for the problem of question generation. To further improve question generation models, attention and coverage mechanisms are utilized, outperforming other models in this domain ?? .

Existing information retrieval methodologies are prone to unreliability due to the increasing variety of enormous unstructured data ? . These models are generally static and serve only a specific type of data ?? . In our literature survey, we were unable to find a research work that focuses on dynamic information retrieval. Further, information retrieval by reducing an unstructured web data corpus based on an interaction mechanism is totally missing in the literature. Our objective is to dynamically filter data to provide users with minimal, focused data using some interaction mechanism. For this work, the existing approaches to information retrieval are not sufficient.
Further, bi-LSTM with attention and coverage suffers the most when the input sequence is too small ? ; that is, model performance is greatly dependent on the length of the input sequences provided. This particular work requires the question generation model to be able to generate a valid context-based question even from a single word, but the current state-of-the-art techniques are prone to error for smaller input sequences. A model that can generate proper context-based questions even with small (single-word) input sequences is missing and needs to be investigated.

The importance of question generation models is taken into consideration to address the man-machine interaction problem and how some intelligent question(s) can be formulated in order to retrieve relevant context information. Further, this context information can be interpreted using the answer(s) provided by the user to derive a decision. To overcome the limitations of the existing models, this research paper has three main objectives, as pointed out below:

1. Identifying a questionable sentence, which has enough context information for generating a decision.
2. Generating contextually correct, medically related natural language questions from the questionable entity, even with smaller input sequences.
3. Analysing the answers provided by the user to reduce the corpus volume.

The later sections of this paper are arranged in the following sequence. First, the related work and state of the art are identified. Then the task definition is outlined, followed by the demonstration of the model structure. Then the experimental parameters are discussed, and finally the results, discussion and validity of the model are presented.
2. Related Work
In this section, the related work is divided into different parts according to the categories of documents involved. This helps categorize the content of the research papers more easily and hence greatly increases readability. The related work covers information retrieval in unstructured data and question-answer generation techniques.

Andrew McCallum in 2005 ? showed different ways to convert raw unstructured data from web search into structured data, such as rule-based models, hidden Markov models, conditional probability models and text classification.

Johanna Fulda et al. in 2015 ? developed a browser-based authoring tool that automatically extracts event data from temporal references in unstructured text documents using natural language processing and encodes them along a visual timeline. It uses context-dependent semantic parsing for entity extraction.

Isabelle Augenstein et al. in 2012 ? modelled an approach (LODifier) that combines deep semantic analysis with named entity recognition, word sense disambiguation and controlled Semantic Web vocabularies in order to extract named entities and relations between them from text and to convert them into an RDF representation linked to DBpedia and WordNet.

Rabeah Al-Zaidy et al. in 2012 ? provided an end-to-end solution to automatically discover, analyze and visualize criminal communities from unstructured textual data: discovering all prominent communities and measuring the closeness among their members, generating all indirect relationship hypotheses up to a maximum, user-specified depth, and efficiently pruning the non-prominent communities while examining the closeness of the ones that could potentially be prominent. They used a named entity tagger, the Apriori algorithm, and the open and closed discovery algorithms described in Srinivasan (2004).

Zuraini Zainol et al. in 2017 ?
discussed the development of a text analytics tool that is proficient in extracting, processing and analyzing unstructured text data and visualizing the cleaned text data in multiple forms such as a Document Term Matrix (DTM), frequency graph, network analysis graph (based on co-occurrence), word cloud and dendrogram (tree-structured graph).

Byung-Kwon Park and Il-Yeol Song in 2011 ? performed text mining operations and used XML-OLAP, DocCube and Topic Cube to extract structured data from text documents, based on the schema of the structured data. The architecture is built on the concept of consolidation OLAP, which integrates two heterogeneous OLAP sources: relational OLAP and text OLAP. Consolidation OLAP enables users to explore structured and unstructured data at once to obtain total business intelligence, and to find information that could otherwise be missed when relational or text OLAP are processed separately.

Shane Ahern et al. in 2007 ? developed a sample application that generates aggregate location-based knowledge from unstructured text associated with geographic coordinates. They used the k-Means clustering algorithm based on the photos' latitude and longitude, and a TF-IDF approach that assigns a higher score to tags that have a larger frequency within a cluster compared to the rest of the area under consideration.

K. L. Sumathy and M. Chidambaram in 2013 ? gave an overview of the concepts, applications, issues and tools used for text mining; the steps involved in text mining are discussed and a list of tools is also given.

Paolo Nesi et al. in 2014 ? used part-of-speech tagging, pattern recognition and annotation for extracting addresses and geographical coordinates of companies and organizations from their web domains.

Ammar Ismael Kadhim et al. in 2014 ? proposed a processing pipeline for unstructured data. Their methodology includes preprocessing (stemming, removing stop words, removing highly frequent and least frequent words).
TF-IDF was used for document representation, and SVD was used to reduce the high-dimensional dataset to a lower dimension.

The Intel AI Developer Program in 2019 ? developed a system that can generate logical questions from a given text input, something previously only humans were capable of doing. Their process involved selecting important sentences, using a parser to extract NPs and ADJPs from the important sentences as candidate gaps, and then generating fill-in-the-blank and brief-answer questions using the NLTK parser and grammar syntax logic. They succeeded in forming two types of questions: fill-in-the-blank statements and fully stated questions.

Alec Kretch in 2018 ? developed T2Q for various types of questions using a US history dataset. A list of subjects and corresponding phrases was built from the tokens and POS tags using text chunking, Stanford Dependencies and a multiclass classifier. Universal Dependencies were extracted using Stanford CoreNLP's UDFeatureAnnotator, and some words were then converted to a synonym using Princeton WordNet or GloVe. T2Q also matches common patterns that identify the date/year of a given event. Questions of each question type for each subject/phrase were formed using the basic pattern of each question type as a guide. Then similar, but false, answers for each question were formed based on the question type, using patterns of good false answers as a guide.

Athira P. M. et al. in 2013 ? described the architecture of a Natural Language Question Answering (NLQA) system for a specific domain based on ontological information, a step towards semantic-web question answering. Their main steps include syntactic analysis, semantic analysis, question classification and query reformulation. They also performed answer filtering and answer ranking. Their results showed that they were able to achieve 94% accuracy in natural language question answering in their implementation.

Ming Liu et al. in 2012 ? proposed a semi-automatic question generation approach to support academic writing.
Key concepts were identified using an unsupervised algorithm to extract key phrases from an academic paper. The system then classifies each key phrase based on a Wikipedia article matched to it using a rule-based approach, with Wikipedia serving as the domain knowledge base. Knowledge from a single article is used to build conceptual graphs from which questions are generated. To evaluate the quality of the generated questions, a bystander Turing test was conducted, which showed a good system rating.

Eiichiro S. et al. in 2005 ? proposed a technique for the automatic generation of fill-in-the-blank questions (FBQs), together with testing based on Item Response Theory (IRT), to measure English proficiency. A method based on item information was proposed to estimate the proficiency of the test-taker using as few items as possible. Results suggest that the generated questions plus IRT can estimate non-native speakers' English proficiency, while the test can be completed almost perfectly by native English speakers. The number of questions can be reduced by using item information in IRT.

Yllias Chali and Tina Baghaee in 2018 ? proposed a sequence-to-sequence model that uses attention and coverage mechanisms to address the question generation problem at the sentence level. The attention and coverage mechanisms prevent language generation systems from generating the same word over and over again, and have been shown to improve a system's output. They used a simple RNN encoder-decoder architecture with the global attention model, and further applied a coverage mechanism, which prevents the word-repetition problem. Experimental results on the Amazon question/answer dataset showed an improvement in automatic evaluation metrics as well as human evaluations over state-of-the-art question generation systems.

Xinya Du et al. in 2017 ?
framed the task of question generation as a sequence-to-sequence learning problem that directly maps a sentence from a text passage to a question. Their approach is entirely data-driven and requires no manually generated rules. They modelled the conditional probability using an RNN encoder-decoder architecture and adopted the global attention mechanism to make the model focus on certain elements of the input when generating each word during decoding. They investigated two variations of their model: one that encodes only the sentence and another that encodes both sentence- and paragraph-level information. Automatic evaluation results showed that their system significantly outperforms the state-of-the-art rule-based system, and in human evaluations, questions generated by this method were also rated as more natural.
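The global attention step shared by several of the surveyed generators can be sketched in a few lines. This is a toy illustration with plain dot-product scoring over one decoding step; the actual models score with learned parameters.

```python
import numpy as np

def global_attention(decoder_state, encoder_states):
    """Global attention for one decoding step: score every encoder state
    against the current decoder state, softmax the scores into an attention
    distribution, and return the weighted context vector."""
    scores = encoder_states @ decoder_state          # one score per source position
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over source positions
    context = weights @ encoder_states               # weighted sum of encoder states
    return context, weights
```

The context vector is then combined with the decoder state to predict the next word, which is how the decoder "focuses on certain elements of the input" at each step.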
3. Proposed Methodology
Before describing the detailed outline of this architecture, we shall use a real example to show how exactly it corresponds to the way a human tackles such problems. Imagine a scenario where a student has to find a particular word in a book without an appendix section. The most effective way to perform this search is to know which domain the term belongs to, so that a relation to a chapter name can be found. After reaching that chapter, the student can judge in which particular section the term may be occurring and selects that. If the student is unable to find the term in that section, he repeats this for the other related sections until the term is found. The non-human alternative is to search for the term throughout the book, either serially or randomly. To solve this type of problem on computers, advanced and interactive techniques must be used to narrow down to the final result from a huge collection of unstructured data gathered for a particular query.
Figure 1: The proposed model framework
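At a high level, the loop of Figure 1 can be expressed as the following control-flow sketch. Every helper here is a toy stand-in for illustration only; the actual system clusters sentence-transformer embeddings and computes the cluster relationship matrix with the Word Mover's Distance, as described later in this section.

```python
import numpy as np

# Toy stand-ins for the real stages (assumptions for illustration only).
def toy_clusters(corpus, n):
    return [corpus[i::n] for i in range(n) if corpus[i::n]]

def toy_crm(clusters):
    # Pretend pairwise cluster distances, normalised to [0, 1].
    k = len(clusters)
    return np.abs(np.subtract.outer(np.arange(k), np.arange(k))) / max(k - 1, 1)

def retrieve(corpus, delta=0.4, max_rounds=10):
    """Control flow of Figure 1: cluster, build the CRM, recluster while every
    inter-cluster distance stays below delta, then rank the most distinct cluster."""
    n = 2
    for _ in range(max_rounds):
        clusters = toy_clusters(corpus, n)
        crm = toy_crm(clusters)
        off_diag = crm[~np.eye(len(clusters), dtype=bool)]
        if off_diag.size == 0 or off_diag.max() < delta:
            n += 1                                 # reclustering (Algorithm 2, Step 6)
            continue
        best = int(crm.sum(axis=1).argmax())       # maxDisCluster (Algorithm 2, Step 5)
        return sorted(clusters[best])              # ranking stand-in
    return []
```

In the full system the returned ranked items feed the question generation phase, and the user's answer prunes the corpus before the loop repeats.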
To carry out the interaction-based information retrieval, a model is proposed as shown in Figure-1. The two main objectives of this model are to find a word or a sentence from which a question can be developed, and to build a specialized natural language question generator that can generate questions using either a word or a sentence. The first phase of this model deals with the identification of a questionable entity. To identify a questionable entity, the unstructured raw data can be clustered into various sections and, using interaction mechanisms, converged onto the particular information the user is looking for. The model contains a section that retrieves unstructured data from the web following a query provided by the user. That raw data contains many unnecessary tokens that need to be either removed or changed in order to make the later processes more accurate. The preprocessed tokens (words, here) are fed into the clustering section. The K-Means clustering algorithm is selected for clustering as per the experiment in Section-3.2, and research shows K-Means to be beneficial for text data clustering ? . After clustering is complete, a technique is proposed to identify the relations between the clusters in a semantic way. This technique is named the Cluster Relationship Matrix (CRM). Using the CRM, a semantic distance is calculated and compared with a hyper-parameter threshold δ. A mechanism is developed to assess cluster quality and tell whether any further re-clustering is required. If re-clustering is required, then a cluster is selected using Algorithm 2 (Step-6). Similarly, if re-clustering is not required, then a cluster is selected according to Algorithm 2 (Step-5) for ranking. A ranking mechanism is used to rank the items present in the selected cluster.
After the cluster items are ranked, the sentence containing the top-ranked item is selected and passed to the question generation phase. The above describes the process used to identify a questionable entity. Once identified, this entity (basically a word or a sentence from which a question can be formed) is passed to a sequence-to-sequence ? encoder-decoder RNN model ? to generate questions. A hierarchical attention model ? is introduced into the attention layer of this question generation model to provide both word-based and sentence-based attention discovery. This helps in the generation of quality questions for both word and sentence inputs. Later, analysis of the answer provided by the user is used to estimate the expected conclusion of the job and whether to terminate or continue the process. After termination, the final output result is provided to the user.

Clustering is an integral part of the machine learning process with unlabelled data. Selection of a suitable clustering algorithm plays an important role in making the process of data discovery more accurate and efficient. The data to be clustered in this paper have two important properties: firstly, they are of text and numerical type, and secondly, they are collected with respect to a particular query. As far as clustering is concerned, the words are selected as the tokens to be clustered. For testing the capability of clustering algorithms on text-type data relevant to this work, a basic experiment was performed with a text corpus created by combining three labelled datasets. Clustering algorithms can be broadly divided into three types, so for the experiment, one of the most used algorithms in this particular domain was selected from each of the following three categories.

1.
Partition-based clustering - In partition clustering, each and every point in the dataset is assigned to exactly one of the K clusters formed by the algorithm, where K is a parameter that has to be passed to the algorithm; various methods are available to find an optimal K ? . The most widely used algorithm in this category is K-Means. The K-Means algorithm scales very well with large data ?? . K-Means basically tries to minimize the WCSS (within-cluster sum of squares). K-Means with cosine similarity is spherical clustering ? .

2. Hierarchical clustering - Hierarchical clustering generates a tree structure of nested groups of clusters ? . Agglomerative clustering is a widely used algorithm in this category. It is a bottom-up algorithm: it initially treats each data point as a cluster, and clusters are then successively merged together. For the merge strategy, a metric known as the linkage criterion is used ?? .

3. Density-based clustering - The idea behind density-based clustering is to find clusters as areas of high point density separated by areas of low point density, which leads to arbitrarily shaped clusters. High-density points are considered to be within clusters and low-density points are considered to be noise. One of the most widely used algorithms in this category is the GMM. A GMM assumes there are K Gaussian distributions from which each and every data point was generated, all with unknown parameters. A GMM can be thought of as an extension of K-Means where information about the covariance structure of the data and the centres of the latent Gaussians is also incorporated into the algorithm ? .

For clustering evaluation, four algorithms were selected, namely spherical clustering, K-Means with Euclidean distance, agglomerative clustering and Gaussian Mixture Model (GMM) clustering, due to their high accuracy on text data.
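The relation noted above between K-Means and spherical clustering can be illustrated with a toy implementation; this is a sketch, not the library implementations used in the experiments. L2-normalising the vectors makes nearest-centre assignment under Euclidean distance equivalent to assignment by cosine similarity.

```python
import numpy as np

def kmeans(X, k, spherical=False, iters=50, seed=0):
    """Plain K-Means (minimises WCSS). With spherical=True, points and centres
    are L2-normalised, so Euclidean assignment == cosine-similarity assignment."""
    rng = np.random.default_rng(seed)
    if spherical:
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # assign each point to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centres (keep the old centre if a cluster goes empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if spherical:
            new = new / np.linalg.norm(new, axis=1, keepdims=True)
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

The library versions add refinements such as K-Means++ initialisation, but the objective being minimised is the same.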
The steps associated with the selection of the best algorithm for this particular experiment, namely preprocessing, embedding and finally the selection itself, are discussed in the sub-sections below. The web sources used to get labelled data had the data arranged in tabular or list format, so not much preprocessing was required. Therefore, only basic preprocessing was done before using Sentence Transformers for embedding, using the steps discussed below.
Algorithm 1
Preprocessing for clustering algorithm selection stage

1. Replaced “ ” with spaces
2. Converted all characters to lowercase
3. Replaced tabs with white spaces
4. Removed all special characters
5. Removed all multiple spaces
6. Applied lemmatisation

After preprocessing is complete, three lists are generated: one containing symptoms with 312 items, one containing diseases with 315 items, and one containing medicines with a total of 300 items.
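A minimal sketch of these cleaning steps follows. The lemmatisation step is left as a pluggable stub, since the paper does not specify which lemmatiser was used (NLTK's WordNetLemmatizer would be one common choice).

```python
import re

def preprocess(text, lemmatise=lambda w: w):
    """Basic cleaning mirroring Algorithm 1; `lemmatise` is a stub for
    whatever lemmatiser is applied in the final step."""
    text = text.lower()                        # lowercase
    text = text.replace("\t", " ")             # tabs -> spaces
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop special characters
    text = re.sub(r"\s+", " ", text).strip()   # collapse multiple spaces
    return " ".join(lemmatise(w) for w in text.split())
```

For example, `preprocess("Fever,\tHead-ache!!")` yields `"fever head ache"`.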
Embedding is a form of representation of text data in a multidimensional space where semantic meaning and relations are conserved. Embeddings are very important in training machine learning models, as they provide the model with a source of knowledge about unknown data. An embedding can be trained on one's own data, or a pre-trained model can be used. In this paper, a well-known pre-trained embedding model, Sentence Transformers with the “distilroberta-base-paraphrase-v1” model, is used ? . This gives a 768-dimensional embedding for the data. Sentence Transformers were used because most symptoms, medicines and even diseases are not single words; instead they are combinations of multiple words. The distilroberta-base-paraphrase-v1 model is trained on millions of paraphrase examples and produces extremely good results for various similarity and retrieval tasks. Further, PCA was used to reduce the dimension of the data from 768 to 200, as high-dimensional data is not suitable for clustering.

Four clustering algorithms were selected as discussed, namely K-Means, spherical clustering (K-Means with cosine similarity), agglomerative clustering and GMM clustering. The experiment for the selection of the algorithms used the parameters and considerations provided below.

• K-Means - K-Means was used with the number of clusters as 3 and with ‘K-Means++’ as the method of initialization to speed up convergence.

• Spherical clustering - All parameters used were the same as for K-Means, but this uses cosine similarity whereas K-Means uses Euclidean distance.

• Agglomerative clustering - The number of clusters was taken as 3. Ward was chosen as the linkage criterion, which minimizes the variance of the clusters being merged. The Euclidean metric was used to compute the linkage.

• GMM clustering - The number of mixture components was taken as 3, with covariance type as full, which means each component has its own general covariance matrix.

3.3. Identification of Questionable Entity

3.3.1. Multilayered Clustering
An algorithm is proposed to retrieve useful information from an unstructured, preprocessed text corpus. The input to this algorithm is unstructured text, and the target output is a list of ranked words (ranking is defined in subsequent sections) from the cluster having the maximum average distance to all other clusters. Here, the distance is calculated via a novel cluster distance metric developed specifically for the type of data this paper is dealing with. The algorithm for multilayered clustering is provided below.

Algorithm 2
Multilayered Clustering

1. First, the given text corpus is clustered into N (= 3) initial clusters using K-Means. We have chosen N = 3 initially because the main focus of our study is on medical text corpora, which generally consist of three groups of words: medicines/treatments, diseases and symptoms.
2. A cluster relationship matrix (CRM) of size N × N is then created, where CRM[i][j] is the distance metric of cluster pair i, j. This distance metric indicates how distinct two clusters are with respect to each other: the higher the distance metric, the farther apart the two clusters are. A detailed description of the CRM is provided in Section-3.3.2.
3. An undirected graph is then created with each cluster as a node (0 to N−1) and M[i][j] as the weight of the edge connecting the two nodes (clusters) i and j.
4. For each node, the sum of the weights of all edges directly connected to it is calculated (there exist N−1 such edges for each node). Out of all N nodes, the two nodes for which this value is maximum and minimum are selected; call these nodes maxDisNode and minDisNode, and the corresponding clusters maxDisCluster and minDisCluster. A hyper-parameter threshold δ is then carefully chosen (this threshold can be dynamically updated and tuned with each iteration of the algorithm). If each and every value of M[i][j] in the N × N cluster relationship matrix is less than the threshold, then reclustering (Step 6) is required; otherwise ranking (Step 5) is selected.
5. Ranking - In this phase, maxDisCluster is picked, and the algorithm sorts the words in that cluster in increasing order of the Euclidean distance between each word's vector representation and the centre of the cluster; the algorithm then ends.
6. Reclustering - In reclustering, minDisCluster is picked, and the algorithm divides that cluster into two clusters, so the total number of clusters changes from N to N+1.
Accordingly, the dimension of the cluster relationship matrix changes to (N+1) × (N+1) and its values are updated accordingly. Now go to Step 4.

3.3.2. Cluster Relationship Matrix
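A minimal sketch of the CRM construction described in Algorithm 2 (Step 2) and the node-weight selection of maxDisCluster and minDisCluster (Steps 3-4) is given below. For brevity it sums plain embedding distance for each word pair, which is what the Word Mover's Distance reduces to when the "documents" compared are single words.

```python
import numpy as np

def cluster_relationship_matrix(clusters, embed):
    """clusters: list of N word lists; embed: word -> vector.
    CRM[i][j] sums the pairwise distance between every word of cluster i and
    every word of cluster j, then normalises the matrix to [0, 1]."""
    n = len(clusters)
    crm = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = sum(np.linalg.norm(embed(a) - embed(b))
                    for a in clusters[i] for b in clusters[j])
            crm[i, j] = crm[j, i] = d
    if crm.max() > 0:
        crm /= crm.max()   # higher value = more distinct pair of clusters
    return crm

def pick_clusters(crm):
    """Algorithm 2, Steps 3-4: sum the edge weights incident on each node of
    the cluster graph; return (maxDisCluster index, minDisCluster index)."""
    sums = crm.sum(axis=1)
    return int(sums.argmax()), int(sums.argmin())
```

In the full system, `embed` would be the PCA-reduced sentence-transformer embedding, and the threshold δ is compared against the entries of this matrix to decide between ranking and reclustering.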
Our proposed algorithm takes N clusters as input and gives the cluster relationship matrix as output. It is an N × N matrix, where N is the total number of clusters present at the given moment, and M[i][j] is the distance metric of cluster pair i, j. This distance metric indicates how distinct two clusters are with respect to each other; the higher the distance metric, the more distinct the two clusters are (it varies between zero and one). To calculate M[i][j] (the distance metric of cluster pair i, j), a special distance, the Word Mover's Distance (WMD), is calculated between each pair of words from cluster i and cluster j, and the results are summed up. Finally, the whole matrix is normalized to have values between 0 and 1.

Word Mover's Distance (WMD).
Term frequency-inverse document frequency (TF-IDF) and bag of words (BOW) are the two most common ways to represent documents; however, these representations are frequently near-orthogonal, so they are not suitable for document distances. Also, the distance between individual words is not captured by them. Many attempts have been made to solve this problem, most of which learned a latent low-dimensional representation of documents. Although these attempts produced a more reasonable representation of documents compared to BOW, in many cases they do not improve on the distance-based task performance of BOW. Due to these limitations, WMD was introduced to capture the distance between two documents in a much better way. In our case, since we are trying to keep words separated by less distance together in the same cluster, the WMD metric is suitable for our use case. WMD performance was evaluated using KNN on 8 document datasets and compared with BOW, TF-IDF, BM25, LSI, LDA, mSDA and CCG. The results were impressive: WMD performed best on six out of the eight datasets, and even on the two datasets where WMD was not the best performer, the error rates were very close to those of the top performers.

WMD uses word embeddings such as word2vec or GloVe, which learn semantically meaningful vector representations of words from their local co-occurrences in sentences. WMD measures the semantic distance between two documents, i.e. the minimum amount of distance that the embedded words of one document need to “travel” to reach the embedded words of another document. It does not require any hyperparameters, is highly interpretable and has high retrieval accuracy. The WMD distance between two documents is defined by the three main parts described below:

1.
1. Document representation - The document is represented as an n-dimensional vector d ∈ R^n, where n is the vocabulary size. Each element of this vector is a word's normalised frequency in the document; the vector is also known as the normalised bag-of-words (nBOW) vector. If a word i appears C_i times in the document, then d_i is found using Equation-1. This vector is usually very sparse.

d_i = C_i / Σ_{j=1}^{n} C_j      (1)

2. Semantic similarity metric - The travel cost C(i, j) from word i in one document to word j in another document is defined in Equation-2,

C(i, j) = ||X_i − X_j||      (2)

where X_i and X_j are the embedding representations of words i and j, respectively.

3. Flow matrix - The flow matrix T ∈ R^{n×n}, where n is the vocabulary size, is a sparse matrix in which T_ij ≥ 0 denotes how much of word i in one document travels to word j in another document.
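As an illustration, the three components above can be sketched in a few lines of Python. Solving for the optimal flow matrix T requires a linear-programming (transportation) solver, so this sketch uses the common relaxed lower bound in which each word's mass simply travels to its nearest word in the other document; the toy embeddings are hypothetical, and a production system would use a library implementation (e.g. gensim's `wmdistance`) over real word2vec or GloVe vectors.

```python
import math
from collections import Counter

def nbow(doc, vocab):
    """Normalised bag-of-words vector: d_i = C_i / sum_j C_j (Equation-1)."""
    counts = Counter(doc)
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

def travel_cost(x_i, x_j):
    """Travel cost C(i, j) = ||X_i - X_j|| between word embeddings (Equation-2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def relaxed_wmd(doc1, doc2, emb):
    """Lower bound on WMD: each word's nBOW mass travels wholly to its
    nearest word in the other document, instead of solving the full
    flow-matrix transportation problem."""
    vocab = sorted(set(doc1) | set(doc2))
    d1 = nbow(doc1, vocab)
    targets = set(doc2)
    return sum(d1[i] * min(travel_cost(emb[w], emb[v]) for v in targets)
               for i, w in enumerate(vocab) if d1[i] > 0)

# Hypothetical 2-d embeddings, for illustration only
emb = {"fever": (1.0, 0.0), "temperature": (0.9, 0.1),
       "cough": (0.0, 1.0), "cold": (0.1, 0.9)}
d = relaxed_wmd(["fever", "cough"], ["temperature", "cold"], emb)
# Paraphrased documents land close together: d == sqrt(0.02), about 0.14
```

Summing such word-pair distances over two clusters and normalising the result gives one entry M[i][j] of the cluster relationship matrix described above.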
Ranking:

This algorithm takes maxDisCluster as input and gives a ranked list of words from that cluster as output. The ranking is an effort to find the most representative words of a cluster: words with a higher rank represent the cluster better than words with a lower rank. Once no reclustering is required, the algorithm selects maxDisCluster (defined in Section-3.3.1) and sorts the words in it (words appearing first have a higher rank). Sorting is done using Algorithm-3.

D_i = sqrt( Σ_{j=1}^{K} (C_j − E_ij)^2 )      (3)

where K is the dimension of the embedding used.

Algorithm 3
Semantic ranking of elements in the selected cluster

1. Let the centre of the maxDisCluster be C.
2. Let the cluster have M words, and let the i-th word be W_i with embedding representation E_i.
3. For each i-th word W_i in maxDisCluster, the Euclidean distance between E_i and C is calculated; let this value be D_i, as shown in Equation-3.
4. All M words in maxDisCluster are sorted based on the distance D_i, i.e. the words having the smallest D_i come first.
5. The final sorted list is returned.
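Algorithm-3 amounts to a single sort keyed on the distance in Equation-3. A minimal sketch, in which the words, embeddings and cluster centre are hypothetical values chosen for illustration:

```python
import math

def rank_cluster_words(words, embeddings, centre):
    """Sort the words of maxDisCluster by the Euclidean distance D_i
    between each word embedding E_i and the cluster centre C
    (Equation-3): words closest to the centre come first and are the
    best representatives of the cluster."""
    def distance(word):
        e = embeddings[word]
        return math.sqrt(sum((c - x) ** 2 for c, x in zip(centre, e)))
    return sorted(words, key=distance)

# Hypothetical 2-d embeddings and cluster centre
embeddings = {"diabetes": (0.2, 0.1), "insulin": (0.3, 0.25), "sugar": (0.9, 0.8)}
ranked = rank_cluster_words(list(embeddings), embeddings, centre=(0.25, 0.15))
# ranked[0] is the most representative word of the cluster
```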
Natural language question generation is an important aspect of the Natural Language Processing field in computer science, due to recent advancements in education systems and interaction-based autonomous systems. This section of the paper focuses on the methodology followed in generating questions relevant to the medical domain when provided with either a word or a sentence D = {W_1, ..., W_n}, where n = 1 represents a single word. Here, the main objective is to maximise the conditional probability P(Q | W_n), where Q is the question and W_n is the input provided to the system from which a question needs to be generated (Equation-4).

p(Q | W_n) = Π_t p(q_t | q_{<t}, W_n)      (4)

For the generation of natural language questions, a Sequence-to-Sequence learning ? Encoder-Decoder model based on Recurrent Neural Networks (RNN) ? is used along with an improved coverage mechanism ? for generating natural language questions from the input data D. A generic attention layer provides attention discovery over multiple words and usually suffers when sentences consist of a small number of words or when the input is a single word. To overcome this issue, a Hierarchical Attention Network (HAN) ? is used along with the general attention layer; this combined attention model ensures classification at both the word level and the sentence level, so that it can identify the presence of words that carry a higher weight in the context of the overall sentence. Since HAN does not consider long-term information about previous sentences, it loses information for longer sentences. Hence, the combined attention model can select the highest attention weights whenever applicable, in order to keep the context of the sentence through longer sentences. The well-known pre-trained GloVe embedding is used, and the model is trained, validated, and tested with both the AmazonQA dataset and the PubMed dataset.

The content provided for training contains samples of both more and less important data.
Knowledge of this content is usually acquired using attention mechanisms over the sequence of data, but attention models usually face difficulty in identifying the contribution of words for smaller input sequences. To overcome the issue of attention discovery for smaller input sequences, and even for single words, a hierarchical attention network (HAN) ? proves to be a good option. It has two levels of attention mechanisms, applied at the word and sentence level, enabling it to attend differently to more and less important content. HAN calculates the attention in terms of words and sentences, hence two context vectors are used: one for word-level attention and another for sentence-level attention. HAN starts with a word-level encoder followed by an attention mechanism over words only (Equation-7); the resulting attention vectors are then passed to the sentence-level encoder followed by the sentence attention (Equation-9).

u_it = tanh(W_w h_it + b_w)      (5)

where W_w are the weights associated with the concatenated hidden states h_it and b_w is the bias.

α_it = exp(u_it^T u_w) / Σ_t exp(u_it^T u_w)      (6)

where u_it is the hidden representation of h_it and u_w is the word-level context vector. T is the total number of input sequences. The value of u_it is calculated from Equation-5.

s_i = Σ_t α_it h_it      (7)

where s_i denotes the word-level attention, α_it is calculated from Equation-6, and h_it are the concatenated hidden states of the network for a bi-directional model. For sentence-level attention, the general attention model (Equation-9) is selected.

The input and output of a question generation model are dynamic in nature, because the input can have a varying number of words, and the same holds for the output question. To deal with such inputs and to convert the variable-length input data into a fixed-length output, an encoder is used.
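The word-level attention of Equations 5-7 can be sketched with plain Python lists. The tiny hidden states, the identity projection weights and the context vector below are assumptions for illustration, not values from the trained model:

```python
import math

def word_attention(hidden_states, W_w, b_w, u_w):
    """Word-level attention of HAN (Equations 5-7) over one sentence.
    hidden_states: list of concatenated hidden vectors h_it."""
    # Equation-5: u_it = tanh(W_w h_it + b_w)
    def proj(h):
        return [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
                for row, b in zip(W_w, b_w)]
    u = [proj(h) for h in hidden_states]
    # Equation-6: alpha_it = softmax over t of u_it . u_w
    scores = [math.exp(sum(a * b for a, b in zip(u_it, u_w))) for u_it in u]
    total = sum(scores)
    alpha = [s / total for s in scores]
    # Equation-7: s_i = sum_t alpha_it * h_it
    dim = len(hidden_states[0])
    s = [sum(a * h[k] for a, h in zip(alpha, hidden_states)) for k in range(dim)]
    return s, alpha

# Hypothetical example: two hidden states, identity projection weights
h = [[0.2, 0.4], [0.6, 0.1]]
W_w = [[1.0, 0.0], [0.0, 1.0]]   # assumed weights, not learned values
s, alpha = word_attention(h, W_w, b_w=[0.0, 0.0], u_w=[1.0, 1.0])
# alpha sums to 1; s is a convex combination of the hidden states
```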
The encoder is a set of neural network layers that maps the input data D into word vectors, which are propagated through the network as hidden states H = {h_1, h_2, ..., h_n}, where D = {W_1, ..., W_n}. For this experiment, the selected encoder is a bidirectional LSTM layer ? . Two bidirectional LSTM layers are considered in this experiment, as this has already proven to work well in a similar context ? . The bidirectional LSTM layer has two hidden states, a forward state f_i and a backward state b_i for an index i, enabling the network to understand a sentence and its formation both from the start and from the end. Considering an input index i, the hidden states are concatenated to form a longer sequence of hidden states such that h_i = {f_i, b_i}. The sequence of hidden states h_i is used during decoding for the generation of the output vector q_t, as a method to identify the source and predict the next target word. The target word in the question, q_t, is calculated as the weighted sum of hidden states h_i, as expressed in Equation-8,

q_t = max[ Σ_{i=1}^{n} a_t(i) H(i), Σ_{i=1}^{n} s_i(i) H(i) ]      (8)

where a_t and s_i are shift vectors calculated according to the general attention model and the hierarchical attention model (described in Section-3.4.2), respectively.

a_t(i) = exp(h_t^T W_a h_i) / Σ_j exp(h_t^T W_a h_j)      (9)

A sequence-to-sequence model requires a decoder in order to take the fixed-length context output provided by the encoder and convert it into a variable-length output. Two LSTM layers are used as the decoder in this experiment. To avoid attending to repetitive words, the attention model provides a coverage vector for each input at every time step. The attention model must attend to the next input taking the previous inputs into consideration, using the coverage vector c at a time step t.
c_t = Σ_{t'=0}^{t−1} a_{t'}      (10)

Further, the coverage vector needs to be integrated with the concatenated hidden states. This is done using a tanh operation over the hidden states h_i with a point-wise addition of the coverage vector c_t scaled by the coverage weights w_c to be learned.

h_i = tanh(h_i + w_c c_t(i))      (11)

The context vector s_t and the outputs of the previous layers [z_1, z_2, ..., z_{t−1}] in the decoder are fed to the next layer at the current time step to make the prediction. The prediction is made using a fully connected layer with a softmax classifier. The attention hidden state h̃_t is calculated as a tanh activation over the weights W_x to be learned, applied to the context vector from the source s_t and the next target state q_t.

h̃_t = tanh(W_x [s_t; q_t])      (12)

Finally, the next hidden state is formed using an LSTM cell with the previous hidden state h_{t−1} and the output from the previous state z_{t−1} as inputs.

h_t = LSTM(z_{t−1}, h_{t−1})      (13)

In the training phase, the generated model is trained using the dataset corpus containing sample questions and their respective answers. Considering the total number of training samples as T, the corpus data is defined as C_d = {(a_i, q_i)}_{i=1}^{T}. The model is trained by providing the answers as input. For a particular sample answer, the model predicts the question, which is then optimised by minimising the negative log-likelihood of the training corpus (Equation-14). For word-level training, the same model is retrained taking each single word in the sample answer (common words excluded) as input (Equation-15).

M_t = Σ_{i=1}^{T} −log p(q_i | a_i)      (14)

where M_t is the training model and the rest carry their usual meanings.
M_t = Σ_{i=1}^{T} Σ_{j=1}^{G_i} −log p(q_i | a_ij)      (15)

where M_t is the training model and G_i is the maximum number of words considered as input for the i-th corpus answer.

To generate the natural language questions from the model, the prediction vector mapping is performed. Due to the limited embedding corpus, many corpus words will be new to the model; these unknown tokens are substituted using the average attention weight of the words from the source sentence.

In this phase, after the generation of a question, the user is required to provide an answer to it. The answer needs to be evaluated by the system, so that the information and context related to the question are either kept or eliminated. This phase is also responsible for the continuation or termination of the current information-retrieval session. To perform these operations, the following algorithm is designed.
Algorithm 4
Elimination and Termination

1. First, the user input A_u (answer) is preprocessed to remove unnecessary and irrelevant tokens.
2. A sentence concatenation of the answer and the question is performed, removing question words like what, when, where, etc., to get a modified A_u.
3. A similarity measurement WMD(a_i, A_u) is performed for the i-th prediction.
4. Considering the same hyper-parameter threshold δ as defined in Section-3.3.1, if WMD(a_i, A_u) ≥ δ then go to step-5, else go to step-6.
5. The corpus data a_i is kept intact, and the steps continue with the selected list; the rest of the clusters are removed. Go to step-6.
6. The corpus data a_i is removed, and the Reclustering step (Algorithm-2) is performed, eliminating the row and column corresponding to the removed cluster. If re-clustering is required, the steps continue with the rest of the clusters after removing the corresponding CRM values (refer to Section-3.3.1); else the process is terminated.
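A minimal sketch of this elimination step, with the Word Mover's Distance of steps 3-4 replaced by a hypothetical word-overlap stand-in (`toy_wmd`) so that the control flow of the algorithm is visible; the corpus entries and threshold are made up:

```python
QUESTION_WORDS = {"what", "when", "where", "which", "who", "why", "how"}

def merge_answer(question, answer):
    """Steps 1-2: preprocess the user answer and concatenate it with
    the question, dropping question words and non-alphabetic tokens."""
    tokens = [t for t in (question + " " + answer).lower().split()
              if t not in QUESTION_WORDS and t.isalpha()]
    return " ".join(tokens)

def filter_corpus(corpus, question, answer, wmd, delta):
    """Steps 3-6: keep corpus entries a_i with wmd(a_i, A_u) >= delta,
    remove the rest; terminate when nothing is left. `wmd` is a
    pluggable distance; the paper uses the Word Mover's Distance."""
    a_u = merge_answer(question, answer)
    kept = [a for a in corpus if wmd(a, a_u) >= delta]
    terminate = len(kept) == 0
    return kept, terminate

def toy_wmd(a, b):
    """Hypothetical stand-in: word-overlap dissimilarity in [0, 1]."""
    wa, wb = set(a.split()), set(b.split())
    return 1.0 - len(wa & wb) / max(len(wa | wb), 1)

corpus = ["fever and cough", "broken arm pain"]
kept, done = filter_corpus(corpus, "What symptoms do you have",
                           "fever and cough", toy_wmd, delta=0.9)
```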
4. Experiments and Results

This paper has various experiments associated with the different sections of the work. Hence, after providing brief information about the experimental environment and a dataset description, the results section is divided into a few categories to improve readability:

1. Selection of clustering algorithm: The clustering algorithm plays a vital role in the entity selection phase of the work. To select a particular clustering algorithm in the context of our work, a separate experiment is performed to evaluate the performance of various clustering algorithms.
2. Results for Question Generation: Results associated with the generation of medical questions are presented in this section. Expert opinion is also taken into consideration for question quality evaluation.
3. Comparison with the Existing Approaches: A performance comparison of the proposed system with existing works using the same data is performed.
The specification of the system used is as follows: O.S. - Windows 10 Professional, CPU - AMD® Ryzen™ 7-3700X Processor, RAM - 32GB DDR4, GPU - NVIDIA GeForce® GTX 1080 Ti. The Windows 64-bit version of Python ? with the IPython notebook editor ? is used throughout the experiment. Important modules used in the experiment include tensorflow ? and packages from NumPy ? , SciPy ? , scikit-learn ? and Matplotlib ? .

Two different datasets are used in this section, namely the Pubmed 200K dataset ? and the Amazon QA dataset ?? , to train the Recurrent Neural Network models at two different levels. PubMed has a very rich repository of medical data and research articles; in contrast, the Amazon QA dataset has a good collection of opinion question-and-answer data for the actual natural language question formation phase of the experiment. To test the model, the labelled AmazonQA dataset is used. After training and testing the model with varying parameters, the model providing the highest accuracy under K-fold cross-validation is selected. This model is then fed with the output word or sentence from the proposed method of identification of the questionable entity, as discussed in Section-3.3; in turn, the model is capable of providing relevant questions from the sentences of the article.

Dataset     Training  Validation  Testing
Pubmed      184472    46654       34195
Amazon QA   215325    65487       45574

Table 1: Dataset Statistics
Four top-performing clustering algorithms for text-related data are selected, one from each category of algorithms. Testing has then been performed to select a particular algorithm suitable for this work. The visualisation of each of the clustering algorithms is shown in Fig-2. From the visualisation, it is seen that all of the clustering algorithms are able to detect the clusters with minor variations.

(a) Sphere Clustering  (b) KMeans Clustering  (c) Agglomerative  (d) GMM
Figure 2: Clustering Visualisation
BCubed precision, recall, and F-score are used for evaluation. The BCubed metric is widely used as a standard metric for text clustering problems. It satisfies four formal constraints: Cluster Homogeneity, Cluster Completeness, Rag Bag, and Cluster size vs. quantity. The average BCubed precision is the average precision of all items in the distribution; similarly, the average BCubed recall is the average recall of all items in the distribution. The BCubed precision of an item is the ratio of the number of items in its cluster that have the same category as the item (including itself) to the total number of items in its cluster. The BCubed recall of an item is the ratio of the number of items in its cluster that have the same category as the item (including itself) to the total number of items in its category. For performance evaluation, the general precision, recall and F-score metrics used are provided in Equations 16, 17 and 18.
Precision (P) = TP / (TP + FP)      (16)

Recall (R) = TP / (TP + FN)      (17)

F-Score (F1) = 2 × P × R / (P + R)      (18)

Algorithm  Precision  Recall  F-score
Sphere clustering (KMeans with cosine similarity)
KMeans with Euclidean
Agglomerative Clustering
GMM Clustering
Table 2: Evaluation of clustering algorithm on medical texts
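The per-item BCubed computation described above can be sketched as follows; the four-item toy assignment of clusters and gold categories is hypothetical:

```python
def bcubed_scores(clusters, categories):
    """Average BCubed precision, recall and F-score.
    For each item i: precision_i = same-category items in i's cluster
    divided by the cluster size; recall_i = same-category items in
    i's cluster divided by the category size (both counts include i)."""
    n = len(clusters)
    precisions, recalls = [], []
    for i in range(n):
        same_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        same_both = [j for j in same_cluster if categories[j] == categories[i]]
        category_size = sum(1 for j in range(n) if categories[j] == categories[i])
        precisions.append(len(same_both) / len(same_cluster))
        recalls.append(len(same_both) / category_size)
    p = sum(precisions) / n
    r = sum(recalls) / n
    f = 2 * p * r / (p + r)
    return p, r, f

# Hypothetical toy assignment: four items, one of category "a" misplaced
clusters   = [0, 0, 1, 1]
categories = ["a", "a", "a", "b"]
p, r, f = bcubed_scores(clusters, categories)
```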
The evaluation results shown in Table-2 for the clustering algorithms show K-Means to be the best performing model, and hence it is selected for the later stages of the model.

4.5. Results for Question Generation

4.5.1. Model training and testing performance
The test is performed for both word-level and sentence-level question formation. For word-level testing of the model performance, the most important word from the sentence is selected and fed into the system; for sentence-level testing, the whole sentence is passed as input. For the training and validation statistics, model accuracy and perplexity are considered. The accuracy is measured as the total number of words correctly predicted and matched exactly with the labelled data, while model perplexity reflects the errors present in the prediction model. The objective is always to optimise the model by minimising the perplexity while maximising the accuracy. The word-level training statistics are provided in Figure-3 and the sentence-level training statistics in Figure-4. The optimal number of epochs is found to be 20, and hence the model is trained and validated with 20 epochs, as represented in the figures. The training and validation statistics confirm that with longer inputs there is a higher chance of getting close to the actual data. Since there are many factors associated with text data evaluation, such as semantic relations, grammar and synonyms, that are not captured by this measure, a further metrics-based evaluation of the model is also performed.
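For concreteness, perplexity is conventionally computed as the exponential of the average negative log-likelihood of the predicted tokens; a minimal sketch, where the token probabilities are made-up values:

```python
import math

def perplexity(token_probs):
    """Perplexity as exp of the average negative log-likelihood of the
    probabilities the model assigned to the correct tokens; lower is
    better, and 1.0 means perfectly confident correct predictions."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token of a 4-token
# question is as uncertain as a uniform choice among 4 words:
ppl = perplexity([0.25, 0.25, 0.25, 0.25])   # 4.0
```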
(a) Training Statistics  (b) Validation Statistics

Figure 3: Model perplexity vs accuracy statistics for single words

(a) Training Statistics  (b) Validation Statistics

Figure 4: Model perplexity vs accuracy statistics for sentences
Evaluation of natural language machine learning models goes far beyond matching the exact similarity of the prediction to the labelled data, because natural language can be framed in various ways and still be correct even when it does not match the original labelled data. This makes it difficult to analyse the models using general metrics such as Accuracy or F1 score. To overcome this issue, metrics such as METEOR and BLEU have been proposed in the literature and have proven to be quite good for this type of task. The Bilingual Evaluation Understudy (BLEU) is a score for comparing predicted text data to a list of labelled reference texts ? . It is used to evaluate a wide range of text generation models in natural language processing tasks. Although these evaluation models are not perfect, they give a much more reliable evaluation than the general metrics. BLEU is computationally inexpensive and widely adopted; further, it is language independent, which makes model evaluation easier. In this experiment, the medical corpus has a higher frequency of bi-gram data, and hence considering bi-grams in the evaluation is required; so we used the cumulative n-gram BLEU score with n=1 (BLEU-1) and n=2 (BLEU-2). METEOR ? is similar to BLEU in many aspects, but in addition it is designed to consider synonyms and word stems for a more realistic evaluation, and it correlates highly with human evaluations. The model prediction statistics with the evaluation metrics for words and sentences are provided in Table-3. From the results, it is clearly visible that the model always benefits from a sentence rather than a single word, due to the lack of context in the latter; but this does not mean that the model fails to generate an accurate question in spite of it. The METEOR score for words tells that the sentence formation is good, but the BLEU scores tell that it does not match the labelled data properly.
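A minimal sketch of the cumulative n-gram BLEU score used here, restricted to a single reference and without smoothing (library implementations such as nltk's `sentence_bleu` add both); the example strings are hypothetical:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cumulative_bleu(candidate, reference, max_n):
    """Cumulative BLEU-n: geometric mean of the modified 1..n-gram
    precisions, scaled by the brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_sum = 0.0
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped counts: a candidate n-gram scores at most as often
        # as it occurs in the reference
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(sum(c_counts.values()), 1)
        if clipped == 0:
            return 0.0
        log_sum += math.log(clipped / total) / max_n
    # Brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_sum)

cand = "what is the dosage of paracetamol"
ref  = "what is the dosage of paracetamol"
bleu1 = cumulative_bleu(cand, ref, 1)   # identical strings score 1.0
bleu2 = cumulative_bleu(cand, ref, 2)
```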
As a side note, the results are produced using GloVe embeddings trained on six billion tokens with 100 dimensions. Increasing the embedding data and dimensionality will help by reducing the number of unknown tokens generated and hence will provide much better results. All the experiments performed have been modelled using the same data corpus and embedding for a fair evaluation.

Input Sequence  METEOR  BLEU-1  BLEU-2
Word
Sentence
Table 3: Model performance evaluation with standard metrics for words and sentence
The state-of-the-art existing approaches for question generation consider a Bi-directional LSTM model with general attention applied at the hidden states and a coverage mechanism for removing redundant information (BiLSTM+GA+C). In contrast, our model utilises a word- and sentence-level hierarchical attention mechanism (BiLSTM+HA+C). The experiment is performed separately for words and sentences, to evaluate the model performance at both levels. The results shown in Table-4 represent the word-level model comparison; the hierarchical attention mechanism with word-level embedding shows a much higher score on all the evaluation metrics.

Algorithm  METEOR  BLEU-1  BLEU-2
BiLSTM+GA+C
BiLSTM+HA+C
Table 4: Model comparison with standard metrics for word
The results shown in Table-5 represent the sentence-level model comparison. In this experiment, the results are equivalent and do not differ much. Although the model is similar for sentence-level question generation, the small differences in the scores are mainly due to the fact that many questions can be formed from a single input, and hence the sequences may vary. Overall, our model performs much better when considering both the word-level and sentence-level evaluation.

Algorithm  METEOR  BLEU-1  BLEU-2
BiLSTM+GA+C
BiLSTM+HA+C
Table 5: Model comparison with standard metrics for sentences
Question generation is the process of generating a question from either a single word or a sentence. The testing is based on labelled data, which is sufficient to compare various models in the machine learning field. But, in order to identify the true capability of the question generation model and to provide accurate, error-free and context-related questions, human evaluations are also considered. We selected 15 volunteers in our lab with a good English background for the offline evaluation of the model and 100 volunteers for an online evaluation, making a total of 115 volunteers. For the human evaluation of the model performance, three categories are selected, with scores ranging between 0 and 1, as described below:

1. Question Selection - This parameter is scored according to the system's ability to pick correct questions based on the initial input query provided to the system. It is, in fact, the relevancy of the question according to the input context.
2. Question Formation - This score defines the quality of the question generation portion of the system. The quality parameters considered are sentence formation and grammatical correctness.
3. Responsiveness - Responsiveness is the system's ability to interact with the user. It is defined as the time taken by the system to respond to the user once an input is provided. This does not include the overall duration of the session; rather, it is defined for each interaction made by the system.

The results of the human evaluation are provided in Table-6. The human evaluation shows that the system was able to generate semantically correct questions with appropriate context, both for word-level and sentence-level inputs. Further, the responsiveness of the system is also acceptable.
Human Evaluation
Parameters           Word Rating  Sentence Rating  Overall Rating
Question Selection   0.5687       0.6470           0.7457
Question Formation   0.3765       0.7352           0.7857
Responsiveness       0.9665       0.7789           0.8595
Table 6: Human evaluation for question generation model
5. Conclusion and future scope
This paper presents a novel approach to information retrieval using an interaction-based mechanism and provides sufficient motivation for the need for such systems. A totally new framework is built for the identification of the questionable entity, and with the available evaluation methods it was found to perform as intended. Further, the addition of a word-level attention mechanism in the question generation phase has also proven to be effective