Reading Comprehension using Entity-based Memory Network
Xun Wang, Katsuhito Sudoh, Masaaki Nagata, Tomohide Shibata, Daisuke Kawahara, Sadao Kurohashi
NTT Communication Science Laboratories, Kyoto, Japan (wang.xun, sudoh.katsuhito, [email protected])
Kyoto University, Kyoto, Japan (shibata, dk, [email protected])
Abstract.
This paper introduces a novel neural network model for question answering, the entity-based memory network. It enhances neural networks' ability to represent and compute over information spanning a long period by keeping records of the entities contained in text. The core component is a memory pool which comprises entities' states. These states are continuously updated according to the input text. Questions with regard to the input text are used to search the memory pool for related entities, and answers are then predicted based on the states of the retrieved entities. Compared with previous memory network models, the proposed model is capable of handling fine-grained information and more sophisticated relations based on entities. We formulated several different tasks as question answering problems and tested the proposed model. Experiments report satisfying results.
Keywords:
Text Comprehension, Entity Memory Network, Question Answering
1 Introduction

It has long been a major concern of the natural language processing (NLP) community to enable computers to understand text as humans do. Many NLP tasks have been intensely studied towards this goal, such as information retrieval, semantic role labelling, textual entailment and so on. Among them, question answering is of great importance and has been a huge challenge. A question answering (QA) task is to predict an answer for a given question with regard to related information. It can be formulated as a map $f: \{\text{related text, question}\} \rightarrow \{\text{answer}\}$ [12]. To predict the correct answer, computers are first required to "understand" the text.

Shallow features such as bag-of-words, token frequencies and so on are unable to capture the rich information in text. Outside knowledge is often required for better performance. Traditional approaches rely heavily on rules or structured knowledge developed by experts or crowdsourcing [21,18]. Relational databases constructed from predicate-argument triples also serve as a source of knowledge [14,23]. Problems with these approaches lie in at least two aspects. Firstly, the construction of structured knowledge is both time- and money-consuming. Secondly, it is a huge challenge to design models flexible and powerful enough to learn to employ the extracted information [6]. Thus the progress of using machine learning for QA has been slow.

Recently the emergence of deep neural networks and distributed representations sheds light on such methods. Representing all the features using vectors provides a unified representational form for all the necessary information. Outside knowledge learnt from large corpora can be encoded into word vectors. Information obtained locally is also represented using vectors. Deep neural network models with many layers are designed to fuse information obtained from different sources [4,7].

A notable breakthrough is to employ memories in neural networks.
The representative model is the memory network [28]. The key idea of the memory network is to store historical sentences in a memory pool. The model is trained to look for related sentences when a question comes. Then, based on the related sentences, an answer is predicted for the question. The memory network remembers all sentences it has read so that it can look for useful ones when facing questions. This model and its variants have proved useful in a series of tasks [28,25,1].

One problem with memory networks is that using sentence vectors as the elementary units of information makes it difficult to fully exploit the information contained in text. It is often the case that in a long sentence, only part of the sentence is related to the questions. Therefore taking the whole sentence into consideration makes it hard to focus on the information that is related to the questions. Besides, learning sentence representations is itself a growing field.

We propose to focus on entities rather than sentences. Entities refer to anything that exists in reality or is purely hypothetical. We assume that text can be projected to a world of entities. The key to conducting comprehension and reasoning over text is to identify the entities it contains and analyze the states of these entities and the relations between them. We keep a memory pool of entities and use the input sentences to update the states of these entities. Questions are answered based on the states of related entities. The proposed model deals with fine-grained information by using entities. The introduced model is named the entity-based memory network. It is tested on several datasets, including the toy bAbI dataset [27], the large movie review dataset [15] and the machine comprehension test dataset [20]. Results show that the entity-based memory network achieves satisfying results. The rest of the paper is organized as follows: Section 2 reviews previous work. Section 3 describes our approach and elaborates the details.
Section 4 presents the experiments and the analysis. Section 5 concludes the paper.

2 Related Work

QA has a long history and many methods have been developed to address this problem [19,14,2]. Recently the development of neural models has led to a series of works on question answering [5,4,7]. Closely related to our work is the Memory Network (MNN) [28]. The memory network contains four parts: the input module which converts sentences into vectors, the memory which keeps all sentence vectors, a retrieval module, and a response module. Whenever a question comes, the question is turned into a vector and the question vector is used to search the memory for related sentences. The response module is used to predict an answer based on the related sentences. The core component is the memory pool that stores all the input sentences so that they can be retrieved later to answer questions. This model contains several neural networks which are jointly optimized according to the task. Experiments on a toy dataset show that this model is able to answer simple questions according to the input text. Fig. 1(a) illustrates the memory network.

(a) the Memory Network (b) the Entity-based Memory Network
Fig. 1: Comparison of the memory network and the proposed entity-based memory network model. Sentences are decomposed into entities and then stored in the memory for later retrieval.

Later, [10] propose the Dynamic Memory Network (DMNN), which introduces the attention mechanism into the memory network model. When retrieving memories, the location of the next related sentence is predicted according to the related sentences identified in the previous iterations. Using the attention mechanism, they obtain further improvements. Other works [25,1] propose variants of MNN by introducing additional memory network modules. These works focus, without exception, on storing sentence vectors for later retrieval. Most of them have been tested on the toy dataset bAbI [27] and are reported to have achieved satisfying results. When further tested on practical tasks, these models also show the ability to produce results as good as or better than existing state-of-the-art systems. Memory networks store sentence vectors as memories and have the advantage of processing information at a large scale. The experiment results they report on a series of tasks are concrete proof.

But there is also a problem with memory networks, as we have stated. Taking sentence vectors as input means that it is difficult to further analyze and take advantage of relations between smaller text units, such as entities. For example, when an entity $e_a$ of sentence $A$ interacts with another entity $e_b$ of sentence $B$, we have to take the whole sentences $A$ and $B$ into consideration rather than just focus on $e_a$ and $e_b$. This inevitably brings in noise and damages the comprehension of the text. The failure to obtain fine-grained information prevents further improvements.

3 Entity-based Memory Networks

In the proposed entity-based model, we focus on entities directly and avoid bringing in redundant information. We first use an example to illustrate how the model works.
Below we show a piece of text which contains 4 sentences and 2 questions. There are 7 entities in total, all of them underlined.
1) Mary moved to the bathroom.
2) John went to the hallway.
3) Where is Mary? Bathroom.
4) Daniel went back to the hallway.
5) Sandra moved to the garden.
6) Where is Daniel? Hallway.
Fig. 2: An example from bAbI, a toy dataset for question answering [27].

This text is elaborated around the 7 entities. It describes how their states change (i.e., the change of a character's location) as the story goes on. Note that here all the entities are concrete concepts that exist in reality. It is also possible to talk about abstract concepts.

The core of the proposed model is entities. We take Sentence 1 ($S_1$) as input and extract the entities it contains, {Mary, bathroom}. Vectors representing the states of these entities are initialized using pre-learned word embeddings $\{\vec{Mary}, \vec{bathroom}\}$ and stored in a memory pool. Meanwhile, we turn $S_1$ into a vector ($\vec{S_1}$) using an autoencoder model. Then we use the sentence vector $\vec{S_1}$ to update the entities' states $\{\vec{Mary}, \vec{bathroom}\}$. The goal is to reconstruct $\vec{S_1}$ solely from $\{\vec{Mary}, \vec{bathroom}\}$. In the same way, we process the following text ($S_2$) and the entities it contains (John, hallway) until we encounter a question ($S_3$). $S_3$ is converted into a vector ($\vec{S_3}$) following the same method that processes the previous input text. Then, taking $\vec{S_3}$ as input, we retrieve related entities from the memory, which now stores all the entities (Mary, bathroom, John, hallway) that appear before $S_3$. The related entities' states are then used to produce a feature vector. In this case, Mary and bathroom are related to the question and their states are used to construct the feature vector. Note that the current states of the two entities (Mary and bathroom) are different from their initial values due to $S_1$. Based on the feature vector, we then use another neural network model to predict the answer to $S_3$.

The model monitors the entities involved in the text and keeps updating their states according to the input. Whenever we have a question with regard to the text, we check the states of entities and predict an answer accordingly. The proposed model comprises 4 modules, as shown in Fig. 3.
Each module is designed for a unique purpose and together they construct the entity-based memory network model. Note that the sentence vector is not used to answer questions directly, and it is also plausible to use other models to learn the sentence representation.
1. I: Input module. Takes a sentence as input and turns it into a vector. Meanwhile, it extracts all the entities the sentence contains. Questions are also processed using this module.
2. G: Generalization module. Updates the states of related entities according to the input. For entities that are not contained in the memory pool, it creates a new memory slot for each of them and initializes these slots using pre-learned word embeddings.
3. O: Output feature module. Triggered whenever a question arrives. It retrieves related entities according to the input question and then produces an output feature vector accordingly.
4. R: Response module. Generates the response according to the output feature vector.

Fig. 3: Architecture of the entity-based memory network. The model is divided into four modules, which are shown in the figure using squares.
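The interplay of the I and G modules can be sketched as a small memory pool of entity states. All names here (`EntityMemoryPool`, `get_or_create`, `update`) are our own illustration, not from the paper; the zero-vector initialization stands in for the pre-learned GloVe embeddings, and the additive update stands in for the learned gradient step on the reconstruction loss.

```python
class EntityMemoryPool:
    """Illustrative sketch of the memory pool shared by the I and G modules."""

    def __init__(self, dim=2):
        self.dim = dim
        self.states = {}  # entity surface form -> state vector

    def get_or_create(self, entity, init=None):
        # G module: an unseen entity gets a fresh slot, initialized from a
        # pre-learned embedding (a zero vector stands in for GloVe here).
        if entity not in self.states:
            self.states[entity] = list(init) if init is not None else [0.0] * self.dim
        return self.states[entity]

    def update(self, entities, delta):
        # G module: shift the states of the sentence's entities; in the real
        # model this is a gradient step on the reconstruction loss of Eq. (2).
        for e in entities:
            state = self.get_or_create(e)
            self.states[e] = [x + d for x, d in zip(state, delta)]
```

Reading the bAbI example above would then amount to creating slots for Mary, bathroom, John and hallway and updating them sentence by sentence.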
Here we present a formal description of the proposed model. Assume we have sentences $S_1, S_2, \dots, S_n$ whose entities are annotated in advance as $e_1, e_2, \dots, e_m$.

Input Module
We first turn each sentence $S_i$ into its vector representation:

$$\mathbf{S}_i = f_1(S_i) \quad (1)$$

Generalization Module
For a sentence $S_i$, we collect all the entities it contains, $\{e_{i1}, \dots, e_{ik}, \dots, e_{ij}\}$. These entities' states $\{e_{ik}\}$ are simultaneously updated according to $\mathbf{S}_i$ as follows:

$$\{e_{ik}\} = \arg\min_{\{e_{ik}\}} |\mathbf{S}'_i - \mathbf{S}_i|; \quad \mathbf{S}'_i = f_2(e_{i1}, \dots, e_{ik}, \dots, e_{ij}) \quad (2)$$

$f_2$ reconstructs $\mathbf{S}_i$ using only the states of $S_i$'s entities $\{e_{ik}\}$. $\{e_{ik}\}$ are updated to minimize the difference between $\mathbf{S}'_i$ and $\mathbf{S}_i$. Recall that $\mathbf{S}_i$ is generated by $f_1$ with the whole sentence $S_i$ as input. We compress the information carried by $S_i$ into a vector $\mathbf{S}_i$ and then unfold it into $\{e_{ik}\}$.

After processing these sentences, we obtain a memory pool consisting of entities whose states are regarded as capable of representing the information carried by the input text.

Fig. 4: The Generalization Module: Using $S_1$ as an example, the autoencoder converts the sentence into a vector $\mathbf{S}_1$, and the entities contained in $S_1$ are used to reconstruct the sentence vector.

Output Feature Module
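A minimal sketch of this reconstruction objective: here a mean over entity states stands in for the GRU-based $f_2$ defined later, so that the gradient of the squared reconstruction error with respect to each entity state has a closed form. Both function names and the learning rate are our own illustration.

```python
def reconstruct(entity_states):
    """Stand-in for f2: average the entity state vectors (the paper uses a
    GRU over the entity sequence; a mean keeps the sketch tiny)."""
    n, dim = len(entity_states), len(entity_states[0])
    return [sum(s[d] for s in entity_states) / n for d in range(dim)]

def update_states(entity_states, sentence_vec, lr=0.5):
    """One gradient step on |S'_i - S_i|^2 with respect to each entity state.
    Through the mean, the per-dimension gradient is 2 * (S' - S) / n."""
    s_prime = reconstruct(entity_states)
    n = len(entity_states)
    grad = [2.0 * (sp - sv) / n for sp, sv in zip(s_prime, sentence_vec)]
    return [[x - lr * g for x, g in zip(state, grad)] for state in entity_states]
```

Each step moves the entity states so that their reconstruction gets closer to the sentence vector, which is exactly the minimization in Eq. (2).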
Question $q$ is turned into a vector $\mathbf{q} = f_1(q)$, and then $\mathbf{q}$ is used to retrieve related entities from the memory pool:

$$O_1 = Q_1 = \mathbf{q}, \quad E_1 = \phi$$
$$Q_j = h(Q_{j-1}, e_{j-1}), \quad j = 2, 3, \dots$$
$$e_j = \arg\max_{e_k \notin E_{j-1}} p(e_k, Q_{j-1}); \quad E_j = E_{j-1} \cup \{e_j\}$$
$$O_j = u(O_{j-1}, p(e_j, Q_{j-1}) * e_j) \quad (3)$$

At first, $Q_1$ is initialized using $\mathbf{q}$. In the $j$-th iteration, $p(e_k, Q_{j-1})$ is the probability (or score) of $e_k$ being selected to compose the feature vector for answering $q$. Note that every entity is considered only once. In $Q_j$, we consider the entity selected in the previous iteration. $Q$ is kept updated using $e_j$ and $p$.

After several iterations, we use the final $O_m$ as the output feature vector $O$. Note that if $O_j$ does not change much between iterations, we omit the remaining loops. This early-stop strategy helps reduce the time cost.

Fig. 5: The Output Feature Module: In each iteration, entities are assigned different scores which indicate their importance in constructing the output feature vector.
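The iterative retrieval with early stopping can be sketched as follows. A dot product stands in for the learned score $p(e, Q)$ and a running average for the update $u(\cdot)$; both are hypothetical simplifications of the trained networks, and the function names are our own.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(q_vec, memory, max_iters=3, tol=1e-6):
    """O-module sketch: greedily add the best-scoring unused entity in each
    iteration; stop early once the feature vector barely changes."""
    selected, O = [], list(q_vec)
    candidates = dict(memory)       # unused entities; every entity used once
    for _ in range(max_iters):
        if not candidates:
            break
        # pick the unused entity whose state best matches the question vector
        name = max(candidates, key=lambda e: dot(candidates[e], q_vec))
        state = candidates.pop(name)
        selected.append(name)
        new_O = [(o + s) / 2.0 for o, s in zip(O, state)]   # u(.) stand-in
        if max(abs(a - b) for a, b in zip(new_O, O)) < tol:  # early stop
            O = new_O
            break
        O = new_O
    return selected, O
```

With a toy memory of the bAbI entities, a question vector close to Mary's state would pull Mary out first, mirroring Fig. 5.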
Response Module
Then we decide the answer using $a(q) = v(O)$. $a(q)$ produces a vector in which each item corresponds to one word in the vocabulary; $a(q)_i$ indicates the probability of word $i$ being the correct answer. We choose the one with the highest probability. Models like the recurrent neural network can be used to output a sentence as the answer.

Training

This is a supervised model and requires annotated data for training. The training data contains the input text, questions and answers. We also need all the entities, and the entities that are related to the answer, to be labeled.

We define the function forms for training as follows. For $f_1$, many models, like the recurrent neural network, the recursive neural network and so on [16,24,11], can be used to convert a sentence into a vector. Here we use a Long Short-Term Memory (LSTM) autoencoder [13], which takes a word sequence as input and outputs the same sequence. $f_2$ takes a list of entity states as input and tries to reconstruct $\mathbf{S}_i$. We use the Gated Recurrent Unit (GRU) [3]:

$$\mathbf{S}_i^k = \tanh(GRU(\mathbf{S}_i^{k-1}, e_{ik})); \quad \mathbf{S}'_i = \mathbf{S}_i^j \quad (4)$$

A GRU can be represented as follows:

$$z_t^j = \delta(W_z x_t + U_z h_{t-1})^j$$
$$r_t^j = \delta(W_r x_t + U_r h_{t-1})^j$$
$$\bar{h}_t^j = \tanh(W x_t + U(r_t \circ h_{t-1}))^j$$
$$h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j \bar{h}_t^j \quad (5)$$

$\circ$ represents element-wise multiplication. $z_t^j$ and $r_t^j$ are two gates controlling the impact of the historical $h_{t-1}^j$ on the current $h_t^j$. The GRU takes $x_t$ as input and updates the state of the neuron to $h_t^j$. Compared with the LSTM, which it often replaces, it simplifies the computation while still keeping a memory of previous states. Therefore it takes less time to train a GRU than an LSTM.

Our goal is to minimize the loss $|\mathbf{S}'_i - \mathbf{S}_i|$. Using stochastic gradient descent, we are able to train $f_2$ and also update $\{e_{ik}\}$. Note that the input module and the generalization module do not interact with the remaining modules.
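The gate equations in Eq. (5) can be checked with a tiny per-dimension implementation. For readability the weights here are diagonal (one scalar per dimension) rather than full matrices, which is our own simplification; the gate arithmetic is otherwise the same.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    """Per-dimension GRU following Eq. (5): update gate z, reset gate r,
    candidate state h_bar, then an interpolation between old and candidate."""
    h_new = []
    for j in range(len(h_prev)):
        z = sigmoid(W_z[j] * x_t[j] + U_z[j] * h_prev[j])          # update gate
        r = sigmoid(W_r[j] * x_t[j] + U_r[j] * h_prev[j])          # reset gate
        h_bar = math.tanh(W[j] * x_t[j] + U[j] * (r * h_prev[j]))  # candidate
        h_new.append((1.0 - z) * h_prev[j] + z * h_bar)
    return h_new
```

With all weights at zero, both gates sit at 0.5 and the candidate at 0, so the state simply decays by half each step, a quick sanity check that the interpolation in the last line is wired correctly.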
Thus they can be trained in advance.

The output feature module checks the memory pool repeatedly to select entities to form a feature vector:

$$Q_j = \tanh(GRU(Q_{j-1}, e_{j-1}))$$
$$e_j = \arg\max p(e_j, Q_{j-1}) = \arg\max \mathrm{sigmoid}(W * GRU(e_j, Q_{j-1}) + b)$$
$$O_j = \tanh(GRU(O_{j-1}, p(e_j, Q_{j-1}) * e_j)) \quad (6)$$

To generate the final answer, we use a simple neural network which takes the feature vector $O$ as input and predicts a word as output: $p_w = v(O) = \mathrm{softmax}(\tanh(W' * O + b))$. The word with the highest probability is selected. If a sentence is to be generated, we use the GRU to update $O$ and then generate the sentence $\{w_*\}$ as follows:

$$p_w^{i-1} = \mathrm{softmax}(\tanh(W' * O_{i-1} + b))$$
$$w_{i-1} = \arg\max p_w^{i-1}$$
$$O_i = \tanh(GRU(O_{i-1}, w_{i-1})) \quad (7)$$

Similar to [28], we use the stochastic gradient descent algorithm to minimize the loss function shown in Equation (8) over the parameters. For an input $S_i$ and a given question $q$ annotated with the correct answer word $a$ and related entities $\{e_r\}$, the loss function is:

$$\sum_{i \neq r} \max(0, \gamma - (p(e_r, q) - p(e_i, q))) + \sum_{l \neq a} \max(0, \gamma - (p^{word}_a - p^{word}_l)) + \|\Theta\|^2 \quad (8)$$

Here $\gamma$ is the margin and $\|\Theta\|^2$ is the squared sum of all parameters, used for regularization. Note that $\Theta$ does not include the parameters of $f_1$ and $f_2$; their parameters and the states of entities are learned as described above. Word vectors used to initialize entity states and words in the autoencoder come from GloVe [17]. The dimension is set to 50.

The model requires entities to be annotated in advance. In this work, we treat each noun and pronoun as an entity. Different words are regarded as different entities for simplicity. This strategy saves us the effort of entity resolution, which is a challenge for many languages. It also makes possible the application of the proposed model to entity resolution.
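A direct transcription of the loss in Eq. (8), assuming the entity and word scores have already been computed by the networks; `theta_sq` is the precomputed squared parameter norm, and the function name and argument layout are our own illustration.

```python
def ranking_loss(entity_scores, related_idx, word_scores, answer_idx,
                 theta_sq, gamma=0.1):
    """Eq. (8) sketch: a margin term ranking the related entity above all
    other entities, the same margin term ranking the answer word above all
    other words, plus L2 regularization."""
    hinge = lambda x: max(0.0, x)
    ent = sum(hinge(gamma - (entity_scores[related_idx] - s))
              for i, s in enumerate(entity_scores) if i != related_idx)
    word = sum(hinge(gamma - (word_scores[answer_idx] - s))
               for l, s in enumerate(word_scores) if l != answer_idx)
    return ent + word + theta_sq
```

The loss is zero once the related entity and the answer word both beat every competitor by at least the margin $\gamma$, which is exactly when gradient updates stop pushing the scores apart.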
For datasets with related entities annotated, we can use the loss function described above. But annotating the related entities is time- and labour-consuming, and most available datasets are not annotated. Weakly supervised learning can be applied to such data by trimming the loss function to

$$\sum_{l \neq k} \max(0, \gamma - (p^{word}_k - p^{word}_l)) + \|\Theta\|^2 \quad (9)$$

For unannotated data, fully supervised training is also possible if we regard the entities contained in questions as related entities, or if we can use other methods to identify entities that are believed to be related.

4 Experiments

To verify the effectiveness of the proposed model, we conduct experiments on several datasets, including the toy QA dataset bAbI [27], the large movie review dataset for sentiment classification [15] and the Machine Comprehension Test dataset (MCTest) [20].
bAbI

The example shown in Fig. 2 is extracted from the bAbI dataset. It contains 20 topics, each of which contains short stories, simple questions with regard to the stories, and answers. The data is generated by a simulation which behaves like a classic text adventure game. According to some pre-set rules, stories are generated in a controlled context.

Previous work reports extremely satisfying results using memory networks for most topics (around 90% for most of them). However, we notice an interesting thing: all of them, with no exception, fail on the problem of path finding, which is to predict a simple path like "north, west" given the locations of several subjects. Another one is positional reasoning. The Memory Network [27] reports accuracies of 36% and 65% for the two topics. The Dynamic Memory Network [10] reports accuracies of 35% and 60%. The proposed model (Entity-MNN) reports accuracies of 53% and 67% respectively. This is still far from satisfying, but the improvements on the two tasks indicate the superiority of the entity-based memory network. For the whole dataset, we report a mean error rate of about 12%, comparable to the 3.2 to about 24 reported by previous work [25,10,27].

The data is generated in a controlled context. As we know, QA systems trained on controlled text normally suffer when moving to real-world problems [6]. Results on this toy dataset are not as convincing as those on practical tasks. Given how the bAbI data is generated, it is easy to achieve 100% accuracy if we do simple reverse engineering to identify the entities and rules. The good results of memory networks, including our model, cannot be solely attributed to their ability of comprehension. It may be partly due to their ability to induce the entities and rules from the text. (We treat each mention of an entity as a different one when processing the text, and can ask questions about which of these mentions refer to the same entity.)
Machine Comprehension Test

We tested the proposed model on a dataset constructed from children's stories. The Machine Comprehension Test (MCTest) dataset [20] has 500 stories and 2000 questions (MC500). All of them are multiple-choice reading comprehension questions. An additional smaller set with 160 stories and 640 questions (MC160) is also included in the MCTest data and used in our work.

Since the proposed model does not consider the form of multiple-choice questions, we first need to convert the MCTest data into a suitable format. When answering a multiple-choice question, one is provided with several alternatives, of which at least one is correct. These alternatives can be regarded as known information. For each question, we replace the "Wh-" word with each alternative, so that each alternative is turned into a new declarative sentence. These generated declarative sentences are generally understandable, though they may not be grammatically correct. Then we use the proposed system to decide whether each generated sentence is correct or wrong. We do not distinguish between questions with only one answer and those with more than one answer, as the newly generated sentences are treated separately. In other words, all questions are treated as having multiple answers.

MCTest contains only hundreds of stories and is usually used for testing only, as statistical models normally require a large amount of training data. However, we still obtain satisfying results on this dataset. Table 1 demonstrates the effectiveness of the entity-based model on the MCTest dataset. We outperform the previous state-of-the-art [26,22] on both MC160 and MC500. Our model does not employ rich semantic features as others do, and hence is easy to migrate to languages other than English.
Sys.                 | MC160 Acc.(%)               | MC500 Acc.(%)
                     | Single  Multiple  Average   | Single  Multiple  Average
Richardson'13 [20]   | 76.8    62.5      69.2      | 68.0    59.5      63.3
Wang'15 [26]         | 84.2    67.9      75.3      | 72.1    67.9      69.9
Sachan'16 [22]       | -       -         -         | 72.0    68.9      70.3
EntityMNN            | -       -         76.1      | -       -         76.6
Table 1: Results on Machine Comprehension Test
Sentiment Classification

We further tested our model on the Large Movie Review Dataset [15], which is a collection of 50,000 reviews from IMDB, with about 30 reviews per movie. Each review is assigned a score from 1 (very negative) to 10 (very positive). The ratio of positive samples to negative samples is 50:50. Following previous work [15], we only consider polarized samples with scores no greater than 4 or no smaller than 7.

We present each review as a short story and then add the question "What is the opinion?". The answer is either "negative" or "positive". In this way we turn the task into a question answering problem. Note that although the answer to a question here is either "negative" or "positive", we do not put any constraints on the output. It is treated in the same way as open-domain question answering, and the system is expected to learn to predict the output by itself.
We do not use the full dataset as training takes a long time. We randomly select 10K samples (5K negative + 5K positive) for training and another 10K for testing. We obtain an accuracy of 97.2% on this subset, which is higher than previous work [15,8,9], as shown in Table 2. By exploring relations between entities, we consider information that is usually not included in classification tasks and obtain better results.

Table 2: Results on Large Movie Dataset
Discussion

The proposed model is designed based on the assumption that entities are the core of text. By updating the states of entities, the information carried by text is encoded into entities. Thus all questions related to the text can be answered based on entities alone.

Using entities enables us to break a sentence into smaller text units and analyze text at a smaller scale. As stated, if an entity $e_i$ in sentence $S_a$ interacts with another entity $e_j$ in sentence $S_b$, dealing with $e_i$ and $e_j$ directly is much easier than dealing with $S_a$ and $S_b$. The proposed model overcomes this problem, as has been proven in our experiments. A shortcoming of the proposed model is that it cannot handle text that contains very few entities. Also, hidden entities are not considered. As we know, pro-drop languages, like Japanese and Chinese, tend to omit certain classes of pronouns when they are inferable. The proposed model will encounter problems when dealing with such text.

5 Conclusion

This work presents the entity-based memory network model for text comprehension. All the information conveyed by text is encoded into the states of the entities it contains, and questions regarding the text are answered using these entities. Experiments on several tasks have proven the effectiveness of the proposed model. The proposed model is based on the assumption that entities can express all the information of text. In future research, we will further explore its ability by considering more components of text.