BookQA: Stories of Challenges and Opportunities
Stefanos Angelidis*, Lea Frermann, Diego Marcheggiani, Roi Blanco, Lluís Màrquez

Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh
School of Computing and Information Systems, The University of Melbourne
Amazon Research
[email protected], [email protected], {marchegg,roiblan,lluismv}@amazon.com

* Work done while the first author was interning at Amazon.

Abstract
We present a system for answering questions based on the full text of books (BookQA), which first selects book passages given a question at hand, and then uses a memory network to reason and predict an answer. To improve generalization, we pretrain our memory network using artificial questions generated from book sentences. We experiment with the recently published NarrativeQA corpus, on the subset of Who questions, which expect book characters as answers. We experimentally show that BERT-based retrieval and pretraining significantly improve over baseline results. At the same time, we confirm that NarrativeQA is a highly challenging dataset, and that there is a need for novel research in order to achieve high-precision BookQA results. We analyze some of the bottlenecks of the current approach, and argue that more research is needed on text representation, retrieval of relevant passages, and reasoning, including commonsense knowledge.
1 Introduction

Considerable research has looked into various Question Answering (QA) settings, ranging from retrieval-based QA (Voorhees, 2001) to recent neural approaches that reason over Knowledge Bases (KBs) (Bordes et al., 2014) or over raw text (Shen et al., 2017; Deng and Tam, 2018; Min et al., 2018). In this paper we use the NarrativeQA corpus (Kocisky et al., 2018) as a starting point and focus on the task of answering questions from the full text of books, which we call BookQA. BookQA has unique characteristics which prohibit the direct application of current QA methods. For instance: (a) books are usually orders of magnitude longer than the short texts (e.g., Wikipedia articles) used in neural QA architectures; (b) many facts about a book story are never made explicit, and require external or commonsense knowledge to infer; (c) the QA system cannot rely on pre-existing KBs; (d) traditional retrieval techniques are less effective at selecting relevant passages from self-contained book stories (Kocisky et al., 2018); (e) collecting human-annotated BookQA data is a significant challenge; and (f) stylistic disparities in the language of different books may hinder generalization.

Additionally, the style of book questions can vary significantly, and different approaches may be useful for different question types: from queries about story facts that have entities as answers (e.g., Who and Where questions), to open-ended questions that require the extraction or generation of longer answers (e.g., Why or How questions). The difference in the reasoning required for different question types can make it very hard to draw meaningful conclusions. For this reason, we concentrate on the task of answering Who questions, which expect book characters as answers (e.g., "Who is Harry Potter's best friend?"). This task allows us to simplify the output and evaluation (we look for entities, and we can apply precision-based and ranking evaluation metrics), while retaining the important elements of the original NarrativeQA task, i.e., the need to search over the full content of the book and to reason over a deep understanding of the narrative. Table 1 exemplifies the diversity and complexity of Who questions in the data by listing a set of questions from a single book, which require increasingly complex types of reasoning.

  Who is Emily in love with?
  Who is Emily imprisoned by?
  Who helps Emily escape from the castle?
  Who owns the castle in which Emily is imprisoned?
  Who became Emily's guardian after her father's death?

Table 1: Who questions from NarrativeQA for the book The Mysteries of Udolpho, by Ann Radcliffe. The diversity and complexity of questions in the corpus remains high, even when considering only the subset of Who questions that expect characters as answers.

NarrativeQA (Kocisky et al., 2018) is the first publicly available dataset for QA over long narratives, namely the full text of books and movie scripts. The full-text task has only been addressed by Tay et al. (2019), who proposed a curriculum learning-based two-phase approach (context selection and neural inference). More papers have looked into answering NarrativeQA's questions from book/movie summaries alone (Indurthi et al., 2018; Bauer et al., 2018; Tay et al., 2018a,b; Nishida et al., 2019). This is a fundamentally simpler task, because: (i) the systems need to reason over a much shorter context, i.e., the summary; and (ii) there is certainty that the answer can be found in the summary. This paper is another step in the exploration of the full NarrativeQA task, and embraces the goal of finding an answer in the complete book text. We propose a system that first selects a small subset of relevant book passages, and then uses a memory network to reason over them and extract the answer. The network is specifically adapted for generalization across books. We analyze different options for selecting relevant contexts, and for pretraining the memory network with artificially created question-answer pairs. Our key contributions are: (i) the first systematic exploration of the challenges in full-text BookQA; (ii) a full pipeline framework for the task; (iii) a published dataset of Who questions which expect book characters as answers; and (iv) a critical discussion of the shortcomings of the current QA approach and of potential avenues for future research.
2 Data

NarrativeQA was created through a large annotation effort, where participants were shown a human-curated summary of a book/script and were asked to produce question-answer pairs without referring to the full story. The main task of interest is to answer the questions by looking at the full story and not at the summary, thus ensuring that answers cannot simply be copied from the story. The full corpus contains 1,567 stories (split equally between books and movies) and 46,765 questions. We restrict our study to Who questions about books which have book characters as answers (e.g., "Who is charged with attempted murder?"). Using the book preprocessing system book-nlp (see Section 3.1) and a combination of automatic and crowdsourced efforts, we obtained a total of 3,427 QA pairs, spanning 614 books. (To obtain the BookQA data, follow the instructions at: https://github.com/stangelid/bookqa-who.)

3 Method

The length of books and the limited annotated data prohibit the application of end-to-end neural QA models that reason over the full text of a book. Instead, we opted for a pipeline approach, whose components are described below.
3.1 Book Preprocessing

Books and questions are preprocessed in advance using the book-nlp parser (Bamman et al., 2014), a system for character detection and shallow parsing in books (Iyyer et al., 2016; Frermann and Szarvas, 2017) which provides, among others: sentence segmentation, POS tagging, dependency parsing, named entity recognition, and coreference resolution. The parser identifies and clusters character mentions, so that all coreferent (direct or pronominal) character mentions are associated with the same unique character identifier.
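As a minimal illustration of this step, the sketch below rewrites a tokenized sentence so that coreferent mentions share one identifier. The span format and the @CHAR_ naming convention are our own assumptions for illustration, not book-nlp's actual output format.

# Replace coreferent character mentions with unique identifiers,
# given hypothetical (start, end, char_id) token spans from coreference.
def replace_mentions(tokens, mention_spans):
    out, i = [], 0
    for start, end, char_id in sorted(mention_spans):
        out.extend(tokens[i:start])          # copy tokens before the mention
        out.append(f"@CHAR_{char_id}")       # e.g. "Harry", "he" -> @CHAR_42
        i = end                              # skip the mention tokens
    out.extend(tokens[i:])
    return out

tokens = "When Harry arrived , he greeted Ron".split()
spans = [(1, 2, 42), (4, 5, 42), (6, 7, 7)]
print(replace_mentions(tokens, spans))
# ['When', '@CHAR_42', 'arrived', ',', '@CHAR_42', 'greeted', '@CHAR_7']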
3.2 Context Selection

To make inference over book text tractable and to give our model a better chance at predicting the correct answer, we must restrict the context to a small number of book sentences. We developed two context selection methods to retrieve relevant book passages, which we define as windows of 5 consecutive sentences.
IR-style selection (BM25F): We constructed a searchable book index that stores individual book sentences. We replace every book character mention, including pronoun references, with the character's unique identifier. At retrieval time, we similarly replace character mentions in each question, and rank passages from the corresponding book using BM25F (Zaragoza et al., 2004).
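As a rough sketch of this retrieval step, the code below uses plain BM25 from the rank_bm25 package as a stand-in for BM25F (which additionally weights document fields); the 5-sentence windowing follows the definition above, while the toy book and tokenization are illustrative.

# Rank 5-sentence passages of a book against a question with BM25.
from rank_bm25 import BM25Okapi

def build_passages(sentences, window=5):
    """Group consecutive token lists into 5-sentence passages."""
    return [sum(sentences[i:i + window], [])
            for i in range(0, len(sentences), window)]

book_sentences = [s.split() for s in [
    "@CHAR_1 was imprisoned in the castle by @CHAR_2",
    "the castle belonged to @CHAR_2",
    "@CHAR_3 helped @CHAR_1 escape at night",
]]
passages = build_passages(book_sentences)
bm25 = BM25Okapi(passages)

# Character mentions in the question are replaced the same way.
question = "Who helps @CHAR_1 escape from the castle".split()
scores = bm25.get_scores(question)
top20 = sorted(range(len(passages)), key=lambda i: -scores[i])[:20]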
BERT-based selection: We developed a neural context selection method based on the BERT language representation model (Devlin et al., 2019). A pretrained BERT model is fine-tuned to predict whether a sentence is relevant to a question, using positive (question, summary sentence) training pairs which have been heuristically matched, along with randomly sampled negative pairs. At retrieval time, a question is used to retrieve relevant passages from the full text of a book.
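The following is a minimal sketch of this relevance fine-tuning, assuming the Hugging Face transformers library; the example pairs, model size, and hyperparameters are illustrative rather than the paper's exact configuration.

# Fine-tune BERT to classify (question, sentence) pairs as relevant or not.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # relevant / not relevant

# One heuristically matched positive pair and one random negative pair.
pairs = [("Who helps Emily escape?", "Du Pont aids Emily's escape.", 1),
         ("Who helps Emily escape?", "The weather was mild that year.", 0)]

enc = tokenizer([q for q, s, _ in pairs], [s for q, s, _ in pairs],
                padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([y for _, _, y in pairs])

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**enc, labels=labels).loss   # cross-entropy over the two labels
loss.backward()
optim.step()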
3.3 Neural Inference

Having replaced character mentions in questions and books with character identifiers, we first pretrain word2vec embeddings (Mikolov et al., 2013) for all words and book characters in our corpus; character identifiers are treated like all other tokens.

Our neural inference model is a variant of the Key-Value Memory Network (KV-MemNet) (Miller et al., 2016), which has previously been applied to QA tasks over KBs and short texts. The original model was designed to handle a fixed set of potential answers across all QA examples, as do most neural QA architectures. This is in contrast with our task, where the pool of candidate characters is different for each book. Our KV-MemNet variant, illustrated in Figure 1, uses a dynamic output layer where different candidate answers are made available for different books, while the remaining model parameters are shared.

A question is initially represented as $q_0$, the average of its word embeddings. The key memories $m^{in}_1, \dots, m^{in}_k$ are filled with the $k$ most relevant sentences, as retrieved from the context selection step, using the average of their word embeddings. (Experiments with more sophisticated question/sentence representation variants showed no significant improvements.) The value memories $m^{out}_1, \dots, m^{out}_k$ contain the average embedding of all characters mentioned in the respective sentence, or a padding vector if no character is mentioned. Candidate embeddings $c_1, \dots, c_n$ hold the embeddings of every character in the current book. The model makes multiple reasoning hops $t = 1, \dots, h$ over the memories. At each hop, $q_t$ is passed through a linear layer $R_t$ and is then compared against all key memories. The sparsemax-normalized (Martins and Astudillo, 2016) attention weights $a^t_1, \dots, a^t_k$ are then used to obtain the output vector $o_t$ as the weighted average of the value memories. The process is repeated $h$ times, and the final output is passed through a linear layer $C$ before being compared against all candidate vectors via dot product to obtain the final prediction. The model is trained using negative log-likelihood.

Initialization:
    Query:      $q_{t=0} = \mathrm{avg}(v_{qw_1}, \dots, v_{qw_m})$
    Keys:       $m^{in}_i = \mathrm{avg}(v_{sw_1}, \dots, v_{sw_l})$
    Values:     $m^{out}_i = \mathrm{avg}(\{v_c : c \in s_i\})$
    Candidates: $c_j = v_{c_j}$
At hop $t$:
    $a^t_i = \mathrm{sparsemax}(q_t^\top R_t\, m^{in}_i)$
    $o_t = \sum_i a^t_i\, m^{out}_i$
    $q_{t+1} = q_t + o_t$
After last hop:
    $p(c_j) = \mathrm{softmax}(o_h^\top C\, v_{c_j})$

Figure 1: Overview of our Key-Value Memory Network for BookQA. Encodings of questions, keys (selected sentences), and values (characters mentioned in those sentences) are loaded. After multiple hops of inference, the model's output is compared against the candidate answers' encodings to make a prediction.
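For concreteness, here is a minimal PyTorch sketch of the hop mechanism described above. It assumes precomputed average embeddings for the question, key sentences, value characters, and per-book candidates, and it substitutes softmax for the paper's sparsemax to keep the sketch dependency-free; the class name and toy dimensions are ours.

import torch
import torch.nn as nn

class KVMemNet(nn.Module):
    def __init__(self, dim, hops=3):
        super().__init__()
        self.R = nn.ModuleList(nn.Linear(dim, dim, bias=False)
                               for _ in range(hops))
        self.C = nn.Linear(dim, dim, bias=False)

    def forward(self, q, keys, values, candidates):
        """q: (d,); keys, values: (k, d); candidates: (n, d), per book."""
        o = None
        for R_t in self.R:
            a = torch.softmax(keys @ R_t(q), dim=0)  # attention over keys
            o = values.T @ a                         # weighted value average
            q = q + o                                # query update per hop
        return torch.log_softmax(candidates @ self.C(o), dim=0)

# Toy usage: 100-dim embeddings, 5 memory slots, 8 candidate characters.
d = 100
model = KVMemNet(d)
logp = model(torch.randn(d), torch.randn(5, d),
             torch.randn(5, d), torch.randn(8, d))
loss = -logp[3]   # negative log-likelihood of the gold character (index 3)
loss.backward()

Passing the candidate matrix in at call time is what makes the output layer dynamic: a different set of book characters can be scored for every book, while R_t and C are shared across books.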
3.4 Pretraining with Artificial Questions

A significant obstacle to effective BookQA is the limited amount of data available for supervised training. A potential avenue for overcoming this is to pretrain the neural inference model on an auxiliary task, for which we can generate orders of magnitude more training examples. To this end, we generated 688,228 artificial questions from the book text, using a set of simple pruning rules over the dependency trees of book sentences. We used all book sentences where a character mention is the agent or the patient of an active voice verb, or the patient of a passive voice verb. Two examples are illustrated in Figure 2.
Metric →             | P@1           | P@5           | MRR
Context selection →  | BM25F | BERT  | BM25F | BERT  | BM25F | BERT
Baselines:
  Book frequency     | 15.73         | 56.29         | 0.337
  Context frequency  | 10.53 | 13.80 | 51.42 | 53.02 | 0.276 | 0.305
KV-MemNet:
  No pretraining     | 15.57 | …     | …     | …     | …     | …
  Pretraining        | …     | 18.73 | …     | …     | …     | …

Table 2: Precision scores (P@1, P@5) and Mean Reciprocal Rank (MRR) for the frequency-based baselines and our system, with and without pretraining. We report the average and standard deviation over 50 runs.
Original sentence (active): "Marriat had a gift for the invention of stories."
Artificial question: "Who had a gift for invention?"

Original sentence (passive): "Hermione was attacked by another spell."
Artificial question: "Who was attacked by spell?"

Figure 2: Examples of artificial questions generated from the dependency trees of an active voice (top) and a passive voice (bottom) sentence. In each case the correct answer is the verb's subject, the question is built from the words that survive pruning, and the remaining words of the sentence are discarded.

At the top, the active voice sentence "Marriat had a gift for the invention of stories." is transformed into the question "Who had a gift for invention?"; at the bottom, the passive voice sentence "Hermione was attacked by another spell." is transformed into the question "Who was attacked by a spell?". The previous 20 book sentences, including the source sentence, are used as context during pretraining.
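As a rough illustration of this generation process, the sketch below uses spaCy's dependency parser (the paper's pipeline used book-nlp) with a deliberately simplified pruning rule that keeps everything after the subject; the paper's rules prune the tree more aggressively, as seen in Figure 2.

# Generate a "Who ...?" question from a sentence whose root verb has a
# subject: the subject becomes the answer, the rest becomes the question.
import spacy

nlp = spacy.load("en_core_web_sm")

def make_question(sentence):
    """Return (question, answer) or None if no rule applies."""
    doc = nlp(sentence)
    root = next(doc.sents).root
    if root.pos_ != "VERB":
        return None
    subjects = [t for t in root.children
                if t.dep_ in ("nsubj", "nsubjpass")]
    if not subjects:
        return None
    answer = subjects[0]
    # Simplified pruning: keep all non-punctuation tokens after the subject.
    body = [t.text for t in doc if t.i > answer.i and t.dep_ != "punct"]
    return "Who " + " ".join(body) + "?", answer.text

print(make_question("Hermione was attacked by another spell."))
# ('Who was attacked by another spell?', 'Hermione')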
4 Experimental Setup

For every question, 100 sentences (the top 20 passages of five sentences) were selected as context using our retrieval methods. We used word and book character embeddings of 100 dimensions, and the number of reasoning hops was set to 3. When no pretraining was performed, we trained on the real QA examples for 60 epochs using Adam, reducing the learning rate by 10% every two epochs; word and character embeddings were kept fixed during training. When using pretraining, we first trained the memory network for one epoch on the auxiliary task, including the embeddings, and then fine-tuned the model on the real QA examples as described above where, again, the embeddings were fixed. We use Precision at the 1st and 5th rank (P@1 and P@5) and Mean Reciprocal Rank (MRR) as evaluation metrics. We adopted a 10-fold cross-validation approach and performed 5 trials for each cross-validation split, for a total of 50 experiments.
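For reference, here is a minimal sketch of the three evaluation metrics, assuming each system output is a ranked list of candidate characters and each question has a single gold answer.

# Compute P@1, P@5, and MRR over ranked candidate lists.
def evaluate(ranked_lists, gold_answers):
    p1 = p5 = mrr = 0.0
    for ranking, gold in zip(ranked_lists, gold_answers):
        rank = ranking.index(gold) + 1 if gold in ranking else None
        p1 += rank == 1                        # gold ranked first
        p5 += rank is not None and rank <= 5   # gold within top five
        mrr += 1.0 / rank if rank else 0.0     # reciprocal of gold's rank
    n = len(gold_answers)
    return p1 / n, p5 / n, mrr / n

# Two toy questions: gold ranked 1st and 3rd, respectively.
print(evaluate([["emily", "valancourt"], ["montoni", "annette", "emily"]],
               ["emily", "emily"]))   # -> (0.5, 1.0, 0.666...)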
Baselines: We implemented a random baseline and two frequency-based baselines, in which the most frequent character in the entire book (Book frequency) or in the selected context (Context frequency) is selected as the answer.
5 Results

Our main results are presented in Table 2. Firstly, we observe one of the dataset's biases: the book's most frequent character is the correct answer in more than 15% of examples, whereas selecting a character at random would yield the correct answer only 2.5% of the time. With regard to our BookQA pipeline, the results confirm that BookQA is a very challenging task. Without pretraining, our KV-MemNet with IR-selected contexts achieves 15.57% P@1, and it is slightly outperformed by its BERT-based counterpart. (Despite the similar performance to the Book frequency baseline, we did not observe that our model was systematically selecting the most frequent character as the answer.) When pretraining the memory network with artificial questions, the BERT-based model achieves 18.73% P@1. The same trend is observed for the other metrics.
Number of hops: We also measured the impact of the number of hops on P@1 for a pretrained model fine-tuned with BERT-selected contexts. Figure 3 shows that performance increases up to 3 hops and then stabilizes.

Figure 3: P@1 for different numbers of hops.

Figure 4: P@1 for varying context sizes from BM25F and BERT.

Correct character mentioned in context:   BM25F 69.7% | BERT 74.7%
Full evidence found in context (BM25F):   27%
Partial evidence found in context:        47%
No evidence found in context:             26%

Table 3: Percentage of contexts where the correct character is mentioned (top). Percentage of contexts where full/partial/no evidence for the answer was found, according to crowd-workers who examined a sample of 100 cases (bottom).
Context size: We expected the context size (i.e., the number of retrieved sentences stored in the memory slots of our KV-MemNet) to significantly affect performance. Smaller contexts, obtained by retrieving only the topmost relevant passages, might miss important evidence for answering the question at hand. Conversely, larger contexts might introduce noise in the form of irrelevant sentences that hinder inference. Figure 4 shows the performance of our method when varying the number of context sentences (or, equivalently, memory slots). The neural inference model struggles with very small context sizes and achieves its best performance at 75 and 100 context sentences for BM25F and BERT, respectively. For both alternatives, we observe no further improvements with larger contexts.
Pretraining size & epochs: A key component of our BookQA framework is the pretraining of the neural inference model with artificially generated questions. Although pretraining helped achieve the highest percentage of correctly answered questions, the performance gains were relatively small given the number of artificial questions used to pretrain the model. We further investigated the effect of pretraining by varying the number of artificial questions used during training and the number of pretraining epochs. Figure 5 shows the QA performance achieved on the real BookQA questions (using BM25F or BERT contexts) after pretraining on randomly sampled subsets of the artificial questions. For our BERT-based variant, the percentage of correctly answered questions increases steadily, but flattens out at around 75% of the pretraining set. On the contrary, when using BM25F contexts we achieved insignificant gains, with performance appearing constrained by the quality of the retrieved passages. Figure 6 shows P@1 scores as a function of the number of pretraining epochs. Best performance is achieved after only one epoch for both variants, indicating that further pretraining may cause the model to overfit to the simpler type of reasoning required for answering the artificial questions.

Figure 5: P@1 for varying percentages of pretraining questions used (BM25F and BERT contexts).

Figure 6: P@1 as a function of pretraining epochs for BM25F and BERT contexts.
6 Discussion

Despite the restriction to Who questions, the use of strong models for context selection and neural inference, and our pretraining efforts, the overall BookQA accuracy remains modest: our best-performing system achieves a P@1 score below 20%. Even when we only allowed our system to answer when it was very confident (according to the probability difference between the top-ranked candidate answers), it answered correctly only 35% of the time.

We have identified a number of reasons which inhibit better performance. Firstly, the passage selection process constrains the answers that can be logically inferred. Our findings regarding this claim are provided in Table 3. We calculated that the correct answer appears in the IR-selected contexts in 69.7% of cases; for BERT-selected contexts, it appears in 74.7% of cases. In practice, however, these upper bounds are not achievable: even when the correct answer appears in the context, there is no guarantee that enough evidence exists to infer it. To investigate further, we ran a survey on Amazon Mechanical Turk, where participants were asked to indicate whether the selected context (IR-retrieved) contained partial or full evidence for answering a question. For a set of 100 randomly sampled questions, participants found full evidence for answering a question in just 27% of cases. Only partial evidence was found in 47% of cases, and no evidence in the remaining 26%.

Manual inspection of context sentences indicated that a common reason for the absence of full evidence is the inherent vagueness of literary language. Authors often avoid repeated expressions or direct references to character names, thus requiring very accurate paraphrase detection and coreference resolution. We believe that commonsense knowledge is particularly crucial for improving BookQA. When exploring the output of our system, we repeatedly found cases where the model failed to arrive at the correct answer because key information was left implicit. Common examples we identified were: (i) character relationships which were clear to the reader, but never explicitly described (e.g., "Who did Mark's best friend marry?"); (ii) the attitude of a character towards an event or situation (e.g., "Who was angry at the school's policy?"); and (iii) the relative succession of events (e.g., "Who did Marriat talk to after the big fight?"). The injection of commonsense knowledge into a QA system is an open problem for QA in general and, consequently, for BookQA.

With regards to pretraining, the lack of further improvements is likely related to the difference in the type of reasoning required for answering the artificial questions and the real book questions. By construction, the artificial questions only require that the model accurately matches the source sentence, without the need for complex or multi-hop reasoning steps. In contrast, real book questions require inference over information spread across many parts of a book. We believe that our proposed auxiliary task mainly helps the model by improving the quality of the word and book character representations. It is, however, clear from our results that pretraining is an important avenue for improving BookQA accuracy, as it can increase the number of training instances by many orders of magnitude with limited human involvement. Future work should look into automatically constructing auxiliary questions that better approximate the types of reasoning required by realistic questions on the content of books.

We argue that the shortcomings discussed in the previous paragraphs, i.e., the lack of evidence in retrieved passages, the difficulty of long-term reasoning, the need for paraphrase detection and commonsense knowledge, and the challenge of useful pretraining, are not specific to Who questions. On the contrary, we expect that the requirement for novel research in these areas will generalize or, potentially, increase in the case of more general questions (e.g., open-ended questions).
7 Conclusion

We presented a pipeline BookQA system that answers character-based questions on NarrativeQA from the full book text. By constraining our study to Who questions, we simplified the task's output space while largely retaining the reasoning challenges of BookQA, and our ability to draw conclusions that will generalize to other question types. Given a Who question, our system retrieves a set of relevant passages from the book, which are then used by a memory network to infer the answer in multiple hops. A trained BERT-based retrieval system, together with the use of artificial question-answer pairs to pretrain the memory network, allowed our system to significantly outperform the lexical frequency-based baselines. The use of BERT-retrieved contexts improved upon the simpler IR-based method, although in both cases only partial evidence was found in the selected contexts for the majority of questions. Increasing the number of retrieved passages did not result in better performance, highlighting the significant challenge of accurate context selection. Pretraining on artificially generated questions provided promising improvements, but the automatic construction of realistic questions that require multi-hop reasoning remains an open problem. These results confirm the difficulty of the BookQA challenge, and indicate that novel research is needed in order to achieve high-quality BookQA. Future work on the task should focus on several aspects of the problem, including: (a) improving context selection, by combining IR and neural methods to remove noise from the selected passages, or by jointly optimizing context selection and answer extraction (Das et al., 2019); (b) better methods for encoding questions, sentences, and candidate answers, as embedding averaging results in information loss; (c) pretraining tactics that better mimic the real BookQA task; and (d) the incorporation of commonsense knowledge and structure, which was not addressed in this paper.
Acknowledgments
We would like to thank Hugo Zaragoza and Alex Klementiev for their valuable insights, feedback, and support on the work presented in this paper.
References
David Bamman, Ted Underwood, and Noah A. Smith. 2014. A Bayesian mixed effects model of literary character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 370–379. Association for Computational Linguistics.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4220–4230, Brussels, Belgium. Association for Computational Linguistics.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 615–620. Association for Computational Linguistics.

Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2019. Multi-step retriever-reader interaction for scalable open-domain question answering. In ICLR 2019.

Haohui Deng and Yik-Cheung Tam. 2018. Read and comprehend by gated-attention reader with more belief. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 83–91, New Orleans, Louisiana, USA. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Lea Frermann and György Szarvas. 2017. Inducing semantic micro-clusters from deep multi-view representations of novels. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1873–1883. Association for Computational Linguistics.

Sathish Reddy Indurthi, Seunghak Yu, Seohyun Back, and Heriberto Cuayáhuitl. 2018. Cut to the chase: A context zoom-in network for reading comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 570–575, Brussels, Belgium. Association for Computational Linguistics.

Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III. 2016. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1534–1544. Association for Computational Linguistics.

Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.

André F. T. Martins and Ramón F. Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 1614–1623. JMLR.org.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119, USA. Curran Associates Inc.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1400–1409. Association for Computational Linguistics.

Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and robust question answering from minimal context over documents. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1725–1735. Association for Computational Linguistics.

Kyosuke Nishida, Itsumi Saito, Kosuke Nishida, Kazutoshi Shinoda, Atsushi Otsuka, Hisako Asano, and Junji Tomita. 2019. Multi-style generative reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2273–2284, Florence, Italy. Association for Computational Linguistics.

Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2017. ReasoNet: Learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 1047–1055, New York, NY, USA. ACM.

Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018a. Multi-granular sequence encoding via dilated compositional units for reading comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2141–2151, Brussels, Belgium. Association for Computational Linguistics.

Yi Tay, Luu Anh Tuan, Siu Cheung Hui, and Jian Su. 2018b. Densely connected attention propagation for reading comprehension. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 4911–4922, USA. Curran Associates Inc.

Yi Tay, Shuohang Wang, Anh Tuan Luu, Jie Fu, Minh C. Phan, Xingdi Yuan, Jinfeng Rao, Siu Cheung Hui, and Aston Zhang. 2019. Simple and effective curriculum pointer-generator networks for reading comprehension over long narratives. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4922–4931, Florence, Italy. Association for Computational Linguistics.

Ellen M. Voorhees. 2001. The TREC question answering track. Natural Language Engineering, 7(4):361–378.

Hugo Zaragoza, Nick Craswell, Michael J. Taylor, Suchi Saria, and Stephen E. Robertson. 2004. Microsoft Cambridge at TREC 13: Web and HARD tracks. In Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004).