[PDF] A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task

Abstract

Enabling a computer to understand a document so that it can answer comprehension questions is a central, yet unsolved goal of NLP. A key factor impeding its solution by machine learned systems is the limited availability of human-annotated data. Hermann et al. (2015) seek to solve this problem by creating over a million training examples by pairing CNN and Daily Mail news articles with their summarized bullet points, and show that a neural network can then be trained to give good performance on this task. In this paper, we conduct a thorough examination of this new reading comprehension task. Our primary aim is to understand what depth of language understanding is required to do well on this task. We approach this from one side by doing a careful hand-analysis of a small subset of the problems and from the other by showing that simple, carefully designed systems can obtain accuracies of 73.6% and 76.6% on these two datasets, exceeding current state-of-the-art results by 7-10% and approaching what we believe is the ceiling for performance on this task.

Full PDF

AA Thorough Examination of theCNN/Daily Mail Reading Comprehension Task

Danqi Chen and

Jason Bolton and

Christopher D. Manning

Computer Science Stanford UniversityStanford, CA 94305-9020, USA { danqi,jebolton,manning } @cs.stanford.edu Abstract

Enabling a computer to understand a docu-ment so that it can answer comprehensionquestions is a central, yet unsolved goalof NLP. A key factor impeding its solu-tion by machine learned systems is the lim-ited availability of human-annotated data.Hermann et al. (2015) seek to solve thisproblem by creating over a million trainingexamples by pairing

CNN and

Daily Mail news articles with their summarized bulletpoints, and show that a neural network canthen be trained to give good performanceon this task. In this paper, we conduct athorough examination of this new readingcomprehension task. Our primary aim is tounderstand what depth of language under-standing is required to do well on this task.We approach this from one side by doing acareful hand-analysis of a small subset ofthe problems and from the other by show-ing that simple, carefully designed systemscan obtain accuracies of 73.6% and 76.6%on these two datasets, exceeding currentstate-of-the-art results by 7–10% and ap-proaching what we believe is the ceilingfor performance on this task. Reading comprehension (RC) is the ability to readtext, process it, and understand its meaning. Howto endow computers with this capacity has been anelusive challenge and a long-standing goal of Arti-ﬁcial Intelligence (e.g., (Norvig, 1978)). Genuinereading comprehension involves interpretation of Our code is available at https://github.com/danqi/rc-cnn-dailymail . https://en.wikipedia.org/wiki/Reading_comprehension the text and making complex inferences. Humanreading comprehension is often tested by askingquestions that require interpretive understandingof a passage, and the same approach has been sug-gested for testing computers (Burges, 2013).In recent years, there have been several strandsof work which attempt to collect human-labeleddata for this task – in the form of document, ques-tion and answer triples – and to learn machinelearning models directly from it (Richardson etal., 2013; Berant et al., 2014; Wang et al., 2015).However, these datasets consist of only hundreds ofdocuments, as the labeled examples usually requireconsiderable expertise and neat design, makingthe annotation process quite expensive. The sub-sequent scarcity of labeled examples prevents usfrom training powerful statistical models, such asdeep learning models, and would seem to preventa system from learning complex textual reasoningcapacities.Recently, researchers at DeepMind (Hermannet al., 2015) had the appealing, original idea ofexploiting the fact that the abundant news articlesof

CNN and

Daily Mail are accompanied by bulletpoint summaries in order to heuristically createlarge-scale supervised training data for the readingcomprehension task. Figure 1 gives an example.Their idea is that a bullet point usually summarizesone or several aspects of the article. If the computerunderstands the content of the article, it should beable to infer the missing entity in the bullet point.This is a clever way of creating supervised datacheaply and holds promise for making progress ontraining RC models; however, it is unclear whatlevel of reading comprehension is actually neededto solve this somewhat artiﬁcial task and, indeed,what statistical models that do reasonably well onthis task have actually learned.In this paper, our aim is to provide an in-depthand thoughtful analysis of this dataset and what a r X i v : . [ c s . C L ] A ug @entity4 ) if you feel a ripple in the force today , it may be the news that the official @entity6 is getting its first gay character . according to the sci-fi website @entity9 , the upcoming novel " @entity11 " will feature a capable but flawed @entity13 official named @entity14 who " also happens to be a lesbian . " the character is the first gay figure in the official @entity6 -- the movies , television shows , comics and books approved by @entity6 franchise owner @entity22 -- according to @entity24 , editor of " @entity6 " books at @entity28 imprint @entity26 . PassageQuestion characters in " @placeholder " movies have gradually become more diverse

Answer @entity6

Figure 1: An example item from dataset

CNN .level of natural language understanding is neededto do well on it. We demonstrate that simple, care-fully designed systems can obtain high, state-of-the-art accuracies of 73.6% and 76.6% on

CNN and

Daily Mail respectively. We do a carefulhand-analysis of a small subset of the problemsto provide data on their difﬁculty and what kindsof language understanding are needed to be suc-cessful and we try to diagnose what is learned bythe systems that we have built. We conclude that:(i) this dataset is easier than previously realized,(ii) straightforward, conventional NLP systems cando much better on it than previously suggested,(iii) the distributed representations of deep learn-ing systems are very effective at recognizing para-phrases, (iv) partly because of the nature of thequestions, current systems much more have thenature of single-sentence relation extraction sys-tems than larger-discourse-context text understand-ing systems, (v) the systems that we present hereare close to the ceiling of performance for single-sentence and unambiguous cases of this dataset,and (vi) the prospects for getting the ﬁnal 20% ofquestions correct appear poor, since most of theminvolve issues in the data preparation which under-mine the chances of answering the question (coref-erence errors or anonymization of entities makingunderstanding too difﬁcult).

The RC datasets introduced in (Hermann et al.,2015) are made from articles on the news websites

CNN and

Daily Mail , utilizing articles and theirbullet point summaries. Figure 1 demonstrates The datasets are available at https://github.com/deepmind/rc-data . an example : it consists of a passage p , a ques-tion q and an answer a , where the passage is anews article, the question is a cloze-style task, inwhich one of the article’s bullet points has had oneentity replaced by a placeholder, and the answeris this questioned entity. The goal is to infer themissing entity (answer a ) from all the possible en-tities which appear in the passage. A news articleis usually associated with a few (e.g., 3–5) bulletpoints and each of them highlights one aspect of itscontent.The text has been run through a Google NLPpipeline. It it tokenized, lowercased, and namedentity recognition and coreference resolution havebeen run. For each coreference chain containing atleast one named entity, all items in the chain are re-placed by an @entity n marker, for a distinct index n . Hermann et al. (2015) argue convincingly thatsuch a strategy is necessary to ensure that systemsapproach this task by understanding the passage infront of them, rather than by using world knowl-edge or a language model to answer questions with-out needing to understand the passage. However,this also gives the task a somewhat artiﬁcial charac-ter. On the one hand, systems are greatly helped byentity recognition and coreference having alreadybeen performed; on the other, they suffer when ei-ther of these modules fail, as they do (in Figure 1,“the character” should probably be coreferent with@entity14; clearer examples of failure appear lateron in our data analysis). Moreover, this inabilityto use world knowledge also makes it much moredifﬁcult for a human to do this task – occasionallyit is very difﬁcult or impossible for a human to de-termine the correct answer when presented with anitem anonymized in this way.The creation of the datasets beneﬁts from thesheer volume of news articles available online, sothey offer a large and realistic testing ground forstatistical models. Table 1 provides some statis-tics on the two datasets: there are 380k and 879ktraining examples for CNN and

Daily Mail respec-tively. The passages are around 30 sentences and800 tokens on average, while each question con-tains around 12–14 tokens.In the following sections, we seek to more deeplyunderstand the nature of this dataset. We ﬁrst buildsome straightforward systems in order to get a bet-ter idea of a lower-bound for the performance of The original article can be found at . NN Daily Mail

CNN and

Daily Mail datasets. The avg. tokens and sentences in the pas-sage, the avg. tokens in the query, and the numberof entities are based on statistics from the trainingset, but they are similar on the development andtest sets.current NLP systems. Then we turn to data analysisof a sample of the items to examine their natureand an upper bound on performance.

In this section, we describe two systems we im-plemented – a conventional entity-centric classiﬁerand an end-to-end neural network. While Hermannet al. (2015) do provide several baselines for per-formance on the RC task, we suspect that theirbaselines are not that strong. They attempt to usea frame-semantic parser, and we feel that the poorcoverage of that parser undermines the results, andis not representative of what a straightforward NLPsystem – based on standard approaches to factoidquestion answering and relation extraction devel-oped over the last 15 years – can achieve. Indeed,their frame-semantic model is markedly inferiorto another baseline they provide, a heuristic worddistance model. At present just two papers areavailable presenting results on this RC task, bothpresenting neural network approaches: (Hermannet al., 2015) and (Hill et al., 2016). While the latteris wrapped in the language of end-to-end mem-ory networks, it actually presents a fairly simplewindow-based neural network classiﬁer running onthe CNN data. Its success again raises questionsabout the true nature and complexity of the RCtask provided by this dataset, which we seek toclarify by building a simple attention-based neuralnet classiﬁer.Given the (passage, question, answer) triple ( p, q, a ) , p = { p , . . . , p m } and q = { q , . . . , q l } are sequences of tokens for the passage and question sentence, with q containing exactly one“@placeholder” token. The goal is to infer the cor-rect entity a ∈ p ∩ E that the placeholder corre-sponds to, where E is the set of all abstract entitymarkers. Note that the correct answer entity mustappear in the passage p . We ﬁrst build a conventional feature-based classi-ﬁer, aiming to explore what features are effectivefor this task. This is similar in spirit to (Wang et al.,2015), which at present has very competitive per-formance on the MCTest RC dataset (Richardsonet al., 2013). The setup of this system is to designa feature vector f p,q ( e ) for each candidate entity e ,and to learn a weight vector θ such that the correctanswer a is expected to rank higher than all othercandidate entities: θ (cid:124) f p,q ( a ) > θ (cid:124) f p,q ( e ) , ∀ e ∈ E ∩ p \ { a } (1)We employ the following feature templates:1. Whether entity e occurs in the passage.2. Whether entity e occurs in the question.3. The frequency of entity e in the passage.4. The ﬁrst position of occurence of entity e in thepassage.5. n -gram exact match: whether there is an exactmatch between the text surrounding the place-holder and the text surrounding entity e . Wehave features for all combinations of matchingleft and/or right one or two words.6. Word distance: we align the placeholder witheach occurrence of entity e , and compute the av-erage minimum distance of each non-stop ques-tion word from the entity in the passage.7. Sentence co-occurrence: whether entity e co-occurs with another entity or verb that appearsin the question, in some sentence of the passage.8. Dependency parse match: we dependency parseboth the question and all the sentences in thepassage, and extract an indicator feature ofwhether w r −→ @placeholder and w r −→ e areboth found; similar features are constructed for@placeholder r −→ w and e r −→ w . Our neural network system is based on the

Atten-tiveReader model proposed by (Hermann et al.,2015). The framework can be described in thefollowing three steps (see Figure 2): @entity4 ) if you feel a ripple in the force today , it may be the news that the official @entity6 is getting its first gay character . according to the sci-fi website @entity9 , the upcoming novel " @entity11 " will feature a capable but flawed @entity13 official named @entity14 who " also happens to be a lesbian . " the character is the first gay figure in the official @entity6 -- the movies , television shows , comics and books approved by @entity6 franchise owner @entity22 -- according to @entity24 , editor of " @entity6 " books at @entity28 imprint @entity26 .

PassageQuestion characters in " @placeholder " movies have gradually become more diverse

Answer @entity6 … …… characters in " @placeholder " movies have gradually become more diverse

Passage Question entity6

Answer

Figure 2: Our neural network architecture for the reading comprehension task.

Encoding:

First, all the words are mapped to d -dimensional vectors via an embedding ma-trix E ∈ R d ×|V| ; therefore we have p : p , . . . , p m ∈ R d and q : q , . . . , q l ∈ R d .Next we use a shallow bi-directional recur-rent neural network (RNN) with hidden size ˜ h to encode contextual embeddings ˜ p i of eachword in the passage, −→ h i = RNN ( −→ h i − , p i ) , i = 1 , . . . , m ←− h i = RNN ( ←− h i +1 , p i ) , i = m, . . . , and ˜ p i = concat ( −→ h i , ←− h i ) ∈ R h , where h =2˜ h . Meanwhile, we use another bi-directionalRNN to map the question q , . . . , q l to anembedding q ∈ R h . We choose to use GatedRecurrent Unit (GRU) (Cho et al., 2014) inour experiments because it performs similarlybut is computationally cheaper than LSTM. Attention:

In this step, the goal is to compare thequestion embedding and all the contextual em-beddings, and select the pieces of informationthat are relevant to the question. We computea probability distribution α depending on thedegree of relevance between word p i (in itscontext) and the question q and then producean output vector o which is a weighted com-bination of all contextual embeddings { ˜ p i } : α i = softmax i q (cid:124) W s ˜ p i (2) o = (cid:88) i α i ˜ p i (3) W s ∈ R h × h is used in a bilinear term, whichallows us to compute a similarity between q and ˜ p i more ﬂexibly than with just a dotproduct. Prediction:

Using the output vector o , the systemoutputs the most likely answer using: a = arg max a ∈ p ∩ E W (cid:124) a o (4)Finally, the system adds a softmax functionon top of W (cid:124) a o and adopts a negative log-likelihood objective for training. Differences from (Hermann et al., 2015).

Ourmodel basically follows the

AttentiveReader . How-ever, to our surprise, our experiments observednearly – % improvement over the original Atten-tiveReader results on

CNN and

Daily Mail datasets(discussed in Sec. 4). Concretely, our model hasthe following differences: • We use a bilinear term, instead of a tanh layerto compute the relevance (attention) betweenquestion and contextual embeddings. The ef-fectiveness of the simple bilinear attentionfunction has been shown previously for neuralmachine translation by (Luong et al., 2015). • After obtaining the weighted contextual em-beddings o , we use o for direct prediction. Incontrast, the original model in (Hermann etal., 2015) combined o and the question em-bedding q via another non-linear layer beforemaking ﬁnal predictions. We found that wecould remove this layer without harming per-formance. We believe it is sufﬁcient for themodel to learn to return the entity to which itmaximally gives attention. • The original model considers all the wordsfrom the vocabulary V in making predictions.We think this is unnecessary, and only predictamong entities which appear in the passage.f these changes, only the ﬁrst seems important;the other two just aim at keeping the model simple. Window-based MemN2Ns (Hill et al., 2016).

Another recent neural network approach proposedby (Hill et al., 2016) is based on a memory net-work architecture (Weston et al., 2015). We thinkit is highly similar in spirit. The biggest differenceis their way of encoding passages: they demon-strate that it is most effective to only use a 5-wordcontext window when evaluating a candidate en-tity and they use a positional unigram approach toencode the contextual embeddings: if a windowconsists of 5 words x , . . . , x , then it is encodedas (cid:80) i =1 E i ( x i ) , resulting in separate embeddingmatrices to learn. They encode the 5-word windowsurrounding the placeholder in a similar way andall other words in the question text are ignored. Inaddition, they simply use a dot product to computethe “relevance” between the question and a contex-tual embedding. This simple model neverthelessworks well, showing the extent to which this RCtask can be done by very local context matching. For training our conventional classiﬁer, we use theimplementation of

LambdaMART (Wu et al., 2010)in the RankLib package. We use this ranking al-gorithm since our problem is naturally a rankingproblem and forests of boosted decision trees havebeen very successful lately (as seen, e.g., in manyrecent Kaggle competitions). We do not use all thefeatures of

LambdaMART since we are only scor-ing 1/0 loss on the ﬁrst ranked proposal, rather thanusing an IR-style metric to score ranked results. Weuse Stanford’s neural network dependency parser(Chen and Manning, 2014) to parse all our docu-ment and question text, and all other features canbe extracted without additional tools.For training our neural networks, we only keepthe most frequent |V| = 50 k words (includingentity and placeholder markers), and map all otherwords to an ¡unk¿ token. We choose word embed-ding size d = 100 , and use the -dimensionalpre-trained GloVe word embeddings (Penningtonet al., 2014) for initialization. The attention andoutput parameters are initialized from a uniformdistribution between ( − . , . , and the GRU https://sourceforge.net/p/lemur/wiki/RankLib/ . weights are initialized from a Gaussian distribution N (0 , . .We use hidden size h = 128 for CNN and 256for

Daily Mail . Optimization is carried out usingvanilla stochastic gradient descent (SGD), with aﬁxed learning rate of . . We sort all the examplesby the length of its passage, and randomly samplea mini-batch of size 32 for each update. We alsoapply dropout with probability . to the embed-ding layer and gradient clipping when the norm ofgradients exceeds .Additionally, we think the original indices ofentity markers are generated arbitrarily. We attemptto relabel the entity markers based on their ﬁrstoccurrence in the passage and question and ﬁndthat this step can make training converge faster aswell bring slight gains. We report both results (withand without relabeling) for future reference.All of our models are run on a single GPU(GeForce GTX TITAN X), with roughly a runtimeof 3 hours per epoch for CNN , and 12 hours perepoch for

Daily Mail . We run all the models up to epochs and select the model that achieves thebest accuracy on the development set.We run our models 5 times independently withdifferent random seeds and report average perfor-mance across the runs. We also report ensembleresults which average the prediction probabilitiesof the 5 models. Table 2 presents our main results. The conven-tional feature-based classiﬁer obtains . ac-curacy on the CNN test set. Not only does thissigniﬁcantly outperform any of the symbolic ap-proaches reported in (Hermann et al., 2015), it alsooutperforms all the neural network systems fromtheir paper and the best single-system result re-ported so far from (Hill et al., 2016). This suggeststhat the task might not be as difﬁcult as suggested,and a simple feature set can cover many of thecases. Table 3 presents a feature ablation analysisof our entity-centric classiﬁer on the developmentportion of the

CNN dataset. It shows that n -grammatch and frequency of entities are the two mostimportant classes of features.More dramatically, our single-model neural net-work surpasses the previous results by a large mar-gin (over 5%). The relabeling process further im- The ﬁrst occurring entity is relabeled as @entity1, andthe second one is relabeled as @entity2, and so on. odel CNN Daily MailDev Test Dev TestFrame-semantic model † † † † † ‡ ‡ ‡ ∗ ∗ N/A N/AOurs: Classiﬁer 67.1 67.9 69.1 68.3Ours: Neural net 72.5 72.7 76.9 76.0Ours: Neural net (ensemble) 76.2 ∗ ∗ ∗ ∗ Ours: Neural net (relabeling)

Ours: Neural net (relabeling, ensemble) ∗ ∗ ∗ ∗ Table 2: Accuracy of all models on the

CNN and

Daily Mail datasets. Results marked † are from(Hermann et al., 2015) and results marked ‡ are from (Hill et al., 2016). Classiﬁer and

Neural net denoteour entity-centric classiﬁer and neural network systems respectively. The numbers marked with ∗ indicatethat the results are from ensemble models.Features AccuracyFull model 67.1 − whether e is in the passage 67.1 − whether e is in the question 67.0 − frequency of e − position of e − n -gram match − word distance 65.4 − sentence co-occurrence 66.0 − dependency parse match 65.6Table 3: Feature ablation analysis of our entity-centric classiﬁer on the development portion of the CNN dataset. The numbers denote the accuracyafter we exclude each feature from the full system,so a low number indicates an important feature.proves the results by . and . , pushing upthe state-of-the-art accuracies to 73.6% and 76.6%on the two datasets respectively. The ensembles of models consistently bring further − gains.Concurrently with our paper, Kadlec et al. (2016)and Kobayashi et al. (2016) also experiment onthese two datasets and report competitive results.However, our model not only still outperformstheirs, but also appears to be structurally simpler.All these recent efforts converge to similar num- bers, and we believe that they are approaching theceiling performance of this task, as we will indicatein the next section. So far, we have good results via either of our sys-tems. In this section, we aim to conduct an in-depth analysis and answer the following questions:(i) Since the dataset was created in an automaticand heuristic way, how many of the questions aretrivial to answer, and how many are noisy and notanswerable? (ii) What have these models learned?What are the prospects for further improving them?To study this, we randomly sampled 100 exam-ples from the dev portion of the

CNN dataset foranalysis (see more details in Appendix A).

After carefully analyzing these 100 examples, weroughly classify them into the following categories(if an example satisﬁes more than one category, weclassify it into the earliest one):

Exact match

The nearest words around the place-holder are also found in the passage sur-rounding an entity marker; the answer is self-evident.

Sentence-level paraphrasing

The question textategory Question PassageExactMatch it ’s clear @entity0 is leaning to-ward @placeholder , says an ex-pert who monitors @entity0 . . . @entity116 , who follows @entity0 ’s operationsand propaganda closely , recently told @entity3 , it ’sclear @entity0 is leaning toward @entity60 in terms ofdoctrine , ideology and an emphasis on holding territoryafter operations . . . .Para-phrase @placeholder says he under-stands why @entity0 wo n’t playat his tournament . . . @entity0 called me personally to let me know thathe would n’t be playing here at @entity23 , ” @entity3 said on his @entity21 event ’s website . . . .Partialclue a tv movie based on @entity2 ’sbook @placeholder casts a @en-tity76 actor as @entity5 . . . to @entity12 @entity2 professed that his @entity11 is not a religious book . . . .Multiplesent. he ’s doing a his - and - her duetall by himself , @entity6 said of @placeholder . . . we got some groundbreaking performances , here too, tonight , @entity6 said . we got @entity17 , who willbe doing some musical performances . he ’s doing a his- and - her duet all by himself . . . .Coref.Error rapper @placeholder ” disgusted ,” cancels upcoming show for @en-tity280 . . . with hip - hop star @entity246 saying on @entity247that he was canceling an upcoming show for the @en-tity249 . . . . (but @entity249 = @entity280 = SAEs)Hard pilot error and snow were reasonsstated for @placeholder planecrash . . . a small aircraft carrying @entity5 , @entity6 and@entity7 the @entity12 @entity3 crashed a few milesfrom @entity9 , near @entity10 , @entity11 . . . .Table 4: Some representative examples from each category.is entailed/rephrased by exactly one sentencein the passage, so the answer can deﬁnitely beidentiﬁed from that sentence.

Partial clue

In many cases, even though we can-not ﬁnd a complete semantic match betweenthe question text and some sentence, we arestill able to infer the answer through partialclues, such as some word/concept overlap.

Multiple sentences

It requires processing multi-ple sentences to infer the correct answer.

Coreference errors

It is unavoidable that thereare many coreference errors in the dataset.This category includes those examples withcritical coreference errors for the answer en-tity or key entities appearing in the question.Basically we treat this category as “not an-swerable”.

Ambiguous or very hard

This category includesexamples for which we think humans are notable to obtain the correct answer (conﬁdently). No. Category (%)1 Exact match 132 Paraphrasing 413 Partial clue 194 Multiple sentences 25 Coreference errors 86 Ambiguous / hard 17Table 5: An estimate of the breakdown of thedataset into classes, based on the analysis of oursampled 100 examples from the

CNN dataset.Table 5 provides our estimate of the percentagefor each category, and Table 4 presents one repre-sentative example from each category. To our sur-prise, “coreference errors” and “ambiguous/hard”cases account for of this sample set, based onour manual analysis, and this certainly will be abarrier for training models with an accuracy muchabove 75% (although, of course, a model can some-times make a lucky guess). Additionally, only 2examples require multiple sentences for inference –ategory Classiﬁer Neural netExact match 13 (100.0%) 13 (100.0%)Paraphrasing 32 (78.1%) 39 (95.1%)Partial clue 14 (73.7%) 17 (89.5%)Multiple sentences 1 (50.0%) 1 (50.0%)Coreference errors 4 (50.0%) 3 (37.5%)Ambiguous / hard 2 (11.8%) 1 (5.9%)All 66 (66.0%) 74 (74.0%)Table 6: The per-category performance of our twosystems.this is a lower rate than we expected and Hermannet al. (2015) suggest. Therefore, we hypothesizethat in most of the “answerable” cases, the goal isto identify the most relevant (single) sentence, andthen to infer the answer based upon it.

Now, we further analyze the predictions of our twosystems, based on the above categorization.As seen in Table 6, we have the following obser-vations: (i) The exact-match cases are quite sim-ple and both systems get 100% correct. (ii) Forthe ambiguous/hard and entity-linking-error cases,meeting our expectations, both of the systems per-form poorly. (iii) The two systems mainly differ inparaphrasing cases, and some of the “partial clue”cases. This clearly shows how neural networks arebetter capable of learning semantic matches involv-ing paraphrasing or lexical variation between thetwo sentences. (iv) We believe that the neural-netsystem already achieves near-optimal performanceon all the single-sentence and unambiguous cases.There does not seem to be much useful headroomfor exploring more sophisticated natural languageunderstanding approaches on this dataset.

We brieﬂy survey other tasks related to readingcomprehension.

MCTest (Richardson et al., 2013) is an open-domain reading comprehension task, in the formof ﬁctional short stories, accompanied by multiple-choice questions. It was carefully created usingcrowd sourcing, and aims at a 7-year-old readingcomprehension level.On the one hand, this dataset has a high de-mand on various reasoning capacities: over ofthe questions require multiple sentences to answer and also the questions come in assorted categories( what , why , how , whose , which , etc). On the otherhand, the full dataset has only 660 paragraphs in to-tal (each paragraph is associated with 4 questions),which renders training statistical models (especiallycomplex ones) very difﬁcult.Up to now, the best solutions (Sachan et al.,2015; Wang et al., 2015) are still heavily relyingon manually curated syntactic/semantic features,with the aid of additional knowledge (e.g., wordembeddings, lexical/paragraph databases). Children Book Test (Hill et al., 2016) was de-veloped in a similar spirit to the

CNN / Daily Mail datasets. It takes any consecutive 21 sentencesfrom a children’s book – the ﬁrst 20 sentencesare used as the passage, and the goal is to infera missing word in the 21st sentence (question andanswer). The questions are also categorized by thetype of the missing word: named entity, commonnoun, preposition or verb. According to the ﬁrststudy on this dataset (Hill et al., 2016), a languagemodel (an n -gram model or a recurrent neural net-work) with local context is sufﬁcient for predictingverbs or prepositions; however, for named entitiesor common nouns, it improves performance to scanthrough the whole paragraph to make predictions.So far, the best published results are reported bywindow-based memory networks. bAbI (Weston et al., 2016) is a collection ofartiﬁcial datasets, consisting of 20 different reason-ing types. It encourages the development of mod-els with the ability to chain reasoning, induction/deduction, etc., so that they can answer a questionlike “The football is in the playground ” after read-ing a sequence of sentences “John is in the play-ground; Bob is in the ofﬁce; John picked up thefootball; Bob went to the kitchen.” Various types ofmemory networks (Sukhbaatar et al., 2015; Kumaret al., 2016) have been shown effective on thesetasks, and Lee et al. (2016) show that vector spacemodels based on extensive problem analysis canobtain near-perfect accuracies on all the categories.Despite these promising results, this dataset is lim-ited to a small vocabulary (only 100–200 words)and simple language variations, so there is still ahuge gap from real-world datasets that we need toﬁll in. In this paper, we carefully examined the recent

CNN / Daily Mail reading comprehension task. Ourystems demonstrated state-of-the-art results, butmore importantly, we performed a careful analysisof the dataset by hand.Overall, we think the

CNN / Daily Mail datasetsare valuable datasets, which provide a promisingavenue for training effective statistical models forreading comprehension tasks. Nevertheless, weargue that: (i) this dataset is still quite noisy due toits method of data creation and coreference errors;(ii) current neural networks have almost reached aperformance ceiling on this dataset; and (iii) the re-quired reasoning and inference level of this datasetis still quite simple.As future work, we need to consider how we canutilize these datasets (and the models trained uponthem) to help solve more complex RC reasoningtasks (with less annotated data).

Acknowledgments

We thank the anonymous reviewers for theirthoughtful feedback. Stanford University gratefullyacknowledges the support of the Defense AdvancedResearch Projects Agency (DARPA) Deep Explo-ration and Filtering of Text (DEFT) Program underAir Force Research Laboratory (AFRL) contractno. FA8750-13-2-0040. Any opinions, ﬁndings,and conclusion or recommendations expressed inthis material are those of the authors and do notnecessarily reﬂect the view of the DARPA, AFRL,or the US government.

References

Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, AbbyVander Linden, Brittany Harding, Brad Huang, PeterClark, and Christopher D. Manning. 2014. Modelingbiological processes for reading comprehension. In

Empirical Methods in Natural Language Processing(EMNLP) , pages 1499–1510.Christopher J.C. Burges. 2013. Towards the machinecomprehension of text: An essay. Technical report,Microsoft Research Technical Report MSR-TR-2013-125.Danqi Chen and Christopher Manning. 2014. A fast andaccurate dependency parser using neural networks.In

Empirical Methods in Natural Language Process-ing (EMNLP) , pages 740–750.Kyunghyun Cho, Bart van Merrienboer, Caglar Gul-cehre, Dzmitry Bahdanau, Fethi Bougares, HolgerSchwenk, and Yoshua Bengio. 2014. Learningphrase representations using rnn encoder–decoder forstatistical machine translation. In

Empirical Methodsin Natural Language Processing (EMNLP) , pages1724–1734. Karl Moritz Hermann, Tomas Kocisky, Edward Grefen-stette, Lasse Espeholt, Will Kay, Mustafa Suleyman,and Phil Blunsom. 2015. Teaching machines to readand comprehend. In

Advances in Neural InformationProcessing Systems (NIPS) , pages 1684–1692.Felix Hill, Antoine Bordes, Sumit Chopra, and JasonWeston. 2016. The goldilocks principle: Readingchildren’s books with explicit memory representa-tions. In

International Conference on Learning Rep-resentations (ICLR) .Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and JanKleindienst. 2016. Text understanding with the at-tention sum reader network. In

Association for Com-putational Linguistics (ACL) .Sosuke Kobayashi, Ran Tian, Naoaki Okazaki, and Ken-taro Inui. 2016. Dynamic entity representation withmax-pooling improves machine reading. In

NorthAmerican Association for Computational Linguistics(NAACL) .Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer,James Bradbury, Ishaan Gulrajani, Victor Zhong, Ro-main Paulus, and Richard Socher. 2016. Ask meanything: Dynamic memory networks for natural lan-guage processing. In

International Conference onMachine Learning (ICML) .Moontae Lee, Xiaodong He, Wen-tau Yih, Jianfeng Gao,Li Deng, and Paul Smolensky. 2016. Reasoningin vector space: An exploratory study of questionanswering. In

International Conference on LearningRepresentations (ICLR) .Thang Luong, Hieu Pham, and Christopher D. Man-ning. 2015. Effective approaches to attention-basedneural machine translation. In

Empirical Methodsin Natural Language Processing (EMNLP) , pages1412–1421.Peter Norvig. 1978.

A Uniﬁed Theory of Inferencefor Text Understanding . Ph.D. thesis, University ofCalifornia, Berkeley.Jeffrey Pennington, Richard Socher, and ChristopherManning. 2014. Glove: Global vectors for wordrepresentation. In

Empirical Methods in NaturalLanguage Processing (EMNLP) , pages 1532–1543.Matthew Richardson, Christopher J.C. Burges, and ErinRenshaw. 2013. MCTest: A challenge dataset forthe open-domain machine comprehension of text. In

Empirical Methods in Natural Language Processing(EMNLP) , pages 193–203.Mrinmaya Sachan, Kumar Dubey, Eric Xing, andMatthew Richardson. 2015. Learning answer-entailing structures for machine comprehension. In

Association for Computational Linguistics and In-ternational Joint Conference on Natural LanguageProcessing (ACL/IJCNLP) , pages 239–249.ainbayar Sukhbaatar, arthur szlam, Jason Weston, andRob Fergus. 2015. End-to-end memory networks. In

Advances in Neural Information Processing Systems(NIPS) , pages 2431–2439.Hai Wang, Mohit Bansal, Kevin Gimpel, and DavidMcAllester. 2015. Machine comprehension withsyntax, frames, and semantics. In

Associationfor Computational Linguistics and InternationalJoint Conference on Natural Language Processing(ACL/IJCNLP) , pages 700–706.Jason Weston, Sumit Chopra, and Antoine Bordes.2015. Memory networks. In

International Confer-ence on Learning Representations (ICLR) .Jason Weston, Antoine Bordes, Sumit Chopra, andTomas Mikolov. 2016. Towards AI-complete ques-tion answering: A set of prerequisite toy tasks. In

International Conference on Learning Representa-tions (ICLR) .Qiang Wu, Christopher J. Burges, Krysta M. Svore, andJianfeng Gao. 2010. Adapting boosting for informa-tion retrieval measures.

Information Retrieval , pages254–270.

A Samples and Labeled Categories fromthe

CNN

Dataset

For the analysis in Section 5, we uniformly sam-pled 100 examples from the development set ofthe

CNN dataset. Table 8 provides a full index listof our samples and Table 7 presents our labeledcategories.Category Sample IDsExact match (13) 8, 11, 23, 27, 28, 32, 43, 57, 63, 72, 86, 87, 99Sentence-level paraphrasing (41) 0, 2, 7, 9, 12, 14, 16, 18, 19, 20, 29, 30, 31, 34, 36,37, 39, 41, 42, 44, 47, 48, 52, 54, 58, 64, 65, 66, 69,73, 74, 78, 80, 81, 82, 84, 85, 90, 92, 95, 96Partial clues (19) 4, 17, 21, 24, 35, 38, 45, 53, 55, 56, 61, 62, 75, 83,88, 89, 91, 97, 98Multiple sentences (2) 5, 76Coreference errors (8) 6, 22, 40, 46, 51, 60, 68, 94Ambiguous or very hard (17) 1, 3, 10, 13, 15, 25, 26, 33, 49, 50, 59, 67, 70, 71, 77,79, 93Table 7: Our labeled categories of the 100 samples.