Recent Advances in Neural Question Generation
Liangming Pan, Wenqiang Lei, Tat-Seng Chua and Min-Yen Kan
School of Computing, National University of Singapore, Singapore 117417
{e0272310,wenql,kanmy,chuats}@comp.nus.edu.sg

Abstract
Emerging research in Neural Question Generation (NQG) has started to integrate a larger variety of inputs and to generate questions requiring higher levels of cognition. These trends point to NQG as a bellwether for NLP, about how human intelligence embodies the skills of curiosity and integration. We present a comprehensive survey of neural question generation, examining the corpora, methodologies, and evaluation methods. From this, we elaborate on what we see as emerging trends in NQG, in terms of the learning paradigms, input modalities, and cognitive levels considered by NQG. We end by pointing out potential directions ahead.
1 Introduction

Question Generation (QG) concerns the task of "automatically generating questions from various inputs such as raw text, database, or semantic representation" (Rus et al., 2008). People have the ability to ask rich, creative, and revealing questions (Rothe et al., 2017); e.g., asking Why did Gollum betray his master Frodo Baggins? after reading the fantasy novel The Lord of the Rings. How can machines be endowed with the ability to ask relevant and to-the-point questions, given various inputs? This is a challenging task, complementary to Question Answering (QA). Both QA and QG require an in-depth understanding of the input source and the ability to reason over relevant contexts. But beyond understanding, QG additionally integrates the challenges of Natural Language Generation (NLG), i.e., generating grammatically and semantically correct questions.

QG is of practical importance: in education, forming good questions is crucial for evaluating students' knowledge and stimulating self-learning. QG can generate assessments for course materials (Heilman and Smith, 2010) or be used as a component in adaptive, intelligent tutoring systems (Lindberg et al., 2013). In dialog systems, fluent QG is an important skill for chatbots, e.g., in initiating conversations or obtaining specific information from human users. QA and reading comprehension also benefit from QG, by reducing the human labor needed to create large-scale datasets.

Traditional QG mainly focused on generating factoid questions from a single sentence or a paragraph, spurred by a series of workshops during 2008–2012 (Rus and Lester, 2009; Rus et al., 2010, 2011, 2012). Recently, driven by advances in deep learning, QG research has begun to utilize "neural" techniques, to develop end-to-end neural models to generate deeper questions (Chen et al., 2018) and to pursue broader applications (Serban et al., 2016; Mostafazadeh et al., 2016).

While there have been considerable advances made in NQG, the area lacks a comprehensive survey. This paper fills this gap by presenting a systematic survey of recent developments in NQG, focusing on three emergent trends that deep learning has brought to QG: (1) the change of learning paradigm, (2) the broadening of the input spectrum, and (3) the generation of deep questions.
2 Fundamental Aspects of NQG

For the sake of clean exposition, we first provide a broad overview of QG by conceptualizing the problem from the perspective of the three introduced aspects: (1) its learning paradigm, (2) its input modalities, and (3) the cognitive level it involves. This combines past research with recent trends, providing insights on how NQG connects to traditional QG research.
2.1 Learning Paradigm

QG research traditionally considers two fundamental aspects of question asking: "what to ask" and "how to ask". A typical QG task involves identifying the important aspects to ask about ("what to ask") and learning to realize such identified aspects as natural language ("how to ask"). Deciding what to ask is a form of machine understanding: a machine needs to capture important information dependent on the target application, akin to automatic summarization. Learning how to ask, however, focuses on aspects of language quality such as grammatical correctness, semantic preciseness, and language flexibility.

Past research took a reductionist approach, separately considering these two problems of "what" and "how" via content selection and question construction. Given a sentence or a paragraph as input, content selection selects a particular salient topic worth asking about and determines the question type (What, When, Who, etc.). Approaches take either a syntactic (Gates, 2008; Liu et al., 2010; Heilman, 2011) or semantic (Yao et al., 2012; Lindberg et al., 2013; Mazidi and Nielsen, 2014; Chali and Hasan, 2015) tack, both starting by applying syntactic or semantic parsing, respectively, to obtain intermediate symbolic representations. Question construction then converts the intermediate representations into a natural language question, taking either a transformation- or template-based approach. The former (Ali et al., 2010; Pal et al., 2010; Heilman, 2011) rearranges the surface form of the input sentence to produce the question; the latter (Chen and Mostow, 2009; Liu et al., 2012; Rokhlenko and Szpektor, 2013) generates questions from pre-defined question templates. Unfortunately, such QG architectures are limiting, as their representation is confined to the variety of intermediate representations, transformation rules, or templates.

In contrast, neural models motivate end-to-end architectures. Deep-learned frameworks contrast with the reductionist approach, admitting approaches that jointly optimize both the "what" and "how" in a unified framework. The majority of current NQG models follow the sequence-to-sequence (Seq2Seq) framework, which uses a unified representation and joint learning of content selection (via the encoder) and question construction (via the decoder). In this framework, traditional parsing-based content selection has been replaced by more flexible approaches such as attention (Bahdanau et al., 2014) and the copying mechanism (Gülçehre et al., 2016).
Question construction has become completely data-driven, requiring far less labor compared to transformation rules and enabling better language flexibility compared to question templates.

However, unlike other Seq2Seq NLG tasks, such as machine translation, image captioning, and abstractive summarization, which can be loosely regarded as learning a one-to-one mapping, generated questions can differ significantly when the intent of asking differs (e.g., the target answer, the target aspect to ask about, and the question's depth). In Section 5, we summarize different NQG methodologies based on the Seq2Seq framework, investigating how some of these QG-specific factors are integrated with neural models, and discussing what could be further explored. The change of learning paradigm in the NQG era is also represented by multi-task learning with other NLP tasks, which we discuss in Section 6.1.

2.2 Input Modalities

Question generation is an NLG task for which the input has a wealth of possibilities depending on the application. While a host of input modalities have been considered in other NLG tasks, such as text summarization (Mani, 1999), image captioning (Vinyals et al., 2015), and table-to-text generation (Lebret et al., 2016), traditional QG mainly focused on textual inputs, especially declarative sentences. This is explained by QG's original application domains of question answering and education, which also typically featured textual inputs.

Recently, with the growth of various QA applications such as Knowledge Base Question Answering (KBQA) (Cui et al., 2017) and Visual Question Answering (VQA) (Antol et al., 2015), NQG research has also widened the spectrum of sources to include knowledge bases (Khapra et al., 2017) and images (Mostafazadeh et al., 2016). This trend is also spurred by the remarkable success of neural models in feature representation, especially image features (Krizhevsky et al., 2012) and knowledge representations (Bordes et al., 2013). We discuss adapting NQG models to other input modalities in Section 6.2.
2.3 Cognitive Levels

Finally, we consider the cognitive process required behind question asking, a distinguishing factor for questions (Anderson et al., 2001). A typical framework that attempts to categorize the cognitive levels involved in question asking comes from Bloom's taxonomy (Bloom et al., 1984), which has undergone several revisions and currently has six cognitive levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating (Anderson et al., 2001).

Traditional QG focuses on the shallow levels of Bloom's taxonomy: typical QG research is on generating sentence-based factoid questions (e.g., Who, What, Where questions), whose answers are simple constituents in the input sentence (Heilman and Smith, 2010; Heilman, 2011). However, a QG system achieving human cognitive level should be able to generate meaningful questions that cater to the higher levels of Bloom's taxonomy (Desai et al., 2018), such as Why, What-if, and How questions. Traditionally, those "deep" questions are generated through shallow methods such as handcrafted templates (Liu et al., 2012; Rokhlenko and Szpektor, 2013); however, these methods lack real understanding of and reasoning over the input.

Although asking deep questions is complex, NQG's ability to generalize over voluminous data has enabled recent research to explore the comprehension and reasoning aspects of QG (Labutov et al., 2015; Rothe et al., 2017; Chen et al., 2018; Desai et al., 2018). We investigate this trend in Section 6.3, examining the limitations of the current Seq2Seq model in generating deep questions and the efforts made by existing works, indicating further directions ahead.

The rest of this paper provides a systematic survey of NQG, covering corpus and evaluation metrics before examining specific neural models.
3 Corpora

As QG can be regarded as a dual task of QA, in principle any QA dataset can be used for QG as well. However, there are at least two corpus-related factors that affect the difficulty of question generation. The first is the cognitive level required to answer the question, as discussed in the previous section. Current NQG has achieved promising results on datasets consisting mainly of shallow factoid questions, such as SQuAD (Rajpurkar et al., 2016) and MS MARCO (Nguyen et al., 2016). However, performance drops significantly on deep-question datasets, such as LearningQ (Chen et al., 2018), as shown in Section 6.3. The second factor is the answer type, i.e., the expected form of the answer, which typically has four settings: (1) the answer is a text span in the passage, usually the case for factoid questions; (2) a human-generated, abstractive answer that may not appear in the passage, usually the case for deep questions; (3) a multiple-choice question, where the question and its distractors should be jointly generated; and (4) no given answer, which requires the model to automatically learn what is worth asking. The design of an NQG system differs accordingly.

Table 1 presents a listing of the NQG corpora grouped by their cognitive level and answer type, along with their statistics. Among them, SQuAD was used by most groups as the benchmark to evaluate their NQG models, which provides a fair comparison between different techniques. However, it raises the issue that most NQG models work on factoid questions with the answer as a text span, leaving other types of QG problems less investigated, such as generating deep multiple-choice questions. To overcome this, a wider variety of corpora should be benchmarked against in future NQG research.
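The four answer-type settings above directly shape what an NQG training instance looks like. A minimal sketch of this distinction (our own illustration; the class and field names are hypothetical, not from any cited system):

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class AnswerType(Enum):
    """The four answer settings described in the text."""
    TEXT_SPAN = 1        # answer is a span of the input passage
    ABSTRACTIVE = 2      # human-generated, may not appear verbatim
    MULTIPLE_CHOICE = 3  # question and distractors generated jointly
    NO_ANSWER = 4        # model must learn what is worth asking

@dataclass
class QGInstance:
    passage: str
    answer: Optional[str]                    # None in the NO_ANSWER setting
    answer_type: AnswerType
    distractors: Optional[List[str]] = None  # only for MULTIPLE_CHOICE

example = QGInstance("Gollum betrayed Frodo.", None, AnswerType.NO_ANSWER)
```

A system that assumes TEXT_SPAN inputs, as most SQuAD-trained models do, cannot be applied unchanged to the other three settings.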
4 Evaluation Metrics

Although datasets are commonly shared between QG and QA, this is not the case for evaluation: it is challenging to define a gold standard of proper questions to ask. Meaningful, syntactically correct, semantically sound, and natural are all useful criteria, yet they are hard to quantify. Most QG systems involve human evaluation, commonly by randomly sampling a few hundred generated questions and asking human annotators to rate them on a Likert scale. The average rank or the percentage of best-ranked questions is reported and used as the quality mark.

As human evaluation is time-consuming, common automatic evaluation metrics for NLG, such as BLEU (Papineni et al., 2002), METEOR (Lavie and Denkowski, 2009), and ROUGE (Lin, 2004), are also widely used. However, some studies (Callison-Burch et al., 2006; Liu et al., 2016) have shown that these metrics do not correlate well with fluency, adequacy, and coherence, as they essentially compute the n-gram similarity between the source sentence and the generated question. To overcome this, Nema and Khapra (2018) proposed a new metric to evaluate the "answerability" of a question by calculating scores for several question-specific factors, including question type, content words, function words, and named entities. However, as it is newly proposed, it has not yet been applied to evaluate any NQG system.

To accurately measure what makes a good question, especially a deep question, improved evaluation schemes are required that specifically investigate the mechanism of question asking.

Cognitive Level | Dataset / Contributor | Answer Type | Domain | Documents | Questions | Q./Doc
Shallow | SQuAD (Rajpurkar et al., 2016) | text span | Wikipedia | 20,958 | 97,888 | 4.67
Shallow | NewsQA (Trischler et al., 2017) | text span | News | 12,744 | 119,633 | 9.39
Medium | MS MARCO (Nguyen et al., 2016) | human generated | Web article | 1,010,916 | 3,563,535 | 3.53
Medium | RACE (Lai et al., 2017) | multiple choice | Education | 27,933 | 72,547 | 2.60
Deep | LearningQ (Chen et al., 2018) | no answer | Education | 10,841 | 231,470 | 21.35
Deep | NarrativeQA (Kociský et al., 2018) | human generated | Story | 1,572 | 46,765 | 29.75

Table 1: NQG datasets grouped by their cognitive level and answer type, where the number of documents, the number of questions, and the average number of questions per document (Q./Doc) for each corpus are listed.
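The criticism of n-gram metrics is easy to see from how they are computed. A minimal sketch of the modified n-gram precision underlying BLEU (our own illustration, not the official implementation):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision, the core of BLEU: the fraction of
    candidate n-grams that also occur in the reference, with counts
    clipped by the reference counts."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(c, ref[g]) for g, c in cand.items()) / total

# Two phrasings of the same question share little surface overlap,
# so the metric penalizes a perfectly valid paraphrase:
ref = "why did gollum betray frodo".split()
cand = "what was gollum 's reason for betraying frodo".split()
p1 = ngram_precision(cand, ref, 1)  # only "gollum" and "frodo" match
```

Such surface matching is exactly why these scores correlate poorly with adequacy for QG, where many distinct questions can be equally good.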
5 Methodology

Many current NQG models follow the Seq2Seq architecture. Under this framework, given a passage (usually a sentence) X = (x_1, ..., x_n) and (possibly) a target answer A (a text span in the passage) as input, an NQG model aims to generate a question Y = (y_1, ..., y_m) asking about the target answer A in the passage X. This is defined as finding the best question Ȳ that maximizes the conditional likelihood given the passage X and the answer A:

    Ȳ = arg max_Y P(Y | X, A)
      = arg max_Y Σ_{t=1}^{m} log P(y_t | X, A, y_{<t})    (1)

Leveraging rich paragraph-level contexts around the input text is another natural consideration for producing better questions. According to Du et al. (2017), around 20% of questions in SQuAD require paragraph-level information to be answered. However, as input texts get longer, Seq2Seq models have a tougher time effectively utilizing relevant contexts while avoiding irrelevant information.

To address this challenge, Zhao et al. (2018) proposed a gated self-attention encoder to refine the encoded context by properly fusing important information with the context's self-representation, which achieved state-of-the-art results on SQuAD. The long passage consisting of the input text and its context is first embedded via LSTM, with answer position as an extra feature. The encoded representation is then fed through a gated self-matching network (Wang et al., 2017b) to aggregate information from the entire passage and embed intra-passage dependencies. Finally, a feature fusion gate (Gong and Bowman, 2018) chooses relevant information between the original and the self-matching enhanced representations.

Instead of leveraging the whole context, Du and Cardie (2018) performed pre-filtering by running a coreference resolution system on the context passage to obtain coreference clusters for both the input sentence and the answer.
The co-referred sentences are then fed into a gating network, whose outputs serve as extra features to be concatenated with the original input vectors.

The aforementioned models require the target answer as an input, where the answer essentially serves as the focus of asking. However, in the case that only the input passage is given, a QG system should automatically identify question-worthy parts within the passage. This task is synonymous with content selection in traditional QG. To date, only two works (Du and Cardie, 2017; Subramanian et al., 2018) have worked in this setting. They both follow the traditional decomposition of QG into content selection and question construction, but implement each task using neural networks. For content selection, Du and Cardie (2017) learn a sentence selection task to identify question-worthy sentences from the input paragraph using a neural sequence tagging model. Subramanian et al. (2018) train a neural keyphrase extractor to predict keyphrases of the passage. For question construction, both employed the Seq2Seq model, for which the input is either the selected sentence or the input passage with keyphrases as the target answer.

However, learning what aspect to ask about is quite challenging when the question requires reasoning over multiple pieces of information within the passage; cf. the Gollum question from the introduction. Beyond retrieving question-worthy information, we believe that studying how different reasoning patterns (e.g., inductive, deductive, causal, and analogical) affect the generation process will be an aspect for future study.

Common techniques of NLG have also been considered in NQG models, summarized as three tactics:

1. Copying Mechanism. Most NQG models (Zhou et al., 2017; Yuan et al., 2017; Wang et al., 2018; Harrison and Walker, 2018; Kumar et al., 2018a) employ the copying mechanism of Gülçehre et al. (2016), which directly copies relevant words from the source sentence to the question during decoding.
This idea is widely accepted, as it is common to refer back to phrases and entities appearing in the text when formulating factoid questions, and it is difficult for an RNN decoder to generate such rare words on its own.

2. Linguistic Features. Approaches also seek to leverage additional linguistic features that complement word embeddings, including word case, POS, and NER tags (Zhou et al., 2017; Wang et al., 2018) as well as coreference (Harrison and Walker, 2018) and dependency information (Kumar et al., 2018a). These categorical features are vectorized and concatenated with the word embeddings. The feature vectors can be either one-hot or trainable, and serve as input to the encoder.

3. Policy Gradient. Optimizing only for the ground-truth log-likelihood ignores the many equivalent ways of asking a question. Relevant QG works (Yuan et al., 2017; Kumar et al., 2018b) have adopted policy gradient methods to add task-specific rewards (such as BLEU or ROUGE) to the original objective. This helps to diversify the generated questions, as the model learns to distribute probability mass among equivalent expressions rather than the single ground-truth question.

In Table 2, we summarize existing NQG models with their employed techniques and their best-reported performance on SQuAD. These methods achieve comparable results; as of this writing, Zhao et al. (2018) is the state-of-the-art.

Two points deserve mention. First, while the copying mechanism has shown marked improvements, there exist shortcomings. Kim et al. (2019) observed many invalid answer-revealing questions attributed to the use of the copying mechanism; cf. the John Francis example in Section 5.1. They abandoned copying but still achieved a performance rivaling other systems. In parallel application areas such as machine translation, the copy mechanism has been to a large extent replaced with self-attention (Lin et al., 2017) or the transformer (Vaswani et al., 2017).
The future prospects of the copying mechanism require further investigation. Second, recent approaches that employ paragraph-level contexts have shown promising results: not only boosting performance, but also constituting a step towards deep question generation, which requires reasoning over rich contexts.

6 Emerging Trends

We discuss three trends that we wish to call practitioners' attention to as NQG evolves to take center stage in QG: multi-task learning, wider input modalities, and deep question generation.

6.1 Multi-task Learning

As QG has become more mature, work has started to investigate how QG can assist other NLP tasks, and vice versa. Some NLP tasks benefit from enriching training samples via QG to alleviate the data shortage problem. This idea has been successfully applied to semantic parsing (Guo et al., 2018a) and QA (Sachan and Xing, 2018). In the semantic parsing task that maps a natural language question to a SQL query, Guo et al. (2018a) achieved a 3% performance gain with an enlarged training set that contains pseudo-labeled (SQL, question) pairs generated by a Seq2Seq QG model. In QA, Sachan and Xing (2018) employed the idea of self-training (Nigam and Ghani, 2000).

Model | Answer Encoding | BLEU-4 | METEOR | ROUGE-L
Du et al. (2017) | not used | 12.28 | 16.62 | 39.75
Duan et al. (2017) | not used | – | – | –
Zhou et al. (2017) | answer position | – | – | –
Yuan et al. (2017) | answer position | – | – | –
Wang et al. (2018) | answer position | – | – | –
Zhao et al. (2018) | answer position | – | – | –
Du and Cardie (2018) | answer position | – | – | –
Song et al. (2018) | separate encoder | – | – | –

Table 2: Existing NQG models with their best-reported performance on SQuAD. Legend: QW: question word generation, PC: paragraph-level context, CP: copying mechanism, LF: linguistic features, PG: policy gradient.
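The copying (CP) tactic discussed above can be pictured as a pointer-generator-style mixture of a small generator vocabulary with attention weights over the source. The following is a schematic toy sketch of our own, with made-up numbers, not the code of any cited system:

```python
def copy_mixture(p_gen, vocab_probs, attention, src_tokens, vocab):
    """Pointer-generator-style output distribution:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on
    source positions where w occurs)."""
    final = {w: p_gen * p for w, p in zip(vocab, vocab_probs)}
    for a, tok in zip(attention, src_tokens):
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * a
    return final

# Toy example: "gollum" is rare (low generator probability) but is
# present in the source, so copying lifts its final probability.
probs = copy_mixture(
    p_gen=0.6,
    vocab_probs=[0.7, 0.05, 0.25],  # softmax over generator vocab
    attention=[0.1, 0.9],           # attention over two source tokens
    src_tokens=["the", "gollum"],
    vocab=["the", "gollum", "<unk>"],
)
```

Here the rare entity "gollum" ends up with high final probability only because it can be copied from the source, which is also why copying can leak answer words into the generated question, the shortcoming Kim et al. (2019) observed.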
Given a question q and a candidate answer â, they generate a question q̂ for â by way of a QG system. Since the generated question q̂ is closely related to â, the similarity between q and q̂ helps to evaluate whether â is the correct answer.

Other works focus on joint training to combine QG and QA. Wang et al. (2017a) simultaneously train the QG and QA models in the same Seq2Seq model by alternating input data between QA and QG examples. Tang et al. (2018) proposed a training algorithm that generalizes Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) to the question answering scenario. The model improves QG by incorporating an additional QA-specific loss, and improves QA performance by adding artificially generated training instances from QG. However, while joint training has shown some effectiveness, due to the mixed objectives its performance on QG is lower than the state-of-the-art results, which leaves room for future exploration.

6.2 Wider Input Modalities

QG work has now incorporated input from knowledge bases (KBQG) and images (VQG). Inspired by the use of SQuAD as a question benchmark, Serban et al. (2016) created a 30M large-scale dataset of (KB triple, question) pairs to spur KBQG work. They baselined an attention Seq2Seq model to generate the target factoid question. Due to KB sparsity, many entities and predicates are unseen or rarely seen at training time. ElSahar et al. (2018) address these few-shot and zero-shot issues by applying the copying mechanism and incorporating textual contexts to enrich the information for rare entities and relations. Since a single KB triple provides only limited information, KB-generated questions also overgeneralize: a model asks "Who was born in New York?" when given the triple (Donald Trump, Place_of_birth, New York). To solve this, Khapra et al. (2017) enrich the input with a sequence of keywords collected from its related triples.

Visual Question Generation (VQG) is another emerging topic, which aims to ask questions given an image.
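The KB overgeneralization problem above is easy to reproduce with even the simplest baseline. A hypothetical template sketch (our own toy illustration; the predicate names and templates are made up, not from any cited system):

```python
# Toy KBQG baseline: map a predicate to a question template and
# fill it with the triple's object. Because the subject is dropped,
# every (*, place_of_birth, New York) triple yields the same
# overgeneral question.
TEMPLATES = {
    "place_of_birth": "Who was born in {obj}?",
    "author": "Who wrote {obj}?",
}

def triple_to_question(subj, pred, obj):
    """Generate a question from a (subject, predicate, object) triple."""
    return TEMPLATES[pred].format(obj=obj)

q = triple_to_question("Donald Trump", "place_of_birth", "New York")
```

Every subject with the same predicate and object collapses to an identical question, which is what keyword enrichment from related triples (Khapra et al., 2017) is meant to prevent.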
We categorize VQG into grounded and open-ended VQG by the level of cognition. Grounded VQG generates visually grounded questions, i.e., all relevant information for the answer can be found in the input image (Zhang et al., 2017). A key purpose of grounded VQG is to support dataset construction for VQA. To ensure the questions are grounded, existing systems rely on image captions to varying degrees. Ren et al. (2015) and Zhu et al. (2016) simply convert image captions into questions using rule-based methods with textual patterns. Zhang et al. (2017) proposed a neural model that can generate questions with diverse types for a single image, using separate networks to construct dense image captions and to select question types.

In contrast to grounded QG, humans ask higher-cognitive-level questions about what can be inferred rather than what can be seen from an image. Motivated by this, Mostafazadeh et al. (2016) proposed open-ended VQG, which aims to generate natural and engaging questions about an image. These are deep questions that require high cognition, such as analyzing and creating. With significant progress in deep generative models, marked by variational auto-encoders (VAEs) and GANs, such models are also used in open-ended VQG to bring "creativity" into the generated questions (Jain et al., 2017; Fan et al., 2018), showing promising results. This also brings hope for addressing deep QG from text, as applied in NLG, e.g., SeqGAN (Yu et al., 2017) and LeakGAN (Guo et al., 2018c).

6.3 Deep Question Generation

Endowing a QG system with the ability to ask deep questions will help us build curious machines that can interact with humans in a better manner. However, Rus et al. (2007) pointed out that asking high-quality deep questions is difficult, even for humans: a study by Graesser and Person (1994) found that college students ask few deep-reasoning questions per hour, even in a question-encouraging tutoring session.
These deep questions are often about events, evaluation, opinions, syntheses, or reasons, corresponding to higher-order cognitive levels.

To verify the effectiveness of existing NQG models in generating deep questions, Chen et al. (2018) conducted an empirical study applying the attention Seq2Seq model to LearningQ, a deep-question-centric dataset in which over 60% of questions require reasoning over multiple sentences or external knowledge to answer. The results were poor: the model achieved miniscule BLEU-4 and METEOR scores, far below those reported on SQuAD. Although further in-depth analysis is needed to explore the reasons behind this, we believe there are two plausible explanations: (1) Seq2Seq models handle long inputs ineffectively, and (2) Seq2Seq models lack the ability to reason over multiple pieces of information.

Despite still having a long way to go, some works have set out a path forward. A few early QG works attempted to solve this by building deep semantic representations of the entire text, using concept maps over keywords (Olney et al., 2012) or minimal recursion semantics (Yao and Zhang, 2010) to reason over concepts in the text. Labutov et al. (2015) proposed a crowdsourcing-based workflow that involves building an intermediate ontology for the input text, soliciting question templates through crowdsourcing, and generating deep questions based on template retrieval and ranking. Although this process is semi-automatic, it provides a practical and efficient way towards deep QG. In a separate line of work, Rothe et al.
(2017) proposed a framework that simulates how people ask deep questions by treating questions as formal programs that execute on the state of the world, outputting an answer.

Based on our survey, we believe the roadmap towards deep NQG points to research that will (1) enhance the NQG model with the ability to consider relationships among multiple source sentences, (2) explicitly model typical reasoning patterns, and (3) understand and simulate the mechanism behind human question asking.

7 Conclusion

We have presented a comprehensive survey of NQG, categorizing current NQG models based on different QG-specific and common technical variations, and summarizing three emerging trends in NQG: multi-task learning, wider input modalities, and deep question generation.

What's next for NQG? We end with potential future directions obtained by applying past insights to current NQG models; the "unknown unknown": promising directions not yet explored.

When to Ask: Besides learning what and how to ask, in many real-world applications where questions play an important role, such as automated tutoring and conversational systems, learning when to ask becomes an important issue. In contrast to general dialog management (Lee et al., 2010), no research has explored when a machine should ask an engaging question in dialog. Modeling question asking as an interactive and dynamic process may become an interesting topic ahead.

Personalized QG: Question asking is quite personalized: people with different characters and knowledge backgrounds ask different questions. However, integrating QG with user modeling in dialog management or recommendation systems has not yet been explored. Explicitly modeling user state and awareness leads us towards personalized QG, which dovetails deep, end-to-end QG with deep user modeling and pairs the dual of generation–comprehension much in the same vein as in the vision–image generation area.

References

Husam Ali, Yllias Chali, and Sadid A Hasan.
2010. Automation of question generation from sentences. In Proceedings of QG2010: The Third Workshop on Question Generation, pages 58–67.

Lorin W Anderson, David R Krathwohl, Peter W Airasian, Kathleen A Cruikshank, Richard E Mayer, Paul R Pintrich, James Raths, and Merlin C Wittrock. 2001. A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives, abridged edition. White Plains, NY: Longman.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In IEEE International Conference on Computer Vision (ICCV), pages 2425–2433.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Benjamin Samuel Bloom, Max D Engelhart, Edward J Furst, Walker H Hill, and David R Krathwohl. 1984. Taxonomy of educational objectives: Handbook 1: Cognitive domain.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Annual Conference on Neural Information Processing Systems (NIPS), pages 2787–2795.

Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Yllias Chali and Sadid A. Hasan. 2015. Towards topic-to-question generation. Computational Linguistics (CL), 41(1):1–20.

Guanliang Chen, Jie Yang, Claudia Hauff, and Geert-Jan Houben. 2018. LearningQ: A large-scale dataset for educational question generation. In International Conference on Web and Social Media (ICWSM), pages 481–490.

Wei Chen and Jack Mostow. 2009. Generating questions automatically from informational text.
In International Conference on Artificial Intelligence in Education (AIED), pages 17–24.

Wanyun Cui, Yanghua Xiao, Haixun Wang, Yangqiu Song, Seung-won Hwang, and Wei Wang. 2017. KBQA: Learning question answering over QA corpora and knowledge bases. Proceedings of the VLDB Endowment (PVLDB), 10(5):565–576.

Takshak Desai, Parag Dakle, and Dan Moldovan. 2018. Generating questions for reading comprehension using coherence relations. In The 5th Workshop on Natural Language Processing Techniques for Educational Applications (NLP-TEA@ACL), pages 1–10.

Xinya Du and Claire Cardie. 2017. Identifying where to focus in reading comprehension for neural question generation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2067–2073.

Xinya Du and Claire Cardie. 2018. Harvesting paragraph-level question-answer pairs from Wikipedia. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 1907–1917.

Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 1342–1352.

Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. 2017. Question generation for question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 866–874.

Hady ElSahar, Christophe Gravier, and Frédérique Laforest. 2018. Zero-shot question generation from knowledge graphs for unseen predicates and entity types. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pages 218–228.

Zhihao Fan, Zhongyu Wei, Siyuan Wang, Yang Liu, and Xuanjing Huang. 2018. A reinforcement learning framework for natural question generation using bi-discriminators. In International Conference on Computational Linguistics (COLING), pages 1763–1774.

D Gates. 2008. Generating look-back strategy questions from expository texts.
In The Workshop onthe Question Generation Shared Task and Evalua-tion Challenge .ichen Gong and Samuel R. Bowman. 2018. Ruminat-ing reader: Reasoning with gated multi-hop atten-tion. In Workshop on Machine Reading for QuestionAnswering@ACL , pages 1–11.Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza,Bing Xu, David Warde-Farley, Sherjil Ozair,Aaron C. Courville, and Yoshua Bengio. 2014. Gen-erative adversarial nets. In Annual Conferenceon Neural Information Processing Systems (NIPS) ,pages 2672–2680.Arthur C Graesser and Natalie K Person. 1994. Ques-tion asking during tutoring. American EducationalResearch Journal , 31(1):104–137.C¸ aglar G¨ulc¸ehre, Sungjin Ahn, Ramesh Nallapati,Bowen Zhou, and Yoshua Bengio. 2016. Pointingthe unknown words. In Annual Meeting of the Asso-ciation for Computational Linguistics (ACL) .Daya Guo, Yibo Sun, Duyu Tang, Nan Duan, Jian Yin,Hong Chi, James Cao, Peng Chen, and Ming Zhou.2018a. Question generation from SQL queries im-proves neural semantic parsing. In Conference onEmpirical Methods in Natural Language Processing(EMNLP) , pages 1597–1607.Han Guo, Ramakanth Pasunuru, and Mohit Bansal.2018b. Soft layer-specific multi-task summarizationwith entailment and question generation. In AnnualMeeting of the Association for Computational Lin-guistics (ACL) , pages 687–697.Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, YongYu, and Jun Wang. 2018c. Long text generationvia adversarial training with leaked information. In AAAI Conference on Artificial Intelligence (AAAI) ,pages 5141–5148.Vrindavan Harrison and Marilyn A. Walker. 2018.Neural generation of diverse questions using answerfocus, contextual and linguistic features. In Interna-tional Conference on Natural Language Generation(INLG) , pages 296–306.Michael Heilman. 2011. Automatic factual questiongeneration from text. Language Technologies Insti-tute School of Computer Science Carnegie MellonUniversity , 195.Michael Heilman and Noah A. Smith. 2010. Goodquestion! 
statistical ranking for question generation.In Annual Conference of the North American Chap-ter of the Association for Computational Linguistics(NAACL-HLT) , pages 609–617.Unnat Jain, Ziyu Zhang, and Alexander G. Schwing.2017. Creativity: Generating diverse questionsusing variational autoencoders. In IEEE Confer-ence on Computer Vision and Pattern Recognition(CVPR) , pages 5415–5424.Mitesh M. Khapra, Dinesh Raghu, Sachindra Joshi,and Sathish Reddy. 2017. Generating natural lan-guage question-answer pairs from a knowledge graph using a RNN based question generationmodel. In Conference of the European Chapterof the Association for Computational Linguistics(EACL) , pages 376–385.Yanghoon Kim, Hwanhee Lee, Joongbo Shin, and Ky-omin Jung. 2019. Improving neural question gener-ation using answer separation. In AAAI Conferenceon Artificial Intelligence (AAAI) .Tom´as Kocisk´y, Jonathan Schwarz, Phil Blunsom,Chris Dyer, Karl Moritz Hermann, G´abor Melis, andEdward Grefenstette. 2018. The narrativeqa read-ing comprehension challenge. Transactions of theAssociation for Computational Linguistics (TACL) ,6:317–328.Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hin-ton. 2012. Imagenet classification with deep con-volutional neural networks. In Annual Conferenceon Neural Information Processing Systems (NIPS) ,pages 1106–1114.Vishwajeet Kumar, Kireeti Boorla, Yogesh Meena,Ganesh Ramakrishnan, and Yuan-Fang Li. 2018a.Automating reading comprehension by generatingquestion and answer pairs. In The Pacific-Asia Con-ference on Knowledge Discovery and Data Mining(PAKDD) , pages 335–348.Vishwajeet Kumar, Ganesh Ramakrishnan, and Yuan-Fang Li. 2018b. A framework for automatic ques-tion generation from text using deep reinforcementlearning. CoRR , abs/1808.04961.Igor Labutov, Sumit Basu, and Lucy Vanderwende.2015. 
Deep questions without deep understanding.In Annual Meeting of the Association for Computa-tional Linguistics (ACL) , pages 889–898.Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang,and Eduard H. Hovy. 2017. RACE: large-scale read-ing comprehension dataset from examinations. In Conference on Empirical Methods in Natural Lan-guage Processing (EMNLP) , pages 785–794.Alon Lavie and Michael J. Denkowski. 2009. Themeteor metric for automatic evaluation of machinetranslation. Machine Translation , 23(2-3):105–115.R´emi Lebret, David Grangier, and Michael Auli. 2016.Neural text generation from structured data with ap-plication to the biography domain. In Conference onEmpirical Methods in Natural Language Processing(EMNLP) , pages 1203–1213.Cheongjae Lee, Sangkeun Jung, Kyungduk Kim,Donghyeon Lee, and Gary Geunbae Lee. 2010. Re-cent approaches to dialog management for spokendialog systems. Journal of Computing Science andEngineering (JCSE) , 4(1):1–22.Chin-Yew Lin. 2004. Rouge: A package for auto-matic evaluation of summaries. Text SummarizationBranches Out .houhan Lin, Minwei Feng, C´ıcero Nogueira dos San-tos, Mo Yu, Bing Xiang, Bowen Zhou, and YoshuaBengio. 2017. A structured self-attentive sentenceembedding. CoRR , abs/1703.03130.David Lindberg, Fred Popowich, John C. Nesbit, andPhilip H. Winne. 2013. Generating natural lan-guage questions to support learning on-line. In Eu-ropean Workshop on Natural Language Generation(ENLG) , pages 105–114.Chia-Wei Liu, Ryan Lowe, Iulian Serban, MichaelNoseworthy, Laurent Charlin, and Joelle Pineau.2016. How NOT to evaluate your dialogue sys-tem: An empirical study of unsupervised evaluationmetrics for dialogue response generation. In Con-ference on Empirical Methods in Natural LanguageProcessing (EMNLP) , pages 2122–2132.Ming Liu, Rafael A. Calvo, and Vasile Rus. 2010.Automatic question generation for literature reviewwriting support. In International Conference on In-telligent Tutoring Systems (ITS) , pages 45–54.Ming Liu, Rafael A. 
Calvo, and Vasile Rus. 2012. G-asks: An intelligent automatic question generationsystem for academic writing support. Dialogue andDiscourse (D&D) , 3(2):101–124.Inderjeet Mani. 1999. Advances in automatic text sum-marization . MIT press.Karen Mazidi and Rodney D. Nielsen. 2014. Linguis-tic considerations in automatic question generation.In Annual Meeting of the Association for Computa-tional Linguistics (ACL) , pages 321–326.Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Mar-garet Mitchell, Xiaodong He, and Lucy Vander-wende. 2016. Generating natural questions aboutan image. In Annual Meeting of the Association forComputational Linguistics (ACL) .Preksha Nema and Mitesh M. Khapra. 2018. Towards abetter metric for evaluating question generation sys-tems. In Conference on Empirical Methods in Nat-ural Language Processing (EMNLP) , pages 3950–3959.Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao,Saurabh Tiwary, Rangan Majumder, and Li Deng.2016. MS MARCO: A human generated machinereading comprehension dataset. In Proceedings ofthe NIPS Workshop on Cognitive Computation: In-tegrating neural and symbolic approaches .Kamal Nigam and Rayid Ghani. 2000. Analyzing theeffectiveness and applicability of co-training. In In-ternational Conference on Information and Knowl-edge Management (CIKM) , pages 86–93.Andrew McGregor Olney, Arthur C. Graesser, and Na-talie K. Person. 2012. Question generation fromconcept maps. Dialogue and Discourse (D&D) ,3(2):75–99. Santanu Pal, Tapabrata Mondal, Partha Pakray, Di-pankar Das, and Sivaji Bandyopadhyay. 2010. Qg-stec system description–juqgg: A rule based ap-proach. Proceedings of QG2010: The Third Work-shop on Question Generation , pages 76–79.Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automaticevaluation of machine translation. In Annual Meet-ing of the Association for Computational Linguistics(ACL) , pages 311–318.Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, andPercy Liang. 2016. 
Squad: 100, 000+ questions formachine comprehension of text. In Conference onEmpirical Methods in Natural Language Processing(EMNLP) , pages 2383–2392.Mengye Ren, Ryan Kiros, and Richard S. Zemel. 2015.Exploring models and data for image question an-swering. In Annual Conference on Neural Informa-tion Processing Systems (NIPS) , pages 2953–2961.Oleg Rokhlenko and Idan Szpektor. 2013. Generat-ing synthetic comparable questions for news articles.In Annual Meeting of the Association for Computa-tional Linguistics (ACL) , pages 742–751.Anselm Rothe, Brenden M. Lake, and Todd M.Gureckis. 2017. Question asking as program gener-ation. In Annual Conference on Neural InformationProcessing Systems (NIPS) , pages 1046–1055.Vasile Rus, Zhiqiang Cai, and Art Graesser. 2008.Question generation: Example of a multi-year evalu-ation campaign. Online Proceedings of 1st QuestionGeneration Workshop, NSF, Arlington, VA. Vasile Rus, Zhiqiang Cai, and Arthur C. Graesser.2007. Experiments on generating questions aboutfacts. In Computational Linguistics and IntelligentText Processing (CICLing) , pages 444–455.Vasile Rus and James C. Lester. 2009. The 2ndworkshop on question generation. In InternationalConference on Artificial Intelligence in Education(AIED) , page 808.Vasile Rus, Brendan Wyse, Paul Piwek, Mihai Lintean,Svetlana Stoyanchev, and Cristian Moldovan. 2010.Overview of the first question generation shared taskevaluation challenge. In Proceedings of QG2010:The Third Workshop on Question Generation , pages45–57.Vasile Rus, Brendan Wyse, Paul Piwek, Mihai C. Lin-tean, Svetlana Stoyanchev, and Cristian Moldovan.2011. Question generation shared task and evalu-ation challenge - status report. In European Work-shop on Natural Language Generation (ENLG) ,pages 318–320.Vasile Rus, Brendan Wyse, Paul Piwek, Mihai C. Lin-tean, Svetlana Stoyanchev, and Cristian Moldovan.2012. A detailed account of the first question gen-eration shared task evaluation challenge. 
Dialogueand Discourse (D&D) , 3(2):177–204.rinmaya Sachan and Eric P. Xing. 2018. Self-training for jointly learning to ask and answer ques-tions. In Annual Conference of the North AmericanChapter of the Association for Computational Lin-guistics (NAACL-HLT) , pages 629–640.Iulian Vlad Serban, Alberto Garc´ıa-Dur´an, C¸ aglarG¨ulc¸ehre, Sungjin Ahn, Sarath Chandar, Aaron C.Courville, and Yoshua Bengio. 2016. Generatingfactoid questions with recurrent neural networks:The 30m factoid question-answer corpus. In AnnualMeeting of the Association for Computational Lin-guistics (ACL) .Linfeng Song, Zhiguo Wang, Wael Hamza, Yue Zhang,and Daniel Gildea. 2018. Leveraging context infor-mation for natural question generation. In AnnualConference of the North American Chapter of theAssociation for Computational Linguistics (NAACL-HLT) , pages 569–574.Sandeep Subramanian, Tong Wang, Xingdi Yuan,Saizheng Zhang, Adam Trischler, and Yoshua Ben-gio. 2018. Neural models for key phrase extractionand question generation. In Workshop on MachineReading for Question Answering@ACL , pages 78–88.Xingwu Sun, Jing Liu, Yajuan Lyu, Wei He, Yan-jun Ma, and Shi Wang. 2018. Answer-focused andposition-aware neural question generation. In Con-ference on Empirical Methods in Natural LanguageProcessing (EMNLP) , pages 3930–3939.Duyu Tang, Nan Duan, Zhao Yan, Zhirui Zhang, YiboSun, Shujie Liu, Yuanhua Lv, and Ming Zhou. 2018.Learning to collaborate for question answering andasking. In Annual Conference of the North Amer-ican Chapter of the Association for ComputationalLinguistics (NAACL-HLT) , pages 1564–1574.Adam Trischler, Tong Wang, Xingdi Yuan, Justin Har-ris, Alessandro Sordoni, Philip Bachman, and Ka-heer Suleman. 2017. Newsqa: A machine com-prehension dataset. In Workshop on RepresentationLearning for NLP (Rep4NLP@ACL) , pages 191–200.Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N. Gomez, LukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. 
In Annual Conference on Neural Informa-tion Processing Systems (NIPS) , pages 6000–6010.Oriol Vinyals, Alexander Toshev, Samy Bengio, andDumitru Erhan. 2015. Show and tell: A neural im-age caption generator. In IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) , pages3156–3164.Tong Wang, Xingdi Yuan, and Adam Trischler. 2017a.A joint model for question answering and questiongeneration. CoRR , abs/1706.01450. Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang,and Ming Zhou. 2017b. Gated self-matching net-works for reading comprehension and question an-swering. In Annual Meeting of the Association forComputational Linguistics (ACL) , pages 189–198.Zichao Wang, Andrew S. Lan, Weili Nie, Andrew E.Waters, Phillip J. Grimaldi, and Richard G. Bara-niuk. 2018. Qg-net: a data-driven question genera-tion model for educational content. In Annual ACMConference on Learning at Scale (L@S) , pages 7:1–7:10.Xuchen Yao, Gosse Bouma, and Yi Zhang. 2012.Semantics-based question generation and imple-mentation. Dialogue and Discourse (D&D) ,3(2):11–42.Xuchen Yao and Yi Zhang. 2010. Question generationwith minimal recursion semantics. In Proceedingsof QG2010: The Third Workshop on Question Gen-eration , pages 68–75.Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu.2017. Seqgan: Sequence generative adversarial netswith policy gradient. In AAAI Conference on Artifi-cial Intelligence (AAAI) , pages 2852–2858.Xingdi Yuan, Tong Wang, C¸ aglar G¨ulc¸ehre, Alessan-dro Sordoni, Philip Bachman, Saizheng Zhang,Sandeep Subramanian, and Adam Trischler. 2017.Machine comprehension by text-to-text neural ques-tion generation. In The 2nd Workshop on Represen-tation Learning for NLP (Rep4NLP@ACL) , pages15–25.Shijie Zhang, Lizhen Qu, Shaodi You, Zhenglu Yang,and Jiawan Zhang. 2017. Automatic generation ofgrounded visual questions. In International JointConference on Artificial Intelligence (IJCAI) , pages4235–4243.Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke.2018. 
Paragraph-level neural question generationwith maxout pointer and gated self-attention net-works. In Conference on Empirical Methods in Nat-ural Language Processing (EMNLP) , pages 3901–3910.Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan,Hangbo Bao, and Ming Zhou. 2017. Neu-ral question generation from text: A preliminarystudy. In CCF International Conference of Natu-ral Language Processing and Chinese Computing(NLPCC) , pages 662–671.Yuke Zhu, Oliver Groth, Michael S. Bernstein, andLi Fei-Fei. 2016. Visual7w: Grounded question an-swering in images. In