Asking Complex Questions with Multi-hop Answer-focused Reasoning
Xiyao Ma, Qile Zhu, Yanlin Zhou, Xiaolin Li, Dapeng Wu
NSF Center for Big Learning, University of Florida; Cognization Lab
{maxiy, valder, zhou.y}@ufl.edu, [email protected], [email protected]

Abstract
Asking questions from natural language text has attracted increasing attention recently, and several schemes have been proposed with promising results by asking the right question words and copying relevant words from the input to the question. However, most state-of-the-art methods focus on asking simple questions involving single-hop relations. In this paper, we propose a new task called multi-hop question generation, which asks complex and semantically relevant questions by additionally discovering and modeling the multiple entities and their semantic relations, given a collection of documents and the corresponding answer. To solve the problem, we propose multi-hop answer-focused reasoning on the grounded answer-centric entity graph to include different granularity levels of semantic information, including the word-level and document-level semantics of the entities and their semantic relations. Through extensive experiments on the HOTPOTQA dataset, we demonstrate the superiority and effectiveness of our proposed model, which serves as a baseline to motivate future work.

Footnote: Our work is similar to Pan et al. (2020); we proposed a similar novel and challenging task on the HOTPOTQA dataset independently. The major differences are: (1) We build the graph for reasoning following different heuristics. Specifically, Pan et al. (2020) mainly adopt SRL and the dependency parse tree, while we utilize NER, coreference resolution, and surface matching. (2) Pan et al. (2020) include all the data examples in the HOTPOTQA dataset for training and validation. However, after a deep dive into the dataset, we argue that questions of the "comparison" type are not suitable for the proposed task, as they do not require QG models to discover and gather multi-hop semantic relations among the entities. (3) Different from Pan et al. (2020), where no test set is available to evaluate models, we propose to combine the training and dev sets and split them into training, dev, and test sets; see Section 3.1 for details. The dataset and code are available at https://github.com/Shawn617/Multi-hop-NQG
1 Introduction

Given a background context and the corresponding answer, the question generation (QG) task aims to ask a semantically relevant question. QG has considerable benefits in education scenarios, dialogue systems, and question answering (Du et al., 2017). Recently, many approaches have been proposed to solve the problem (Zhou et al., 2017; Sun et al., 2018; Ma et al., 2019), mostly realized by variants of the seq-to-seq model (Sutskever et al., 2014) with attention and copy mechanisms (Cho et al., 2014; Bahdanau et al., 2014).

However, existing works mainly focus on asking a simple question $Y = \{y_t\}_{t=1}^{N}$ by capturing only one direct relation among the entities from the context input $X = \{x_t\}_{t=1}^{M}$. Taking one example from the SQuAD dataset (Rajpurkar et al., 2016), shown in the upper part of Table 1, the model only needs to capture the single-hop relation between the entity "Donald Davies" and the answer entity "Message Routing Methodology" and ask the question "What did Donald Davies develop?"

In this paper, we propose a new task called multi-hop neural question generation. Given a collection of documents $D = \{d_i\}_{i=1}^{I} = \{X_{text}^{i}, X_{title}^{i}\}_{i=1}^{I}$, each containing a context $X_{text}^{i}$ and a title $X_{title}^{i}$, and assuming that the answer $A$ exists in at least one document, the model aims to generate a complex and semantically relevant question $Y = \{y_t\}_{t=1}^{N}$ involving multiple entities and their semantic relations. One example is shown in the lower part of Table 1. The model needs to discover and capture the entities (e.g., "Peggy Seeger", "Ewan MacColl", and "James Henry Miller") and their relations (e.g., "Peggy Seeger" was married to "James Henry Miller", and "Ewan MacColl" is the stage name of "James Henry Miller"), and then ask the question "What nationality was James Henry Miller's wife?" according to the answer "American".
Table 1: Comparison of the single-hop question generation task on the SQuAD dataset (Rajpurkar et al., 2016) and the proposed multi-hop question generation task on the HOTPOTQA dataset (Yang et al., 2018).

Single-hop Question Generation
Document: Starting in 1965, Donald Davies at the National Physical Laboratory, UK, independently developed the same Message Routing Methodology as developed by Baran.
Question: What did Donald Davies develop?
Multi-hop Question Generation
Document 1: [Peggy Seeger] Margaret "Peggy" Seeger (born June 17, 1935) is an American folksinger. She is also well known in Britain, where she has lived for more than 30 years, and was married to the singer and songwriter Ewan MacColl until his death in 1989.
Document 2: [Ewan MacColl] James Henry Miller (25 January 1915 – 22 October 1989), better known by his stage name Ewan MacColl, was an English folk singer, songwriter, communist, labour activist, actor, poet, playwright and record producer.
Question: What nationality was James Henry Miller's wife?
In addition to the common challenges in the single-hop question generation task, where the model needs to understand, paraphrase, and re-organize the semantic information from the answer and the background context, another key challenge lies in discovering and modeling the entities and the multi-hop semantic relations across documents to understand the semantic relation between the answer and the background context. Merely applying a seq-to-seq model to the document text does not deliver comparable results, as the model performs poorly at capturing the structured relations among the entities through multi-hop reasoning.

In this paper, we propose the multi-hop answer-focused reasoning model to tackle the problem. Specifically, instead of utilizing the unstructured text as the only input, we build an answer-centric entity graph with the extracted different types of semantic relations among the entities across the documents to enable multi-hop reasoning. Inspired by the success of graph convolutional network (GCN) models, we further leverage the relational graph convolutional network (RGCN) (Schlichtkrull et al., 2018) to perform answer-aware multi-hop reasoning by aggregating the different levels of answer-aware contextual entity representations and semantic relations among the entities. Extensive experiments demonstrate that our proposed model outperforms the baselines in terms of various metrics. Our contributions are three-fold:

• To the best of our knowledge, we are the first to propose the multi-hop neural question generation task, asking complex questions from a collection of documents through multi-hop reasoning.

• We propose a multi-hop answer-focused reasoning model to dynamically reason over and aggregate different granularity levels of answer-aware contextual entity representations and semantic relations among the entities in the grounded answer-centric entity graph.

• We conduct extensive experiments to demonstrate that our proposed model outperforms SOTA single-hop QG models and a graph-based multi-hop QG model in terms of the main metrics, downstream multi-hop reading comprehension metrics, and human judgments. Our work offers a new baseline and motivates future research on the task.
2 Multi-hop Answer-focused Reasoning Model

In this section, we present the architecture and each module of the proposed multi-hop answer-focused reasoning model. The overall architecture is shown in Figure 1. Our method adopts a seq-to-seq backbone (Sutskever et al., 2014) incorporating attention and copy mechanisms (Bahdanau et al., 2014; Gulcehre et al., 2016). The model consists of three parts: (i) answer-focused document encoding, (ii) multi-hop answer-centric reasoning, and (iii) an aggregation layer, which together provide an answer-focused and enriched contextual representation.
2.1 Answer-focused Document Encoding

Given the input documents, we represent them as a sequence of words $X = \{x_i\}_{i=1}^{M}$ by concatenating the text words $X_{text}^{i}$ and the title words $X_{title}^{i}$ of each document:

$X = \{X_{text}^{1}, X_{title}^{1}, \ldots, X_{text}^{I}, X_{title}^{I}\}$  (1)

Following Zhou et al. (2017), for each word $x_i$, we obtain its embedding by concatenating its word embedding, answer positional embedding, and feature-enriched embedding (e.g., POS, NER).
Figure 1: Architecture of Answer-focused Multi-hop Reasoning Model.
A one-layer bi-directional LSTM (Hochreiter and Schmidhuber, 1997) is utilized as the encoder to obtain the document representation $H = [h_1, h_2, \ldots, h_M] \in \mathbb{R}^{M \times D}$:

$h_i = \mathrm{LSTM}_{enc}(x_i, h_{i-1})$  (2)
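To make the encoding pipeline concrete, the following is a minimal PyTorch sketch of the feature-enriched embedding and the BiLSTM encoder (Eqs. 1–2). The module name, vocabulary sizes, and dimensions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DocumentEncoder(nn.Module):
    """Feature-enriched embedding + one-layer BiLSTM encoder (Eqs. 1-2)."""
    def __init__(self, vocab_size, pos_size, ner_size,
                 word_dim=300, feat_dim=16, hidden_dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.ans_emb = nn.Embedding(2, feat_dim)         # inside/outside the answer span
        self.pos_emb = nn.Embedding(pos_size, feat_dim)  # POS tag feature
        self.ner_emb = nn.Embedding(ner_size, feat_dim)  # NER tag feature
        # Bidirectional halves concatenate back to hidden_dim.
        self.lstm = nn.LSTM(word_dim + 3 * feat_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, words, ans_tags, pos_tags, ner_tags):
        # Concatenate word, answer-positional, and lexical feature embeddings.
        x = torch.cat([self.word_emb(words), self.ans_emb(ans_tags),
                       self.pos_emb(pos_tags), self.ner_emb(ner_tags)], dim=-1)
        h, _ = self.lstm(x)  # H = [h_1, ..., h_M], shape (batch, M, hidden_dim)
        return h
```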
Gated Self-attention Layer

The above document representation has limited knowledge of the context (Wang et al., 2017). The gated self-attention layer is utilized to learn a contextual document representation $\hat{h}_i$ with a Bi-GRU (Chung et al., 2014):

$\hat{h}_i = \text{Bi-GRU}(\hat{h}_{i-1}, [h_i, o_i])$  (3)

where $o_i$ is the contextual vector obtained by attending to the context:

$d_{ij} = W_d^{T} \tanh(W'_v h_j + W_v h_i)$  (4)
$a_{ik} = \exp(d_{ik}) \,/\, \sum_{j=1}^{M} \exp(d_{ij})$  (5)
$o_i = \sum_{k=1}^{M} a_{ik} h_k$  (6)

where $W_d$, $W_v$, and $W'_v$ are trainable weights of the neural network.
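A minimal PyTorch sketch of the gated self-attention layer (Eqs. 3–6) follows; computing the additive attention scores as a full pairwise matrix and halving the GRU hidden size to preserve the dimension are our assumptions.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Self-matching attention followed by a Bi-GRU fusion (Eqs. 3-6)."""
    def __init__(self, dim):
        super().__init__()
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_v
        self.w_vp = nn.Linear(dim, dim, bias=False)  # W'_v
        self.w_d = nn.Linear(dim, 1, bias=False)     # W_d
        self.gru = nn.GRU(2 * dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, h):  # h: (batch, M, dim)
        # d_ij = W_d^T tanh(W'_v h_j + W_v h_i)  (Eq. 4), for all pairs (i, j)
        d = self.w_d(torch.tanh(self.w_vp(h).unsqueeze(1)
                                + self.w_v(h).unsqueeze(2))).squeeze(-1)
        a = torch.softmax(d, dim=-1)                  # a_ik            (Eq. 5)
        o = torch.bmm(a, h)                           # o_i = sum a_ik h_k (Eq. 6)
        out, _ = self.gru(torch.cat([h, o], dim=-1))  # hat{h}_i        (Eq. 3)
        return out
```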
Answer Gating Mechanism

We further propose the answer gating mechanism to empower the model to learn the answer-focused document representation $H^a = \{\hat{h}_i^a\}_{i=1}^{M}$. Utilizing a gate computed by a sigmoid function to control the information flow, only the answer-related semantic information of the documents is forwarded to the downstream multi-hop reasoning:

$h_i^a = \sigma(a\, W_a\, \hat{h}_i) * \hat{h}_i$  (7)

where the answer vector $a \in \mathbb{R}^{D}$ is the hidden state of the first answer word, and $W_a$ is a trainable parameter of the bilinear function.
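The answer gate of Eq. 7 reduces to a per-position scalar gate; a minimal sketch follows, with the bilinear form implemented as a linear map followed by a batched dot product (an implementation assumption):

```python
import torch
import torch.nn as nn

class AnswerGate(nn.Module):
    """Sigmoid gate sigma(a W_a h_i) that filters answer-irrelevant content (Eq. 7)."""
    def __init__(self, dim):
        super().__init__()
        self.w_a = nn.Linear(dim, dim, bias=False)  # W_a of the bilinear function

    def forward(self, h, answer_vec):
        # h: (batch, M, dim); answer_vec: (batch, dim) = state of the first answer word
        gate = torch.sigmoid(torch.bmm(self.w_a(h), answer_vec.unsqueeze(-1)))
        return gate * h  # only answer-related information is passed downstream
```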
2.2 Multi-hop Answer-centric Reasoning

To explicitly discover and model the multiple entities and their semantic relations across documents, we ground an answer-centric entity graph from the unstructured text. Let the answer-centric entity graph be denoted as $G = \{V, E\}$, where $V$ denotes the entity nodes at different levels and $E$ denotes the edges between nodes, annotated with different semantic relations.

To build the graph, we first exploit the Spacy toolkit (Honnibal and Montani, 2017) to extract the named entities and all coreference words. Then, we identify the exactly matched non-stop words from the documents. We treat these exactly matched non-stop words, the named entities, the answer, and the titles as the nodes of the answer-centric entity graph, representing different granularity levels of the contextual representation: (1) the exactly matched non-stop word and entity nodes encode the word-level and local representation in the specific document context; (2) the title nodes represent the document-level semantics; (3) the answer node offers the answer-aware representation for the graph reasoning and models a global representation across documents.

We then define edges between nodes by leveraging different types of semantics within the documents, following the heuristics below (a construction sketch is given after Figure 2):

(1) We connect all exactly matched named entities, regardless of whether they appear in the same document or in different documents (e.g., "Ewan MacColl").
(2) We connect all inter-document and intra-document exactly matched non-stop words (e.g., "singer", "songwriter").
(3) All coreference words are linked to each other.
(4) We further connect the title node with all entity nodes within the same document.
(5) We add dense connections between all title nodes.
(6) The answer node is connected to all other nodes in the graph, resulting in an answer-centric entity graph.

An example graph built from the documents of the example in Table 1 is shown in Figure 2, representing different granularity levels of the semantic information with various nodes and edges.

Figure 2: Diagram of an answer-centric entity graph example $G = \{V, E\}$ built on the documents in Table 1. The text in ovals and the solid lines in different colors indicate the different semantic types of the nodes $V$ and the edges $E$, respectively. The answer node is connected with all other nodes in the graph; these edges are not drawn for conciseness.
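As noted above, here is a construction sketch of heuristics (1), (2), (4), (5), and (6) using spaCy NER and networkx. Coreference edges (heuristic 3) are omitted because they require an additional coreference pipeline, and the model name and node encoding are illustrative assumptions.

```python
import itertools
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")  # illustrative model choice

def build_answer_centric_graph(texts, titles, answer):
    g = nx.Graph()
    surface = {}  # lowercased surface form -> node ids sharing it
    for i, (title, text) in enumerate(zip(titles, texts)):
        g.add_node(("title", i), text=title)
        doc = nlp(text)
        spans = [e.text for e in doc.ents]  # named entities
        spans += [t.text for t in doc if not (t.is_stop or t.is_punct)]  # non-stop words
        for s in set(spans):
            node = ("ent", i, s)
            g.add_node(node, text=s)
            g.add_edge(("title", i), node, rel="title-entity")  # heuristic (4)
            surface.setdefault(s.lower(), []).append(node)
    # heuristics (1)-(2): connect all intra-/inter-document exact matches
    for nodes in surface.values():
        for u, v in itertools.combinations(nodes, 2):
            g.add_edge(u, v, rel="exact-match")
    # heuristic (5): dense connections among title nodes
    for i, j in itertools.combinations(range(len(titles)), 2):
        g.add_edge(("title", i), ("title", j), rel="title-title")
    # heuristic (6): the answer node is connected to every other node
    g.add_node("answer", text=answer)
    for n in [n for n in g.nodes if n != "answer"]:
        g.add_edge("answer", n, rel="answer")
    return g
```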
Multi-hop Reasoning with RGCN

To make use of the grounded answer-centric entity graph, we leverage a GNN-based model to conduct the multi-hop reasoning. In general, with different message passing strategies, graph neural network-based models update a node representation based on its first-order neighbors. Specifically, we employ the RGCN for the multi-hop reasoning (Schlichtkrull et al., 2018). We first initialize the representation of node $v_i \in V$ with the output of the answer gating mechanism, $v_i = h_j^a$, or $v_i = \mathrm{average}(h_j^a, h_{j+1}^a, \ldots, h_k^a)$ if the entity node contains multiple words. Meanwhile, the edges are annotated with one-hot vectors indicating the different semantic relations. In each layer $1 \le \ell \le L$, the representation of node $i$ is updated by the summation of the transformation of its own representation and the transformation of its neighbors:

$v_i^{(\ell+1)} = \sigma\big(\sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \frac{1}{c_{i,r}} W_r^{(\ell)} v_j^{(\ell)} + W_0^{(\ell)} v_i^{(\ell)}\big)$  (8)

where $W_r^{(\ell)}$ are the relation-specific trainable weights. The number of parameters is further decreased by expressing each $W_r^{(\ell)}$ as a linear combination of basis weights $W_b^{(\ell)} \in \mathbb{R}^{d^{(\ell+1)} \times d^{(\ell)}}$ with relation-specific coefficients $a_{rb}^{(\ell)}$:

$W_r^{(\ell)} = \sum_{b=1}^{B} a_{rb}^{(\ell)} W_b^{(\ell)}$  (9)

After $L$ layers of reasoning, at most $L$-hop relations can be captured.

Aggregation Layer

Inspired by Peters et al. (2018), the final answer-aware contextual representation is computed by selectively aggregating the output of each RGCN layer and the answer-aware document representation with trainable layer-wise weights. Similarly, the answer node representation of each layer and the last hidden state of the LSTM are stacked together to produce a more accurate document-level and global representation:

$H_G = W_c([V^{1}, V^{2}, \ldots, V^{L}, H^{a}])$  (10)
$z = W_g([v_a^{1}, v_a^{2}, \ldots, v_a^{L}, h_M^{a}])$  (11)

where $V^{\ell} = [v_1, v_2, \ldots, v_M]$ are the node representations of the $\ell$-th layer, $v_a^{\ell}$ is the answer node representation of the $\ell$-th layer, and $W_c$ and $W_g$ are the layer-wise trainable weights. By doing so, the different granularities of contextual representations expressing various types of semantics are aggregated to produce the final entity-level representation $H_G \in \mathbb{R}^{M \times D}$ and document-level representation $z \in \mathbb{R}^{D}$ for the decoder.
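The following is a minimal PyTorch sketch of one RGCN layer with basis decomposition (Eqs. 8–9), using a dense per-relation adjacency tensor that is assumed to already carry the $1/c_{i,r}$ normalization; ReLU as the nonlinearity is also an assumption.

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    """One layer of relational message passing with basis decomposition (Eqs. 8-9)."""
    def __init__(self, dim, num_rels, num_bases):
        super().__init__()
        self.w_basis = nn.Parameter(torch.empty(num_bases, dim, dim))  # W_b
        self.coef = nn.Parameter(torch.empty(num_rels, num_bases))     # a_rb
        self.w_self = nn.Linear(dim, dim, bias=False)                  # W_0
        nn.init.xavier_uniform_(self.w_basis)
        nn.init.xavier_uniform_(self.coef)

    def forward(self, v, adj):
        # v: (N, dim) node states; adj: (num_rels, N, N), row-normalized per relation
        w_rel = torch.einsum("rb,bij->rij", self.coef, self.w_basis)  # W_r (Eq. 9)
        # sum over relations r and neighbors j of adj[r, i, j] * (v_j W_r)  (Eq. 8)
        msg = torch.einsum("rij,jk,rkl->il", adj, v, w_rel)
        return torch.relu(msg + self.w_self(v))
```

Stacking L such layers and keeping each layer's output (together with $H^a$) gives the inputs to the aggregation of Eqs. 10–11.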
2.3 Decoder

With a hidden state initialized to $s_0 = z$, a uni-directional LSTM is utilized as the decoder to generate the question, where the current hidden state is updated given the previously generated word and the previous hidden state:

$s_t = \mathrm{LSTM}_{dec}([w_{t-1}; c_{t-1}], s_{t-1})$  (12)

where the context vector $c_t$ is computed with the attention mechanism (Bahdanau et al., 2014) by attending to the encoder hidden states:

$e_t = H_G W_e s_t$  (13)
$\alpha_t = \mathrm{Softmax}(e_t)$  (14)
$c_t = H_G^{T} \alpha_t$  (15)

To address the out-of-vocabulary issue, we also exploit the copy mechanism to steer the model to copy a word from the input (See et al., 2017; Gulcehre et al., 2016). Specifically, at each decoding step, a probability is computed that decides whether to copy words from the input documents based on the attention matrix or to generate a word from the vocabulary via an output layer with a softmax function:

$g_{copy} = \sigma(W_c s_t + U_c c_t + b_c)$  (16)
$p_{generate}(y_t) = \mathrm{Softmax}(f(s_t, c_t))$  (17)

Finally, treating the copy probability as the attention weights (i.e., $p_{copy} = \alpha_t$), the final word distribution is the summation of the probability of generating a word from the vocabulary and the probability of copying a word from the input:

$p_{final}(y_t \mid y_{<t}) = g_{copy}\, p_{copy}(y_t) + (1 - g_{copy})\, p_{generate}(y_t)$  (18)
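A minimal sketch of the copy/generate mixture in Eqs. 16–18: the encoder attention weights are projected onto the vocabulary and mixed with the generation distribution. Handling of out-of-vocabulary source words via an extended vocabulary is omitted here, and the function signature is an illustrative assumption.

```python
import torch

def final_word_distribution(p_generate, attn, src_ids, g_copy, vocab_size):
    """Mix copy and generation distributions (Eq. 18).

    p_generate: (batch, V) softmax over the vocabulary (Eq. 17)
    attn:       (batch, M) encoder attention weights, reused as p_copy
    src_ids:    (batch, M) vocabulary ids of the source tokens
    g_copy:     (batch, 1) copy gate from Eq. 16
    """
    p_copy = torch.zeros(attn.size(0), vocab_size, device=attn.device)
    p_copy.scatter_add_(1, src_ids, attn)  # project attention mass onto vocab ids
    return g_copy * p_copy + (1.0 - g_copy) * p_generate
```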
3 Experiments

Dataset

The HOTPOTQA dataset is an accessible dataset collected from Wikipedia articles for the multi-hop reading comprehension task (Yang et al., 2018). We discard the questions of the "comparison" type, and we only collect the text labeled as "supporting facts" in each set of documents. Lacking access to the original test set, we combine the training set and the development set and randomly split them into a training set, a development set, and a test set of 68,758, 4,992, and 4,991 samples, respectively.

Baselines

In the experiments, we compare the performance of our proposed model with several baselines:

• NQG++ (Zhou et al., 2017): a commonly used baseline for the single-hop neural question generation task. The concatenated document text is passed into the seq-to-seq model with answer positional embeddings and enriched lexical features (e.g., named entity, POS tag, and case). Attention and copy mechanisms are adopted in the decoder.

• Pointer-generator (PG) (See et al., 2017): originally proposed for the text summarization task and revised here to solve the question generation problem; its copy mechanism is realized differently. We also add the enriched lexical features in the embedding layer, as in NQG++.

• Sentence-level Semantic Matching and Answer Position Inferring (SM-API) (Ma et al., 2019): a state-of-the-art model for the single-hop neural question generation task. It introduces two modules, sentence-level semantic matching and answer position inferring, trained jointly with the seq-to-seq model to ask questions containing the right question words, keywords, and answer-aware semantics.

• PG + GAT: the graph attention network (Veličković et al., 2017) updates a node representation by attending to the representations of its neighbors. A straightforward way to perform multi-hop reasoning is to apply three layers of GAT on the built answer-centric entity graph described in Section 2.2.

Main Metrics

We evaluate model performance in terms of BLEU-1 to BLEU-4 (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), and ROUGE-L (Lin, 2004) on the HOTPOTQA dataset in Table 2.

Table 2: Comparison of model performance in terms of the main metrics on the HOTPOTQA dataset.

Models | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L
NQG++ (Zhou et al., 2017) | 44.55 | 33.18 | 26.57 | 21.99 | 24.35 | 41.08
PG (See et al., 2017) | 46.13 | 35.14 | 28.71 | 24.12 | 24.14 | 42.18
SM-API (Ma et al., 2019) | 46.95 | 35.76 | 29.02 | 24.34 | 24.30 | 42.32
PG + GAT | 47.35 | 36.10 | 29.85 | 24.98 | 24.56 | 42.62
Proposed | 50.93 | 38.93 | 31.78 | 26.70 | 25.40 | 43.88

The SM-API model improves over the PG model by only 0.22 on the BLEU-4 score and does not show a considerable advantage on the multi-hop question generation task. This is in part because the answer position inferring module, designed explicitly for single-hop answer position prediction, does not offer an accurate supervision signal for model training on the multi-hop dataset. On the other hand, the dataset does not include samples where questions about different answers are asked given the same context, which limits the power of the sentence-level semantic matching module.

Stacking several layers of GAT directly on the LSTM encoder improves the performance by leveraging the answer-centric entity graph for multi-hop reasoning; nevertheless, the different semantic relations and the answer-focused entity representations are ignored during the multi-hop reasoning.

Our proposed multi-hop answer-focused reasoning model achieves much higher scores than the baselines, as it leverages different granularity levels of answer-aware contextual entity representations and semantic relations among the entities in the grounded answer-centric entity graph, producing precise and enriched semantics for the decoder.

Downstream Task Metrics

The main metrics have limitations, as they only show that the proposed model can generate questions similar to the reference ones. We therefore further evaluate the generated questions on the downstream multi-hop machine comprehension task. Specifically, we choose a well-trained DecompRC (Min et al., 2019), a state-of-the-art model for the multi-hop machine comprehension problem on the same HOTPOTQA dataset, to conduct the experiment. In general, DecompRC decomposes complex questions requiring multi-hop reasoning into a series of simple questions that can be answered with single-hop reasoning. Intuitively, the performance of DecompRC on the different generated questions reflects both the quality of the generated questions and the multi-hop reasoning ability of the models.

Table 3: Performance of the DecompRC model on the downstream machine comprehension task in terms of EM and F1 scores.

Questions | EM (%) | F1 (%)
Reference Questions | 71.84 | 83.73
NQG++ (Zhou et al., 2017) | 65.82 | 76.97
PG (See et al., 2017) | 66.70 | 78.03
SM-API (Ma et al., 2019) | 67.01 | 78.43
PG + GAT | 67.23 | 79.01
Proposed | 69.92 | 81.25

We report the Exact Match (EM) and F1 scores achieved by the DecompRC model in Table 3, given the reference questions and the different model-generated questions. The human-written reference questions yield the best performance, and the DecompRC model achieves much higher EM and F1 scores on the questions generated by our proposed model than on those generated by the baseline models.

Analysis of Answer-focused Multi-hop Reasoning

The answer-focused multi-hop reasoning model is designed to discover and capture the entities relevant to the answer by utilizing the various types of semantic relations among them. We analyze this effect by measuring the named entities in the generated questions in terms of precision and recall, similar to Sun et al. (2018). Quantitatively, given a generated question $G = \{g_i\}_{i=1}^{N}$ and its reference question $R = \{r_i\}_{i=1}^{N}$, we define:

$\mathrm{Precision} = \#\mathrm{NE}(G \cap R) \,/\, \#\mathrm{NE}(G)$  (19)
$\mathrm{Recall} = \#\mathrm{NE}(G \cap R) \,/\, \#\mathrm{NE}(R)$  (20)

where $\#\mathrm{NE}(\cdot)$ denotes the number of named entities, so the numerator counts the named entities appearing in both the generated and the reference question.
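Under these definitions, the metric reduces to a set overlap of named entities; a minimal sketch using spaCy NER with exact lowercased surface matching (the matching rule is our assumption):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # illustrative model choice

def ne_precision_recall(generated, reference):
    """Named-entity precision/recall of a generated question (Eqs. 19-20)."""
    gen_nes = {e.text.lower() for e in nlp(generated).ents}
    ref_nes = {e.text.lower() for e in nlp(reference).ents}
    overlap = len(gen_nes & ref_nes)
    precision = overlap / len(gen_nes) if gen_nes else 0.0
    recall = overlap / len(ref_nes) if ref_nes else 0.0
    return precision, recall
```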
As reported in Table 4, our proposed model outperforms the baselines, indicating that it can generate questions involving more answer-aware entities by leveraging the answer-focused multi-hop reasoning.

Table 4: Comparison of the precision and recall of named entities in different model-generated questions.

Models | Precision | Recall
NQG++ | 46.59 | 52.66
PG | 46.64 | 52.45
SM-API | 46.82 | 53.10
PG + GAT | 47.30 | 53.44
Proposed | 49.29 | 54.64

Human Evaluation

We further examine 100 generated questions with a human evaluation, scoring them on a scale from 1 to 5 for semantic relatedness, fluency, and complexity. Semantic relatedness measures how well a generated question matches the documents and the answer. Fluency reflects the naturalness of the generated questions, and complexity measures whether the generated questions are complicated and involve multiple entities.

Table 5: Human evaluation of the graph-based models and the baseline models.

Models | Semantic Relatedness | Fluency | Complexity
NQG++ | 2.86 | 3.22 | 3.06
PG | 3.03 | 3.31 | 3.02
SM-API | 3.06 | 3.21 | 3.20
PG + GAT | 3.11 | 3.29 | 3.43
Proposed | 3.20 | 3.34 | 3.71

As reported in Table 5, by leveraging the answer-focused multi-hop reasoning, the questions generated by our approach are more complex and more semantically relevant to the context and the answer than those of the baselines.

Case Study

Table 6 shows question samples generated by the models. PG and SM-API fail to discover or capture the entities and their semantic relation in Document 1 (e.g., "Muriel Humphrey married to Hubert Humphrey") and ask a question about "Hubert Humphrey served as the 38th Vice President of the United States" by focusing only on the semantics of Document 2. In contrast, utilizing the grounded entity graph, the GAT-based model generates a more complex question involving the information "Muriel Humphrey married to Hubert Humphrey". Furthermore, by leveraging different granularity levels of the semantic relations among the entities with the answer-focused multi-hop reasoning, the question generated by our model is not only more complex, involving more semantics (e.g., "Muriel Humphrey married to Hubert Humphrey" and "Muriel Humphrey served as the Second Lady of the United States and as a U.S. Senator from Minnesota"), but also more relevant to the answer than those of the other models.

Table 6: Case study showing the benefit of leveraging structured graph data with linguistic relations.

Document 1: [Muriel Humphrey Brown] Muriel Fay Buck Humphrey Brown (February 20, 1912 – September 20, 1998) was an American politician who served as the Second Lady of the United States and as a U.S. Senator from Minnesota. Married to the 38th Vice President of the United States, Hubert Humphrey.
Document 2: [Hubert Humphrey] Hubert Horatio Humphrey Jr. (May 27, 1911 – January 13, 1978) was an American politician who served as the 38th Vice President of the United States from 1965 to 1969.
Reference: who is the minnesota senator that was married to muriel humphrey and served as the 38th vice president of the united states?
PG: who was an american politician who served as the 38th vice president of the united states from 1965 to 1969?
SM-API: who served as the 38th vice president of the united states from 1965 to 1969?
PG+GAT: who married to Hubert Humphrey who served as the 38th vice president of the united states from 1965 to 1969?
Proposed: muriel humphrey brown was an american politician who served as the second lady of the united states and as a u.s. senator from minnesota married to which american politician who served as vice president of the united states from 1965 to 1969?

Implementation Details

We employ the Spacy toolkit (Honnibal and Montani, 2017) for tokenization, NER and POS tagging, and coreference resolution. We use 300-dim pre-trained GloVe vectors as the word embeddings. Following NQG++ (Zhou et al., 2017), we concatenate the word embedding with 16-dim answer positional features and 16-dim linguistic feature embeddings, including the case, NER, and POS tag features. We train the model with the Adam optimizer (Kingma and Ba, 2014) on an NVIDIA V100 GPU, and we halve the initial learning rate when the BLEU-4 score does not improve on the dev dataset. We employ beam search during inference.

4 Related Work

Single-relation Question Generation

Existing work on the question generation task can be classified into two categories: rule-based methods and neural network-based methods.
Rule-based approaches mainly adopt human-designed linguistic templates or rules, and they are difficult, time-consuming, and expensive to scale up. Meanwhile, the rigid templates also limit the diversity of the generated questions (Mazidi and Nielsen, 2014; Labutov et al., 2015). Recently, a series of neural network-based models has been proposed to solve the problem, as they show a flexible ability to understand and generate natural language, outperforming the rigid rule-based approaches. Du et al. (2017) first propose the question generation task of asking a free question given the context. Zhou et al. (2017) then propose to ask answer-relevant questions given the answer and incorporate linguistic feature embeddings into the model. Sun et al. (2018) further improve the performance by utilizing an additional vocabulary for question word generation and employing a relative answer positional embedding. Ma et al. (2019) propose to train two general modules jointly with the seq-to-seq model for generating the right keywords and question words and copying the answer-relevant words.

However, existing models mainly focus on generating questions from a single-relation context. Different from previous work, we propose a new challenging task of asking complex questions from a collection of documents, which requires the model to discover and reason over the entities and the semantic relations among them.

Graph Neural Networks on NLP Tasks

Leveraging GNN-based models for NLP tasks has gained huge popularity recently. GNN-based models are mainly adopted to capture the semantic and syntactic information of natural language text. Zhang et al. (2018) employ the GCN model to tackle relation extraction on the dependency tree. A recurrent graph-based model is proposed to solve the bAbI task given graph-structured input (Li et al., 2015). Liu et al. (2019) apply the GCN model on the dependency tree parsed from the input sentence to predict clue words for asking questions.

Multi-hop Reasoning

Several works have been proposed to realize multi-hop reasoning for question answering given multiple documents.
Yoon et al. (2019) apply a graph neural network on a structured graph built with sentence, document, and query nodes to classify the supporting facts used for answering the query. Min et al. (2019) realize multi-hop reasoning by decomposing the multi-hop query into single-hop queries. To model the different levels of semantics, Tu et al. (2019) build a heterogeneous graph consisting of entities, documents, and candidates as nodes, which inspires our idea of the answer-centric entity graph.

Different from existing models, our proposed model remains answer-focused throughout the multi-hop reasoning, and the proposed multi-hop answer-focused reasoning with RGCN facilitates modeling the different levels of semantic information.

5 Conclusion

In this paper, we proposed a new task that asks complex questions given a collection of documents and the corresponding answer by discovering and modeling the multiple entities and their semantic relations across the documents. To solve the problem, we propose answer-focused multi-hop reasoning that leverages different granularity levels of semantic information in the answer-centric entity graph built from the natural language text. Extensive experimental results demonstrate the superiority of our proposed model in terms of automatically computed metrics and human evaluation. Our work provides a baseline for the new task and sheds light on future work in the multi-hop question generation scenario. In the future, we would like to investigate whether commonsense knowledge can be incorporated during the multi-hop reasoning to ask reasonable questions.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380.

Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. arXiv preprint arXiv:1705.00106.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. arXiv preprint arXiv:1603.08148.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Igor Labutov, Sumit Basu, and Lucy Vanderwende. 2015. Deep questions without deep understanding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, volume 1, pages 889–898.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Bang Liu, Mingjun Zhao, Di Niu, Kunfeng Lai, Yancheng He, Haojie Wei, and Yu Xu. 2019. Learning to generate questions by learning what not to generate. In The World Wide Web Conference, pages 1106–1118.

Xiyao Ma, Qile Zhu, Yanlin Zhou, Xiaolin Li, and Dapeng Wu. 2019. Improving question generation with sentence-level semantic matching and answer position inferring. arXiv preprint arXiv:1912.00879.

Karen Mazidi and Rodney D. Nielsen. 2014. Linguistic considerations in automatic question generation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 321–326.

Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Multi-hop reading comprehension through question decomposition and rescoring. arXiv preprint arXiv:1906.02916.

Liangming Pan, Yuxi Xie, Yansong Feng, Tat-Seng Chua, and Min-Yen Kan. 2020. Semantic graphs for generating deep questions. arXiv preprint arXiv:2004.12704.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.

Xingwu Sun, Jing Liu, Yajuan Lyu, Wei He, Yanjun Ma, and Shi Wang. 2018. Answer-focused and position-aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3930–3939.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ming Tu, Guangtao Wang, Jing Huang, Yun Tang, Xiaodong He, and Bowen Zhou. 2019. Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. arXiv preprint arXiv:1905.07374.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189–198.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. 2019. Propagate-selector: Detecting supporting sentences for question answering via graph neural networks. arXiv preprint arXiv:1908.09137.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. arXiv preprint arXiv:1809.10185.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pages 662–671. Springer.