OneStop QAMaker: Extract Question-Answer Pairs from Text in a One-Stop Approach
Shaobo Cui, Xintong Bao, Xinxing Zu, Yangyang Guo, Zhongzhou Zhao, Ji Zhang, Haiqing Chen
Shaobo Cui, @alibaba-inc.com, DAMO Academy, Alibaba Group
Xintong Bao, @alibaba-inc.com, DAMO Academy, Alibaba Group
Xinxing Zu, @alibaba-inc.com, DAMO Academy, Alibaba Group
Yangyang Guo [email protected] University
Zhongzhou Zhao, @alibaba-inc.com, DAMO Academy, Alibaba Group
Ji Zhang, @alibaba-inc.com, DAMO Academy, Alibaba Group
Haiqing Chen, @alibaba-inc.com, DAMO Academy, Alibaba Group
ABSTRACT
Large-scale question-answer (QA) pairs are critical for advancing research areas like machine reading comprehension and question answering. Constructing QA pairs from documents requires determining how to ask a question and what the corresponding answer is. Existing methods for QA pair generation usually follow a pipeline approach: they first choose the most likely candidate answer span and then generate the answer-specific question. This pipeline approach, however, is undesirable for mining the most appropriate QA pairs from documents, since it ignores the connection between question generation and answer extraction, which may lead to incompatible QA pair generation, i.e., the selected answer span is inappropriate for question generation. Human annotators, in contrast, take the whole QA pair into account and consider the compatibility between question and answer. Inspired by this observation, instead of the conventional pipeline approach, we propose a model named OneStop to generate QA pairs from documents in a one-stop approach: questions and their corresponding answer spans are extracted simultaneously, and the processes of question generation and answer extraction mutually affect each other. Additionally, OneStop is much more efficient to train and deploy in industrial scenarios, since it involves only one model to solve the complex QA pair generation task. We conduct comprehensive experiments on three large-scale machine reading comprehension datasets: SQuAD, NewsQA, and DuReader. The experimental results demonstrate that our OneStop model significantly outperforms the baselines regarding the quality of generated questions, the quality of generated question-answer pairs, and model efficiency.
CCS CONCEPTS
• Information systems → Question answering; • Computing methodologies → Natural language generation.

KEYWORDS
Question generation, Question-Answer pair generation, OneStop approach, Multi-task learning
INTRODUCTION

Many tasks in the natural language processing community, such as machine reading comprehension and question answering [12, 16, 28], rely heavily on large amounts of human-labeled question-answer (QA) pairs. However, manually annotating QA pairs [2, 16, 28] is both costly and time-consuming. Recently, how to automatically extract QA pairs from documents has therefore attracted increasing attention.

The task of QA pair extraction from a document $d$ is to extract the most related QA pair: $\arg\max_{q,a} P(q, a \mid d)$. Most existing works [1, 7, 29, 36, 38] adopt a pipeline approach, which first selects candidate answer spans from the document, $\arg\max_{a} P(a \mid d)$, and then generates the answer-specific questions, $\arg\max_{q} P(q \mid d, a)$. We present a simplified view of the pipeline approach in Figure 1. This type of pipeline approach, however, suffers from two major drawbacks. Firstly, there is no explicit correlation between the question generation and the answer extraction process: the question generation model and the answer extraction model are isolated during training. This isolation leads the extracted QA pairs to be incompatible: the question generation model may generate questions whose answers are hard to find in the document by the answer extraction model (see Example 1 in Table 1), or the answer extraction model may extract answer spans that are not suitable for question generation (see Example 2 in Table 1). This incompatibility can also be explained by Figure 1: the two separate steps $\{\arg\max_{a} P(a \mid d), \arg\max_{q} P(q \mid d, a)\}$ are not an accurate approximation of $\arg\max_{q,a} P(q, a \mid d)$. Secondly, this type of pipeline method is knotty and time-consuming to train and deploy in industrial online applications, since it involves at least two models and the cumulative error along the pipeline is large.

Table 1: Instances to illustrate QA incompatibility.

Example 1
Approach: First generate a question and then find the generated question's answer span in the document.
Incompatibility type: The generated question's answer is hard to find in the document by the answer extraction model.
Document: The delta is delimited in the West by the Alter Rhein ("Old Rhine") and in the East by a modern canalized section.
Question: What does delta look like?
Answer: –

Example 2
Approach: First predict a candidate answer span and then generate an answer-aware question.
Incompatibility type: The predicted answer span is too detailed, incorrect, or unsuitable for answer-aware question generation.
Document: The French crown's refusal to allow non-Catholics to settle in New France may help to explain that colony's slow rate of population growth compared to that of the neighbouring British colonies, which opened settlement to religious dissenters.
Answer span: may help to
Question: –

Figure 1: A simplified view of the pipeline approach to QA pair generation. The QA pair under consideration is denoted as a black dot. In the pipeline settings, the expected QA pair is first pushed in the direction of maximizing $P(a \mid d)$ and then in the conditional direction of maximizing $P(q \mid d, a)$. The OneStop approach, however, optimizes in the direction of maximizing $P(q, a \mid d)$ straightforwardly.

Unlike the aforementioned pipeline approach, human annotators usually take the whole QA pair into consideration and pay close attention to the compatibility between the extracted answer and the generated question. More specifically, from the human annotators' perspective, a question that is unlikely to be answered by referring to the given document should not be generated in the question generation process. Similarly, an answer whose corresponding question is inferior or unsuitable for question generation should be given less attention in the answer extraction process. In a nutshell, human annotators consider the compatibility and overall quality of QA pairs. Inspired by the limitations of existing pipeline methods and this observation about QA compatibility, we integrate question generation and answer extraction into a unified framework to enhance the compatibility of the generated question and the extracted answer. We propose OneStop, an architecture that can be easily adapted from existing pre-trained language models to extract QA pairs from documents in a one-stop approach. The OneStop model takes documents as input and outputs questions $q$ and the questions' corresponding answer spans $a$. The answer extraction module and the question generation module in the OneStop model collaborate to find the most compatible QA pairs.
Specifically, our OneStop model tackles the objective $\arg\max_{q,a} P(q, a \mid d)$ directly, instead of the decomposed objectives $\{\arg\max_{a} P(a \mid d), \arg\max_{q} P(q \mid d, a)\}$. The two tasks in our OneStop model mutually affect each other: (1) the answer span extraction task pushes the question generation model to generate more answerable questions, since it is hard to extract the answer span of an unanswerable question; (2) the question generation task can further enhance the answer extraction model by providing the probability of generating a question. Specifically, the answer extraction model places more attention on questions favored by the question generation model, i.e., questions whose $P(q \mid d)$ is large. Additionally, by combining the question generation model and the answer extraction model into one single model, our OneStop model is much lighter than the existing pipeline approaches that involve at least two models.

As for the model structure, the OneStop model adopts the conventional transformer-based sequence-to-sequence structure and can be easily built upon pre-trained models such as BART [21], T5 [27], ProphetNet [26], and so on. The training objective of the OneStop model is to generate a suitable question and predict the right answer span for this question simultaneously. To verify the effectiveness of our OneStop model, we conduct experiments on three large-scale datasets: SQuAD [28], NewsQA [34], and DuReader [10]. We compare the involved baselines in terms of the quality of generated questions, the quality of QA pairs, and model efficiency. Experimental results show that our OneStop model achieves state-of-the-art performance in a more efficient way.

The contributions of this paper are summarized as follows:
(1) We propose a unified framework in which the answer extraction module and the question generation module mutually enhance each other.
(2) To the best of our knowledge, OneStop is the first transformer-based model for generating more compatible QA pairs from documents in a one-stop approach.
(3) OneStop can be easily built upon existing pre-trained language models. Compared with previous pipeline approaches, our OneStop model is much more efficient to train and deploy in industrial scenarios and requires much less human effort.
(4) We conduct comprehensive experiments on three large-scale datasets to evaluate our OneStop model in terms of question generation, QA pair generation, and model efficiency.

RELATED WORK

Question Generation.
Question generation [3, 17, 24, 31, 32, 39, 40] is a well-studied natural language processing task. There are mainly two types of approaches to question generation: template-based and model-based. Methods [11, 19] in the first category rely on human effort to design template rules and are thus unscalable across datasets. In contrast, the model-based methods [39, 40] employ an end-to-end neural network to generate questions, which takes selected key phrases and documents as inputs. However, these methods are limited because questions cannot be generated from documents directly: an additional entity extraction model or sequence labeling model [31, 36] is required to determine which part of the document is worth being asked about. As a result, this kind of method is less practical for question generation due to the following two facts: (1) the key phrase extraction model demands additional manual labor and elaborate tuning; (2) the most question-worthy phrases in a document are difficult to identify.
Question-Answer Pair Generation.
Most existing works [1, 8, 15, 18, 20, 23] focusing on QA pair generation follow a pipeline fashion: (1) determine which points in the document should be asked about; (2) learn to ask based on the selected points; (3) detect the answer span of the question in the document. Du and Cardie first detected the question-worthy answer (which they dubbed answer span identification) and then generated the answer-aware question. Similarly, Golub et al. proposed a two-stage SynNet for QA pair generation, which consists of an answer tagging module and a question synthesis module. Alberti et al. proposed to generate QA pairs with models of question generation and answer extraction and then filter the results with roundtrip consistency.
Joint Models for Question Generation and Question Answering.
There have been studies [5, 30, 33, 37] focusing on solving question generation and question answering together. In these methods, the input and output of question generation and question answering are inverse, which makes them dual tasks; question generation and question answering are thus implemented with separate models connected by their duality. However, the training objective of question answering poses an adverse effect on the performance of the question generation model due to the enforcement of the dual constraint. Our work differs from these works [5, 30, 33, 37], which focus on the duality of question generation and question answering. Firstly, for QA pair extraction from documents, there is no explicit duality between question generation and answer extraction, so the duality between these two tasks no longer exists. Secondly, question generation and answer extraction in OneStop are optimized in a multi-task learning approach: they are optimized simultaneously to find a compatible and optimal solution for QA pair generation.
PROBLEM FORMULATION

Given a document, the objective of QA pair generation is to find the most related QA pairs. Mathematically:

$$\bar{q}, \bar{a} = \arg\max_{q, a} P(q, a \mid d), \quad (1)$$

where the document $d$ is a sequence of utterances, the answer $a$ should be a sub-span of the document, and the question $q$ is an utterance closely associated with $a$. Based on this formulation, existing methods can be classified into the following two groups.

Figure 2: The comparison of D2A2Q, D2Q2A, and OneStop. (Panel (a) D2A2Q: document, candidate answer detection, then answer-aware question generation. Panel (b) D2Q2A: document, question generation, then machine reading comprehension. Panel (c) OneStop: document to question and answer in one model.)

(1)
D2A2Q: The candidate answer is first extracted from the document, $P(a \mid d)$, after which the answer-specific question is generated based on the document and the extracted candidate answer, $P(q \mid d, a)$. It can be summarized as:

$$\arg\max_{q,a} P(q, a \mid d) \approx \begin{cases} \arg\max_{a} P(a \mid d; \theta_{\text{d2a}}), & \text{Step I} \\ \arg\max_{q} P(q \mid d, a; \theta_{\text{da2q}}), & \text{Step II} \end{cases} \quad (2)$$

where $\theta_{\text{d2a}}$ and $\theta_{\text{da2q}}$ are the parameters of the candidate answer extraction model and the answer-specific question generation model, respectively.

(2) D2Q2A: It first generates the question that is most likely to be asked from the document, i.e., $P(q \mid d; \theta_{\text{d2q}})$, and then the generated question is utilized to extract its corresponding answer span from the document. Similarly, the D2Q2A approach can be summarized as:

$$\arg\max_{q,a} P(q, a \mid d) \approx \begin{cases} \arg\max_{q} P(q \mid d; \theta_{\text{d2q}}), & \text{Step I} \\ \arg\max_{a} P(a \mid d, q; \theta_{\text{dq2a}}), & \text{Step II} \end{cases} \quad (3)$$

where $\theta_{\text{d2q}}$ and $\theta_{\text{dq2a}}$ are the parameters of the question generation model and the answer extraction (machine reading comprehension) model, respectively.

The aforementioned pipeline approaches, D2A2Q and D2Q2A, are both quite rough approximations to the original objective $\arg\max_{q,a} P(q, a \mid d)$. The cumulative error is magnified along these pipelines. Additionally, the training cost and inference efficiency are unfavorable. Motivated by these limits, we propose the OneStop model, which models the objective much more precisely. The OneStop framework can be formulated as:

$$\arg\max_{q,a} P(q, a \mid d) = \arg\max_{q,a} P(q \mid d; \theta) \cdot P(a \mid d, q; \theta), \quad (4)$$

where $\theta$ denotes the parameters of the OneStop model. The answer extraction module and the question generation module in the OneStop model share the model parameters $\theta$, which means these two tasks influence each other. As can be observed, our OneStop model is easier to train and more efficient at inference, since it involves only one model. We present the comparison of the three approaches in Figure 2. As can be seen, both D2A2Q and D2Q2A are pipeline approaches, whereas our OneStop model tackles the original QA pair generation objective directly.

THE ONESTOP MODEL

In this section, we first present the overview of the OneStop model in Section 4.2. Section 4.3 and Section 4.4 describe the question generation and answer span extraction modules of the OneStop model, respectively, after which we end this section with the training and inference of the OneStop model in Section 4.5.
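The gap between the pipeline decomposition and the joint objective can be made concrete with a toy example. The probabilities below are hypothetical, purely for illustration: a two-step argmax commits to the locally best question and can miss the jointly optimal QA pair.

```python
# P(q|d) for two candidate questions, and P(a|q,d) for two candidate answers
# (hypothetical numbers, not from the paper).
p_q = {"q1": 0.6, "q2": 0.4}
p_a_given_q = {
    "q1": {"a1": 0.3, "a2": 0.2},
    "q2": {"a1": 0.9, "a2": 0.1},
}

# Pipeline (D2Q2A-style): pick the best question first, then its best answer.
q_pipe = max(p_q, key=p_q.get)
a_pipe = max(p_a_given_q[q_pipe], key=p_a_given_q[q_pipe].get)
score_pipe = p_q[q_pipe] * p_a_given_q[q_pipe][a_pipe]   # 0.6 * 0.3 = 0.18

# Joint (OneStop-style) objective: maximize P(q, a | d) directly.
joint = {(q, a): p_q[q] * p_a_given_q[q][a]
         for q in p_q for a in p_a_given_q[q]}
(q_joint, a_joint), score_joint = max(joint.items(), key=lambda kv: kv[1])
# ("q2", "a1") with score 0.4 * 0.9 = 0.36, beating the pipeline's 0.18.
```

The pipeline greedily keeps "q1" because it has the higher marginal probability, even though "q2" admits a far more confident answer; the joint objective scores every (question, answer) pair and finds the better combination.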
Note that for D2Q2A, the final outputs along the pipeline, $\arg\max_{q} P(q \mid d)$ and $\arg\max_{a} P(a \mid d, q)$, are unlikely to be the optimal solution for $\arg\max_{q,a} P(q, a \mid d)$; a similar conclusion can be obtained for D2A2Q.

Inspired by the superiority of the transformer [35] in utterance representation, we adopt the self-attentive unit as the basic unit of the encoder and decoder in our OneStop model. As shown in Figure 3, each self-attentive unit consists of a self-attention layer and a position-wise fully connected feed-forward layer. Each of these two layers is employed with a residual connection, followed by layer normalization. More specifically, the whole computation process in the self-attentive module can be summarized as:

$$\mathrm{Att}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\mathsf{T}}}{\sqrt{d_m}}\right)\mathbf{V}$$
$$\mathbf{X}_1 = f_{\text{norm}}(\mathbf{Q} + \mathrm{Att}(\mathbf{Q}, \mathbf{K}, \mathbf{V}))$$
$$\mathbf{X}_2 = \max(0, \mathbf{X}_1\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$$
$$f_{\text{att}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = f_{\text{norm}}(\mathbf{X}_1 + \mathbf{X}_2)$$

where $d_m$ is the model dimension, $f_{\text{att}}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \in \mathbb{R}^{t \times d_m}$, $t$ is the input length, $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{t \times d_m}$, and $\mathbf{W}_1$, $\mathbf{b}_1$, $\mathbf{W}_2$, $\mathbf{b}_2$ are learnable model parameters. The computation process of the self-attentive module is shown in Figure 3.
Figure 3: Self-attentive module.
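The self-attentive unit can be sketched in NumPy as below. This is a minimal sketch following the equations above; the function names are ours, and real implementations add multi-head projections, masking, and dropout.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # f_norm: normalize each row to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attentive_unit(Q, K, V, W1, b1, W2, b2):
    # Att(Q, K, V) = Softmax(Q K^T / sqrt(d_m)) V
    d_m = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_m)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    x1 = layer_norm(Q + weights @ V)                 # residual + layer norm
    x2 = np.maximum(0.0, x1 @ W1 + b1) @ W2 + b2     # position-wise FFN (ReLU)
    return layer_norm(x1 + x2)                       # second residual + norm

# Toy usage: input length t = 4, model dimension d_m = 8, FFN width 16.
rng = np.random.default_rng(0)
t, d_m, d_ff = 4, 8, 16
X = rng.standard_normal((t, d_m))
W1, b1 = rng.standard_normal((d_m, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_m)), np.zeros(d_m)
out = self_attentive_unit(X, X, X, W1, b1, W2, b2)   # self-attention: Q = K = V
```

The output keeps the input shape (t, d_m), matching $f_{\text{att}} \in \mathbb{R}^{t \times d_m}$.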
The overview of our OneStop model is presented in Figure 4.

Figure 4: Overview of our proposed OneStop model. (The pre-trained encoder reads the document; the pre-trained decoder generates the question; an attention unit over the encoder outputs, queried by the decoder outputs, feeds the answer span predictor, which is trained against the true answer span.)

The OneStop model uses the canonical sequence-to-sequence transformer [35] architecture. Here we take the BART model as an example to illustrate our model structure; note that our OneStop approach can be easily adapted from other pre-trained language models such as T5 [27] or ProphetNet [26]. The OneStop model consists of a bidirectional encoder and an auto-regressive decoder. The encoder takes the document as input, and each decoder layer performs cross-attention over the final hidden layer of the encoder's outputs. The decoder decodes the generated question in an auto-regressive fashion. The start and end positions of the answer span are predicted based on the encoder outputs and the decoder's outputs.
As described above, the input of the encoder is the document $d$ and the output of the decoder is expected to be the question $q$. The cross-entropy loss for question generation is:

$$\Phi_{\text{lm}} = -\sum_{t=1}^{|q|} \log P(q_t \mid q_{<t}, d; \theta), \quad (5)$$

where $|q|$ is the length of the question and $P(q_t \mid q_{<t}, d; \theta)$ is the predicted probability of token $q_t$. After the question is generated, we use the decoder outputs, together with the encoder outputs, to predict the start and end positions of the answer span.
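Once the span predictor has produced start and end distributions over document positions, a span still has to be decoded from them. A common decoding rule is to take the highest-scoring (start, end) pair with end ≥ start; the paper does not spell out its decoding rule, so the sketch below is an assumption, and the function name is ours.

```python
def predict_answer_span(p_start, p_end):
    # Choose (start, end) maximizing p_start[s] * p_end[e] subject to e >= s.
    # p_start and p_end are probability lists over document token positions.
    best_score, best_span = -1.0, (0, 0)
    for s, ps in enumerate(p_start):
        for e in range(s, len(p_end)):
            score = ps * p_end[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    return best_span

# Toy distributions over a 3-token document.
span = predict_answer_span([0.1, 0.7, 0.2], [0.1, 0.2, 0.7])  # -> (1, 2)
```

The e ≥ s constraint is what independent argmaxes over the two distributions would not guarantee: they can yield an end position before the start position.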
As introduced in Equation 4, we have:

$$P(q, a \mid d; \theta) = P(q \mid d; \theta) \cdot P(a \mid d, q; \theta) = \left(\prod_{t=1}^{|q|} P(q_t \mid q_{<t}, d; \theta)\right) \cdot \Big(P_{\text{start}}(a_{\text{start}} \mid d, q; \theta) \cdot P_{\text{end}}(a_{\text{end}} \mid d, q; \theta)\Big). \quad (8)$$

The negative log-likelihood of the OneStop model can be expressed as:

$$\Phi = -\log P(q, a \mid d; \theta) = -\sum_{t=1}^{|q|} \log P(q_t \mid q_{<t}, d; \theta) - \log P_{\text{start}}(a_{\text{start}} \mid d, q; \theta) - \log P_{\text{end}}(a_{\text{end}} \mid d, q; \theta) = \Phi_{\text{lm}} + \Phi_{\text{start}} + \Phi_{\text{end}}. \quad (9)$$

We use a generalization of the OneStop objective that introduces a hyperparameter $\gamma$ to balance question generation and answer extraction:

$$\Phi = \gamma \cdot \Phi_{\text{lm}} + (1 - \gamma) \cdot (\Phi_{\text{start}} + \Phi_{\text{end}}). \quad (10)$$

Training Algorithm
The training algorithm of the OneStop model is described in Algorithm 1.
Algorithm 1:
Training algorithm of OneStop model.
Input: $(d, q, a)$ triples; a pre-trained BART language model.
Output: a OneStop model that takes a document as input and outputs QA pairs.
1. Load the pre-trained BART model as the initial checkpoint of the generation part of the OneStop model;
2. Fine-tune the question generation part of the OneStop model with $\gamma = 1$, i.e., $\Phi = \Phi_{\text{lm}}$;
3. Fine-tune the answer prediction part of the OneStop model with $\gamma = 0$, i.e., $\Phi = \Phi_{\text{start}} + \Phi_{\text{end}}$;
4. Determine the value of $\gamma$;
5. While not converged: fine-tune the OneStop model with $\Phi = \gamma \cdot \Phi_{\text{lm}} + (1 - \gamma) \cdot (\Phi_{\text{start}} + \Phi_{\text{end}})$.

Inference of OneStop Model. In the inference phase, we feed the document into OneStop's encoder, and the question is generated from OneStop's decoder in an auto-regressive fashion. The start and end positions of the answer span are predicted by the answer span predictor network; with these positions, we can obtain the answer span in the document for the generated question. The OneStop model also supports generating multiple QA pairs for long documents, i.e., a long document can be split into multiple sub-documents, and OneStop can generate the most related QA pair for each sub-document.
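The $\gamma$-weighted objective used in the fine-tuning steps above can be sketched numerically. This is a minimal sketch with a function name of our own; in real training the log-probabilities come from the model's softmax outputs, and $\gamma = 0.2$ mirrors the setting reported in the experiments.

```python
import math

def onestop_loss(q_token_logps, start_logp, end_logp, gamma=0.2):
    # Phi_lm: negative log-likelihood of the question tokens, Eq. (5).
    phi_lm = -sum(q_token_logps)
    # Phi_start / Phi_end: negative log-probabilities of the true boundaries.
    phi_start, phi_end = -start_logp, -end_logp
    # Eq. (10): gamma balances question generation against answer extraction.
    return gamma * phi_lm + (1.0 - gamma) * (phi_start + phi_end)

# Toy usage: a two-token question with token probabilities 0.5 and 0.25,
# and span-boundary probabilities 0.8 (start) and 0.5 (end).
loss = onestop_loss([math.log(0.5), math.log(0.25)],
                    math.log(0.8), math.log(0.5), gamma=0.2)
```

At $\gamma = 1$ the objective reduces to pure question generation (step 2 of Algorithm 1); at $\gamma = 0$ it reduces to pure span prediction (step 3).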
EXPERIMENTS

In this section, we elaborate on the datasets, the involved baselines, the evaluation metrics, and the model settings.
We conducted experiments on three large-scale machine reading comprehension datasets to evaluate the performance of our proposed OneStop model.
• SQuAD [28]: SQuAD consists of questions posed by crowdworkers on Wikipedia articles; the corresponding answer is a sub-span of the corresponding article.
• NewsQA [34]: the documents in NewsQA are articles collected from CNN news. Similar to SQuAD, questions are acquired through crowd-sourcing while the answer is a sub-span of the documents.
• DuReader [10]: DuReader is an open-domain machine reading comprehension dataset in which questions are collected from real anonymized user queries. The documents and the answers are acquired using a search engine.
In our setting, the answer should be a sub-span of the corresponding document, so we filtered out the data items in DuReader whose answers are not part of the document. The QA pair associated with one document should be unique; however, for SQuAD and NewsQA, one long document may have more than one QA pair. For this reason, we split long documents into multiple sub-documents to ensure that each sub-document contains only one QA pair. We list the statistics of the modified datasets in Table 2.
Table 2: The statistics of the filtered datasets.
SQuAD NewsQA DuReader
To evaluate our proposed OneStop model's performance, we compare it with two types of baselines: question generation models, used to evaluate the quality of the generated questions, and QA pair generation baselines, used to evaluate the quality of the generated QA pairs.
Baselines for Question Generation
We used the following models as baselines for the evaluation of question generation.
• DeepNQG: the neural question generation model proposed in [8], an end-to-end model implemented with GRU modules.
• CRF-DeepNQG: we followed the conventional setting of the D2A2Q approach, which first selects the most likely answer span from the document and then utilizes the extracted answer span and the document to generate a question. The answer extraction (AE) is defined as a sequence tagging task implemented with a BiLSTM-CRF model [7, 14]. The embeddings of the document and the extracted answer are concatenated together to generate the answer-specific question. If the answer extraction model predicts more than one answer tag, we randomly selected one span from the span set as the answer to generate the question. If no answer tag is predicted, we viewed the whole document as the selected answer span.
• BART-QG: a model fine-tuned from a pre-trained BART [21] model on the question generation task, whose input is the document and whose output is the question.
• BART-A2QG: a model fine-tuned from a pre-trained BART [21] model, whose input is the answer and whose output is the question. This model is to explore the direct utility of the answer in question generation.
Baselines for Question-Answer Pair Generation: Pipeline vs. OneStop
For pipeline QA pair generation methods, we utilized the aforementioned question generation models for question generation.
Table 3: The comparison of baselines on question generation.
Models (per dataset: BLEU-1 / BLEU-2 / Rouge-1 / Rouge-2 / Rouge-L)
DeepNQG      SQuAD 17.49 / 8.81 / 17.54 / 4.53 / 17.77; NewsQA 14.30 / 6.22 / 14.64 / 2.81 / 14.79; DuReader 3.14 / 1.72 / 4.67 / 1.32 / 4.72
CRF-DeepNQG  SQuAD 19.61 / 9.68 / 19.10 / 4.74 / 18.92; NewsQA 17.06 / 7.93 / 17.07 / 3.73 / 17.11; DuReader 0.70 / 0.53 / 7.66 / 4.10 / 7.76
BART-QG
BART-A2QG    SQuAD 20.85 / 10.51 / 21.50 / 5.50 / 18.81; NewsQA 21.53 / 11.81 / 23.29 / 6.96 / 21.75; DuReader 40.61 / 33.94 / 42.85 / 29.32 / 38.58
OneStop      SQuAD 31.32 / 21.28 / 32.77 / 14.79 / 29.10; NewsQA 22.28 / 13.46 / 23.39 / 8.39 / 21.90; DuReader 45.19 / 38.35 / 47.56 / 33.59 / 43.16

To obtain the corresponding answer to the generated question, we chose BERT [6] as the answer extraction model. In this setting, we have the following QA pair generation approaches:
• Existing Pipeline Approaches
- DeepNQG + BERT-MRC: the DeepNQG model for question generation and the BERT model for answer extraction.
- CRF-DeepNQG + BiLSTM-CRF: as described above, the answer extraction model is implemented with a BiLSTM-CRF model. The BiLSTM-CRF model's tagged phrase is used as the answer a, while the answer-aware question is used as the question q corresponding to a.
- BART-A2QG + BERT-MRC: the BART-A2QG model for question generation and the BERT model for answer extraction.
- BART-QG + BERT-MRC: the BART model is used as the question generation model and the BERT model for answer extraction.
• Our Methods
- OneStop: our proposed OneStop model involves only one model, in which question generation and answer extraction are simultaneous and affect each other.
- OneStop + BERT-MRC: the approach in which we use the question generated by OneStop and the answer extracted by the BERT model.
We evaluated the involved baselines from two aspects: (1) the similarity between the generated questions and the ground truth; (2) the quality of the generated QA pairs.
Similarity Between Generated Questions and Ground-Truth
We chose BLEU-1, BLEU-2 [25], Rouge-1, Rouge-2, and Rouge-L [22] as the evaluation metrics to measure the similarity between the generated questions and the ground-truth questions.
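As a concrete reference for the unigram metric above, here is a minimal BLEU-1 sketch (clipped unigram precision with a brevity penalty). Published results would normally use a standard toolkit implementation rather than this simplification, and the function name is ours.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    # Clipped unigram precision: each candidate word counts at most as
    # often as it appears in the reference.
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / max(len(cand), 1)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * precision
```

For example, a candidate identical to the reference scores 1.0, while a two-word candidate against a four-word reference is discounted by the brevity penalty even when every word matches.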
Quality of Generated Question-Answer Pairs
Since there is no widely accepted automatic metric for the quality of QA pairs, we used the scores from two human annotators as the quality of the generated QA pairs. The specific scoring criteria for the human raters are as follows:
• Score 0: the QA pair is given a score of 0 if any of the following cases are encountered: (1) there are serious grammatical errors in the question; (2) the question is an empty string; (3) the question is totally unrelated to the given document; (4) the question is unanswerable, i.e., it cannot be answered by referring to the document; (5) the question and the answer are totally unrelated.
• Score 0.5: the question is partly related to the document.
• Score 1: the question is closely related to the document and is grammatically correct, but the answer is not associated with the question.
• Score 1.5: the question is closely related to the document and is grammatically correct, but the answer can only partially answer the question or contains redundant information.
• Score 2: the question is closely related to the document, and the answer replies to the question precisely and concisely.
The involved baselines are compared based on the average over the human raters' scores.
The encoder and the decoder in all the involved pre-trained language models contain 6 layers and a hidden size of 768. We utilized a well-trained English pre-trained BART model as the initial checkpoint of BART-A2QG, BART-QG, and OneStop on the SQuAD and NewsQA datasets. For models on the DuReader dataset, we pre-trained the BART language model on a very large Chinese Baike dataset as the initial checkpoint of BART-A2QG, BART-QG, and OneStop. In our experiments, we set $\gamma = 0.2$. The beam size is set to 3. The batch size is 16 and the number of epochs is set to 4. We chose Adam as our optimizer. The learning rate is set to 1e-4 with a warmup ratio of 0.05. The dropout rate is 0.1. All the experiments are run on P100 GPUs.
We list the baselines' performance on question generation in Table 3. As can be observed, the OneStop model achieves better or comparable performance than the baselines on question generation. The performance of DeepNQG is quite poor, since the RNN-based model cannot handle long documents well. The comparison between BART-QG and OneStop shows that the answer extraction does not degrade the quality of the generated questions; it even improves the question generation performance. This phenomenon can be explained by the fact that the probability of answer extraction $P(a \mid d, \hat{q})$ can further enhance the question generation model $P(q \mid d; \theta)$.

We list the results of the QA pair evaluation in Table 4. From the results, we have the following observations:
(1) OneStop significantly outperforms the pipeline methods CRF-DeepNQG + BiLSTM-CRF, DeepNQG + BERT-MRC, and BART-A2QG + BERT-MRC, achieving a human rater's score of 1.41, which proves the effectiveness of the OneStop model on question-answer pair generation.
(2) Compared with OneStop, OneStop + BERT-MRC sees an additional performance improvement. The difference between OneStop (1.41) and OneStop + BERT-MRC (1.67) shows that the answer extraction module in the OneStop model is not as good as the answer extraction model implemented with BERT. This can be explained by the fact that the BERT model for answer extraction has 12 transformer layers and thus a better representation capacity for document and question encoding, whereas both the encoder and the decoder in the OneStop model have only 6 transformer layers and hence a less satisfying representation ability.
(3) The comparison between BART-QG + BERT-MRC and OneStop + BERT-MRC verifies the effect of the answer extraction module on OneStop's question generation module. The improvement (from 1.61 to 1.67) demonstrates that the answer extraction module in the OneStop model enhances the quality of the QA pairs.
Table 4: Result of generated question-answer pairs.
Approach                    Score
CRF-DeepNQG + BiLSTM-CRF    0.22
DeepNQG + BERT-MRC          0.20
BART-A2QG + BERT-MRC        0.24
BART-QG + BERT-MRC          1.61
OneStop                     1.41
OneStop + BERT-MRC          1.67
Another significant advantage of our OneStop model is efficiency:
(1) We list the number of parameters of each QA pair generation approach in Table 5. As we can see, the OneStop model is one of the lightest models for QA pair generation.
(2) The pipeline approaches involve more than one model and thus require additional effort and computational resources to train and deploy. OneStop, nevertheless, involves only one model, which is much more efficient for both training and deployment.
(3) The pipeline baselines require additional human effort when deployed online. For instance, the answer extraction (AE) model in D2A2Q may select none or more than one answer span, which requires well-designed rules to select among these answer spans for question generation.
Table 5: The number of parameters (millions) of each QApair generation approach.
Approach                    SQuAD  NewsQA  DuReader
CRF-DeepNQG + BiLSTM-CRF    109    71      163
DeepNQG + BERT-MRC          151    146     323
BART-A2QG + BERT-MRC        248    248     423
BART-QG + BERT-MRC          248    248     423
OneStop                     142    142     121
OneStop + BERT-MRC          253    253     427
We present several QA pairs generated by our OneStop model in Table 6, together with the answer spans predicted by the BERT model. Based on our observation, in most cases the answer predicted by the OneStop model is the same as that of BERT. In certain cases, however, the answer extraction in OneStop tends to include related information beyond the precise answer span. Most of the QA pairs extracted by our OneStop model can be applied in downstream tasks like question answering.
Table 6: Question-answer pairs generated by OneStop.

Document: The French crown's refusal to allow non-Catholics to settle in New France may help to explain that colony's slow rate of population growth compared to that of the neighbouring British colonies, which opened settlement to religious dissenters.
OneStop question: What did the French government refusal to allow?
BERT-MRC answer: non-Catholics
OneStop answer: non-Catholics to settle in New France

Document: The delta is delimited in the West by the Alter Rhein ("Old Rhine") and in the East by a modern canalized section.
OneStop question: What is the delta delimited by?
BERT-MRC answer: Alter Rhein
OneStop answer: Old Rhine
Existing pipeline QA pair generation approaches suffer from problems such as incompatible and sub-optimal solutions, inefficiency, and heavy human effort. This paper proposes a transformer-based sequence-to-sequence model that generates QA pairs in a one-stop fashion. Our model achieves state-of-the-art performance on question generation and QA pair generation on three large-scale machine reading comprehension datasets, in a more efficient way. Our work sheds light on a novel one-stop approach to QA pair extraction. In future work, we will explore more effective techniques for generating QA pairs, such as a soft approach to answer extraction and the copy mechanism.