Multi-span Style Extraction for Generative Reading Comprehension
Junjie Yang, Zhuosheng Zhang, Hai Zhao∗
SJTU-ParisTech Elite Institute of Technology, Shanghai Jiao Tong University, Shanghai, China
Department of Computer Science and Engineering, Shanghai Jiao Tong University
Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
[email protected], [email protected], [email protected]
Abstract
Generative machine reading comprehension (MRC) requires a model to generate well-formed answers. For this type of MRC, the answer generation method is crucial to model performance. However, generative models, which are supposed to be the right choice for the task, generally perform poorly. At the same time, single-span extraction models have been proven effective for extractive MRC, where the answer is constrained to a single span in the passage. Nevertheless, they generally suffer from generating incomplete answers or introducing redundant words when applied to generative MRC. Thus, we extend the single-span extraction method to multi-span, proposing a new framework which enables generative MRC to be smoothly solved as multi-span extraction. Thorough experiments demonstrate that this novel approach can alleviate the dilemma between generative models and single-span models and produce answers with better-formed syntax and semantics. We will open-source our code for the research community.
Introduction

Machine Reading Comprehension (MRC) is considered a nontrivial challenge in natural language understanding. Recently, we have seen continuous success in this area, partially benefiting from the release of massive and well-annotated datasets from both academic (Rajpurkar et al., 2018; Reddy et al., 2019) and industry (Bajaj et al., 2018; He et al., 2018) communities.

The widely used span-extraction models (Seo et al., 2017; Ohsugi et al., 2019; Lan et al., 2020) formulate the MRC task as a process of predicting the start and end positions of the answer span inside the given passage.

∗ Corresponding author. This paper was partially supported by National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of National Natural Science Foundation of China (U1836222 and 61733011).
Figure 1: Example of how a well-formed answer is generated by multi-span style extraction.

Such models have been proven effective on tasks that constrain the answer to be an exact span in the passage (Rajpurkar et al., 2018). However, for generative MRC tasks whose answers are highly abstractive, single-span extraction based methods easily suffer from incomplete answers or redundant words. Thus, there still exists a large gap between the performance of single-span extraction baselines and human performance. In the meantime, we have observed that utilizing multiple spans appearing in the question and passage to compose the well-formed answer could be a promising way to alleviate these drawbacks. Figure 1 shows how the mechanism of multi-span style extraction works for an example from the MS MARCO task (Bajaj et al., 2018), where the well-formed answer cannot simply be extracted as a single span from the input text.

Therefore, in this work, we propose a novel answer generation approach that takes advantage of the effectiveness of span extraction and the concise spirit of the multi-span style to synthesize the free-form answer, together with a framework as a whole for multi-passage generative MRC. We call our framework MUSST, for MUlti-Span STyle extraction. Our framework is also empowered by a well pre-trained language model as the encoder component of our model. It provides deep understanding of both the input passage and question, and models the information interaction between them. We conduct a series of experiments and the corresponding ablations on the MS MARCO v2.1 dataset.

Our main contributions in this paper can be summarized as follows:

• We propose a novel multi-span answer annotator to transform the initial well-formed answer into a series of spans that distribute in the question and passage.

• We generalize the single-span extraction based method to the multi-span style by introducing a lightweight but powerful answer generator, which supports the extraction of a variable number of answer spans during prediction.

• To make better usage of the large dataset for the passage ranking task, we propose dynamic sampling during the training of the ranker, which selects the passage most likely to entail the answer.
Framework

In this section, we present our proposed framework, MUSST, for the multi-passage generative MRC task. Figure 2 depicts the general architecture of our framework, which consists of a passage ranker, a multi-span answer annotator, and a question-answering module.

Figure 2: Our framework MUSST.
Passage Ranker

Given a question $Q$ and a set of $k$ candidate passages $P = \{P_1, P_2, \ldots, P_k\}$, the passage ranker is responsible for ranking the passages based on their relevance to the question. In other words, the model is requested to output the conditional probability distribution $P(y \mid Q, P; \theta)$, where $\theta$ denotes the model parameters and $P(y = i \mid Q, P; \theta)$ denotes the probability that passage $P_i$ can be used to answer question $Q$.

For each input question and passage pair $(Q, P_i)$, we represent it as a single packed sequence of length $n$ of the form "[CLS] Q [SEP] P_i [SEP]". We pass the whole sequence into a contextualized encoder to produce its contextualized representation $E \in \mathbb{R}^{n \times h}$, where $h$ denotes the hidden size of the Transformer blocks. Following the fine-tuning strategy of Devlin et al. (2019) for the classification task, we take the final hidden vector $c \in \mathbb{R}^h$ corresponding to the first input token ([CLS]) as the input's aggregate representation. Our encoder also models the interaction between the question and the passage.

Given the output $c$ of the encoding layer, we pass it through a fully connected multi-layer perceptron consisting of two linear transformations with a tanh activation in between:

$$s = \mathrm{softmax}\left(W_2 \tanh(W_1 c + b_1) + b_2\right) \in \mathbb{R}^2, \qquad u_i = s_0, \quad r_i = s_1,$$

where $W_1 \in \mathbb{R}^{h \times h}$, $W_2 \in \mathbb{R}^{2 \times h}$, $b_1 \in \mathbb{R}^h$ and $b_2 \in \mathbb{R}^2$ are trainable parameters. Here, $r_i$ and $u_i$ are respectively the relevance and unrelevance scores for the pair $(Q, P_i)$. The relevance scores are subsequently normalized across all the candidate passages of the same question:

$$\hat{r}_i = \frac{\exp(r_i)}{\sum_{j=1}^{k} \exp(r_j)},$$

where $\hat{r}_i$ indicates the probability that passage $P_i$ entails the answer to question $Q$.

We define a question-passage pair where the passage entails the answer as a positive training sample and note the positive passage as $P^+$. During the training phase, we adopt negative sampling with one negative sample: for each positive instance $(Q, P^+)$, we randomly sample a negative passage $P^-$ from the unselected passages of the same question. The model is trained by minimizing the following cost function:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \left[ \log r(Q_t, P_t^+) + \log u(Q_t, P_t^-) \right],$$

where $T$ is the number of questions in the training set, $r(Q_t, P_t^+)$ denotes the relevance score of $(Q_t, P_t^+)$ and $u(Q_t, P_t^-)$ the unrelevance score of $(Q_t, P_t^-)$. Moreover, motivated by Liu et al. (2019), we resample the negative training instances at the beginning of each training epoch, so that a question is not always paired with the same negative passage. We name this dynamic sampling.
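To make the scoring and training procedure concrete, here is a minimal PyTorch sketch of the ranking head, the pairwise cost, and the per-epoch negative resampling; the encoder call, the batching logic, and all variable names are our assumptions for illustration, not the released implementation.

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class RankerHead(nn.Module):
    """Two-layer perceptron over the [CLS] vector, as described above."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)  # W1, b1
        self.proj = nn.Linear(hidden_size, 2)             # W2, b2

    def forward(self, cls_vec: torch.Tensor):
        # cls_vec: (batch, h) aggregate representation c of "[CLS] Q [SEP] P [SEP]"
        s = F.softmax(self.proj(torch.tanh(self.dense(cls_vec))), dim=-1)
        u, r = s[:, 0], s[:, 1]  # unrelevance / relevance scores
        return r, u


def ranking_loss(r_pos: torch.Tensor, u_neg: torch.Tensor) -> torch.Tensor:
    # J(theta) = -(1/T) * sum_t [ log r(Q_t, P_t+) + log u(Q_t, P_t-) ]
    return -(torch.log(r_pos) + torch.log(u_neg)).mean()


def dynamic_sampling(examples):
    # At the start of every epoch, redraw one negative passage per question
    # from its unselected passages, so the training pattern changes across epochs.
    return [
        (ex["question"], ex["positive"], random.choice(ex["negatives"]))
        for ex in examples
    ]
```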
Multi-span Answer Annotator

In this section, we introduce our syntactic multi-span answer annotator. Before training our question-answering module, we need to extract non-overlapping spans from the question and passage based on the original answer from the training dataset. Our annotator is responsible for transforming the original answer phrase into multiple spans that distribute in the question and passage, subject to syntactic constraints. The attempt to extract the answer spans syntactically is motivated by our intuition that human editors compose the original answer in an analogous way.

As shown in the middle of Figure 2, we transform the answer phrase into a parsing tree and traverse the parsing tree in a DFS (depth-first search) manner. At each visit of a subtree, we check whether the span represented by the subtree appears in the question or passage text. We obtain a span list after traversing the whole parsing tree. However, in some cases, the original answer still cannot be perfectly composed of words from the input text even in a multi-span style. We discard these bad samples by comparing their edit distances with a threshold value that is set beforehand.
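This filtering step boils down to a Levenshtein comparison between the gold answer and its reconstruction. A small sketch follows; the paper does not specify whether the distance is computed over characters or tokens, so the token-level choice here is an assumption.

```python
def edit_distance(a: list, b: list) -> int:
    # Classic dynamic-programming Levenshtein distance between two sequences.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]


def keep_sample(answer: str, reconstruction: str, d_max: int) -> bool:
    # Discard a training sample when the answer rebuilt from the annotated
    # spans strays too far from the original gold answer.
    return edit_distance(answer.split(), reconstruction.split()) <= d_max
```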
Algorithm 1: Syntactic Multi-span Answer Annotation
Input: Question Q = {q_1, q_2, ..., q_m}, passage P = {p_1, p_2, ..., p_n}, and gold answer A = {a_1, a_2, ..., a_k}
Parameter: Edit distance threshold d_max
Output: A list of start and end positions of answer spans in the question and passage

Let M be an empty list
Pack question Q and passage P into a single sequence C
Get the syntactic parsing tree T of gold answer A with a constituency parser
Let S be the stack of subtrees to be traversed; initialize S with the root R of the tree T
while S is not empty do
  V = POP(S)
  Get the list of all the leaves of subtree V: L = {l_1, l_2, ..., l_n}
  if L is a sublist of C then
    Get the start index s and end index e of L in C by the Knuth-Morris-Pratt pattern searching algorithm
    Add (s, e) to the span position list M
  else
    for each child subtree U of V (from right to left) do
      PUSH(S, U)
    end for
  end if
end while
Reconstruct answer A' from the span position list M
Let d = EDITDISTANCE(A, A')
if d > d_max then
  Empty the list M
end if
M* = PRUNING(M)
return M*

An important final step is to prune the answer span list. The pruning procedure sticks to the following principle: if two spans adjacent in the list are contiguous in the original text, we join them together. Pruning heavily reduces the number of spans needed to recover the original answer phrase. A more comprehensive description of our annotator is given in Algorithm 1.
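Below is a minimal Python sketch of Algorithm 1 built on NLTK's Tree type; the constituency parse itself comes from Stanford CoreNLP in our setup, and a plain sublist scan stands in for the Knuth-Morris-Pratt search, which only changes the constant factor. The edit-distance filter from the previous sketch would be applied to the result.

```python
from nltk.tree import Tree


def find_sublist(needle, haystack):
    # Return the inclusive (start, end) indices of needle inside haystack,
    # or None. KMP would do this in linear time; a direct scan suffices here.
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i, i + n - 1
    return None


def annotate_spans(answer_tree: Tree, context_tokens):
    # DFS over the answer's constituency tree: keep the largest subtrees
    # whose leaves occur verbatim in the packed question+passage sequence.
    spans, stack = [], [answer_tree]
    while stack:
        node = stack.pop()
        leaves = node.leaves() if isinstance(node, Tree) else [node]
        hit = find_sublist(leaves, context_tokens)
        if hit is not None:
            spans.append(hit)
        elif isinstance(node, Tree):
            # Push children right-to-left so they are visited left-to-right.
            for child in reversed(list(node)):
                stack.append(child)
    return prune(spans)


def prune(spans):
    # Join spans that are adjacent in the list and contiguous in the text,
    # heavily reducing the number of spans needed to recover the answer.
    merged = []
    for start, end in spans:
        if merged and merged[-1][1] + 1 == start:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```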
Question-Answering Module

Given a question $Q$ and a passage $P$, the question-answering module is requested to answer the question based on the information provided by the passage. In other words, the model outputs the conditional probability distribution $P(y \mid Q, P)$, where $P(y = A \mid Q, P)$ denotes the probability that $A$ is the answer.

The architecture of the reader is analogous to the encoder of the passage ranker described above, where we take a pre-trained language model as the encoder. But instead of keeping only the aggregate representation, we pass the whole output of the last layer on to predict the answer spans:

$$M = \mathrm{Encoder}(Q, P) \in \mathbb{R}^{h \times n},$$

where $n$ is the length of the input token sequence and $h$ is the hidden size of the encoder.

Our answer generator is responsible for composing the answer in a multi-span extraction style. Let $m$ be the maximum number of spans to be extracted. We treat each single span prediction as a single-span extraction MRC task. Following Lan et al. (2020), we adopt a linear layer to predict the start and end positions of the span in the input sequence. It is worth noticing that our model is also able to predict answer spans from the question. The probability distribution of the $j$-th span's start position over the input tokens is obtained by:

$$\hat{p}^{j,\mathrm{start}} = \mathrm{softmax}(W_j^{s} M + b_j^{s}),$$

where $W_j^{s} \in \mathbb{R}^{1 \times h}$ and $b_j^{s} \in \mathbb{R}$ are trainable parameters and $\hat{p}^{j,\mathrm{start}}_k$ denotes the probability of token $k$ being the start of answer span $j$. The end position distribution of answer span $j$ is obtained with the analogous formula:

$$\hat{p}^{j,\mathrm{end}} = \mathrm{softmax}(W_j^{e} M + b_j^{e}).$$

During training, we append a special virtual span, whose start and end positions both equal the length of the input sequence, to the end of the annotated answer span list. This enables our model to generate a variable number of answer spans during prediction, with the virtual span serving as a stop symbol. The cost function is defined as follows:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{m_t} \left[ \log \hat{p}^{j,\mathrm{start}}_{y_t^{j,\mathrm{start}}} + \log \hat{p}^{j,\mathrm{end}}_{y_t^{j,\mathrm{end}}} \right],$$

where $T$ is the number of training samples, $m_t$ is the number of answer spans for sample $t$, and $y_t^{j,\mathrm{start}}$ and $y_t^{j,\mathrm{end}}$ are the true start and end positions of the $t$-th sample's $j$-th span.

During inference, at each time step $j$, we choose the answer span $(k, l)$ with $k < l$ that maximizes $\hat{p}^{j,\mathrm{start}}_k \hat{p}^{j,\mathrm{end}}_l$. The decoding procedure terminates when the stop span is predicted. Sometimes, the model tends to generate the same spans repeatedly. To alleviate this repetition problem, at each prediction time step $j$ we mask out the span positions predicted at previous time steps ($< j$) when calculating the probability distributions of new start and end positions. Since the masking depends on the previously predicted spans, we name it conditional masking. The extracted spans are finally joined together to form the answer phrase.
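To illustrate the inference-time behavior, here is a hedged PyTorch sketch of greedy multi-span decoding with conditional masking; tensor shapes, the per-slot heads, and the convention that the last position index stands in for the virtual stop span are our own simplifications rather than the released code.

```python
import torch


@torch.no_grad()
def decode_spans(M, start_heads, end_heads, max_spans):
    """Greedy multi-span decoding with conditional masking (a sketch).

    M           : (L, h) encoder output; by convention here, index L - 1
                  plays the role of the virtual stop span.
    start_heads : per-slot nn.Linear(h, 1) modules giving start logits.
    end_heads   : per-slot nn.Linear(h, 1) modules giving end logits.
    """
    L = M.size(0)
    stop = L - 1
    used = torch.zeros(L, dtype=torch.bool)  # positions predicted earlier
    spans = []
    for j in range(max_spans):
        # Conditional masking: forbid positions inside previously predicted spans.
        s_logits = start_heads[j](M).squeeze(-1).masked_fill(used, float("-inf"))
        e_logits = end_heads[j](M).squeeze(-1).masked_fill(used, float("-inf"))
        p_start = torch.softmax(s_logits, dim=-1)
        p_end = torch.softmax(e_logits, dim=-1)
        # Choose (k, l) with k <= l maximizing p_start[k] * p_end[l]
        # (k <= l also admits single-token spans).
        score = torch.triu(p_start.unsqueeze(1) * p_end.unsqueeze(0))
        flat = int(torch.argmax(score))
        k, l = flat // L, flat % L
        if k == stop or l == stop:  # the stop span terminates decoding
            break
        spans.append((k, l))
        used[k:l + 1] = True
    return spans
```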
Experiments

We evaluate our framework on MS MARCO v2.1 (Bajaj et al., 2018), a large-scale open-domain generative MRC benchmark. MS MARCO v2.1 provides two MRC tasks: Question Answering (QA) and Natural Language Generation (NLG). The statistics of the corresponding dataset sizes are presented in Table 1. Both datasets consist of questions sampled from Bing's search logs, and each question is accompanied by an average of ten passages that may contain the answers. QA and NLG are subsets of ALL, which also contains the unanswerable questions. The datasets can be obtained from the official site (https://microsoft.github.io/msmarco/).

Table 1: Statistics of the MS MARCO v2.1 dataset (rows ALL, QA and NLG; columns Train, Dev and Test). The numbers in parentheses indicate the percentage of examples whose answer is a single span in the gold passage.

Distinguished from the QA task, the NLG task requires the model to provide a well-formed answer that can be read and understood by a natural speaker without any additional context. Therefore NLG-style answers are more abstractive than QA-style answers. Table 1 also shows the percentage of examples where the answer can be extracted as a single span in the gold passage. Unsurprisingly, answers from the QA set are much more likely to match a span in the passage than those in the NLG set. Moreover, Nishida et al. (2019) state that the QA task prefers more concise answers than the NLG task, averaging 13.1 words, while the latter averages 16.6 words. Therefore, the NLG set is more suitable for evaluating model performance on generative MRC.

We compare MUSST with the following baseline models: single-span extraction and seq2seq. For the single-span extraction baseline, we employ the ALBERT model for the SQuAD dataset (Lan et al., 2020), trained only on samples where the answer is a single span in the passage. We adopt the Transformer model from Vaswani et al. (2017) as our seq2seq baseline. For a fair comparison, the baseline models share the same passage ranker as the one in MUSST.

BLEU-1 (Papineni et al., 2002) and ROUGE-L (Lin, 2004) are adopted as the official evaluation metrics, with the official leaderboard choosing ROUGE-L as the main metric. The official evaluation scripts can be found at https://github.com/microsoft/MSMARCO-Question-Answering/tree/master/Evaluation. In the meantime, we use Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) for our ranker, as sketched below.
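For reference, here is a small sketch of the two ranking metrics; the data layout (per-question ranked passage ids plus a set of relevant ids) is an assumption for illustration. With a single relevant passage per question, average precision reduces to reciprocal rank.

```python
def mean_reciprocal_rank(rankings, relevants):
    # rankings: per-question passage ids, best first;
    # relevants: per-question set of relevant passage ids.
    total = 0.0
    for ranking, relevant in zip(rankings, relevants):
        for i, pid in enumerate(ranking, start=1):
            if pid in relevant:
                total += 1.0 / i  # reciprocal rank of the first hit
                break
    return total / len(rankings)


def mean_average_precision(rankings, relevants):
    total = 0.0
    for ranking, relevant in zip(rankings, relevants):
        hits, precision_sum = 0, 0.0
        for i, pid in enumerate(ranking, start=1):
            if pid in relevant:
                hits += 1
                precision_sum += hits / i  # precision at each relevant hit
        total += precision_sum / max(len(relevant), 1)
    return total / len(rankings)
```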
For the multi-span answer annotation, we use the constituency parser from Stanford CoreNLP (Manning et al., 2014). The NLTK package (Bird et al., 2009) is also used to implement our annotator. The maximum edit distance between the answer reconstructed from the annotated spans and the original answer is 32 and 8 for the NLG and QA training sets, respectively.
Model | QA ROUGE-L | QA BLEU-1 | NLG ROUGE-L | NLG BLEU-1
Single-span | 47.96 | – | – | –

Table 2: Performance comparison with our baselines on the QA and NLG development sets. Here, we use the same single ranker for MUSST and the baselines.
The ranker and question-answering module of MUSST are implemented with PyTorch (Paszke et al., 2019) and the Transformers package (Wolf et al., 2020). We adopt ALBERT (Lan et al., 2020) as the encoder in our models and initialize it with the pre-trained weights before fine-tuning. We choose ALBERT-base as the encoder of the passage ranker and ALBERT-xlarge for the question-answering module. Following Lan et al. (2020), we use SentencePiece (Kudo and Richardson, 2018) to tokenize our inputs with a vocabulary size of 30,000. We adopt the Adam optimizer (Kingma and Ba, 2015) to minimize the cost function, and apply two types of regularization during training: dropout and L2 weight decay. Hyperparameter details for training the different modules of our framework are presented in the Appendices. MUSST-NLG and MUSST-QA are trained on the NLG and QA subsets, respectively, and their maximum numbers of spans are set to 9 and 5, respectively. The single-span baseline is implemented with the same packages as MUSST, while the seq2seq baseline is implemented with Fairseq (Ott et al., 2019).
Table 2 shows the results of our single model and the baseline models on the QA and NLG development sets. MUSST significantly outperforms the baselines, including the generative seq2seq model, on the NLG set in terms of both ROUGE-L and BLEU-1. Even on the QA set, our model yields better results in ROUGE-L. Table 3 compares our model with the competing models on the leaderboard. Although our model utilizes only a standalone classifier for passage ranking, multi-span style extraction still helps us rival state-of-the-art approaches.
Model | Answer Generation | Ranking | NLG R-L | NLG B-1 | QA R-L | QA B-1 | Overall Average
Human | – | – | 63.2 | 53.0 | 53.9 | 48.5 | 54.65
Unpublished
PALM | Unknown | Unknown | – | – | – | – | –
Multi-doc Enriched BERT | Unknown | Unknown | 32.5 | 37.7 | – | – | –
Published
BiDAF a♠ | Single-span | Confidence score | 16.9 | 9.3 | 24.0 | 10.6 | 15.20
ConZNet b♠ | Pointer-Generator | Unknown | 42.1 | 38.6 | – | – | –
VNET c♠ | Single-span | Answer verification | 48.4 | 46.8 | 51.6 | 54.3 | –
Deep Cascade QA d♠ | Single-span | Cascade | 35.1 | 37.4 | 52.0 | 54.6 | 44.78
Masque QA e† | Pointer-Generator | Joint trained classifier | 28.5 | 39.9 | 52.2 | 43.7 | 41.08
Masque NLG e† | Pointer-Generator | Joint trained classifier | 49.6 | – | – | – | –
MUSST-NLG † | Multi-span | Standalone classifier | 48.0 | 45.8 | 49.0 | 51.6 | 48.60

Table 3: The performance of our framework and competing models on the MS MARCO v2.1 test set. All the results presented here reflect the MS MARCO leaderboard (microsoft.github.io/msmarco/) as of 28 May 2020. ♠ refers to a model whose results are not reported in the original published paper; BiDAF for MS MARCO is implemented by the official MS MARCO team. † refers to an ensemble submission; whether the other competing models are ensembles is unclear. a Seo et al. (2017); b Indurthi et al. (2018); c Wang et al. (2018b); d Yan et al. (2019); e Nishida et al. (2019).
Model | ROUGE-L | BLEU-1
MUSST | 66.24 | 64.23
w/o pruning | 64.66 | 60.36
w/o conditional masking | 65.50 | 64.31
MUSST w/ gold passage | 75.39 | 74.41

Table 4: Ablation study on the NLG development set.
We perform ablation experiments that quantify the individual contributions of the design choices of MUSST. Table 4 shows the results on the NLG development set. Both pruning and conditional masking contribute to model performance: pruning helps the model converge more easily by reducing the number of spans, while conditional masking better generates answers without suffering from the repetition problem. We also observe that using the gold passage significantly improves question answering, which shows there is still considerable room for improvement in the passage ranker.
On the NLG development set, we also evaluate the answers generated by our syntactic multi-span annotator. The annotated answers obtain 89.35 BLEU-1 and 90.19 ROUGE-L with the gold passages, which demonstrates the effectiveness of our annotator. For MUSST, the results are 74.41 and 75.39, respectively (Table 4), so there is still much room for improvement in the question-answering module.
Figure 3: Distribution of training samples with edit distance less than 4 over the number of annotated answer spans. For better illustration, we filter out samples comprising more than 9 spans.
Figure 3 presents the distribution of span numbers, for samples with edit distance less than 4, over the QA and NLG training sets after the annotation procedure. Most QA-style answers consist of only one span, while the NLG-style answers distribute more uniformly over the range [1, 9].

To better understand the effect of the maximum number of spans generated by the answer generator, we let it vary in the range [2, 12] and conduct experiments on the NLG set with our best single passage ranker, setting the edit distance threshold to 8. The results are presented in Figure 4. Generally, increasing the number of spans augments the token coverage rate, thus yielding better results, but the gain becomes less significant once the maximum number of spans is already large. From Figure 4, we can see that the results vary imperceptibly once the maximum number of spans reaches 5. However, since each span only introduces 4k parameters, which is negligible compared with the encoder (60M), we still choose the maximum number to be 9, which corresponds to the best performance on the development set.
Figure 4: Effect of maximum number of spans.
Figure 5 shows the results of MUSST on the NLG development set for various edit distance thresholds. Interestingly, BLEU-1 is impacted more heavily by the variation of the edit distance threshold than ROUGE-L. Setting the threshold too large may damage model performance by introducing too many incomplete samples.
Figure 5: Effect of edit distance threshold.
Encoder | Parameters | ROUGE-L | BLEU-1
ALBERT-base | 12M | 62.03 | 60.48
ALBERT-large | 18M | 64.93 | 61.67
ALBERT-xlarge | 60M | – | –

Table 5: Effect of ALBERT encoder size.
Table 5 presents experimental results with ALBERT encoders of various model sizes. Unsurprisingly, the model yields stronger results as the encoder gets larger.
Model | Training set | MAP | MRR
Bing (initial ranking) | – | 34.62 | 35.00
MUSST (single) | QA | – | –
w/o dynamic sampling | QA | – | –

Table 6: The performance of the ranker with various configurations on the QA development set.

Table 6 presents our ranker performance in terms of MAP and MRR. The results show that dynamic sampling leads to slightly better results.

Question: how long should a central air conditioner last
Selected Passage:
10 to 20 years - sometimes longer. You should have a service tech come out once a year for a tune up. You wouldn't run your car without regular maintenance and tune ups and you shouldn't run your a/c that way either - if you want it to last as long as possible. Source(s): 20 years working for a major manufacturer of central heating and air conditioning.
Reference Answer: A Central air conditioner lasts for in between 10 and 20 years. / A central air conditioner should last for 10 to 20 years.
Prediction (Baseline): 10 to 20 years.
Prediction (MUSST): a central air conditioner should last for 10 to 20 years.

Table 7: A prediction example from the baseline and MUSST. The highlighted texts are the spans predicted by our model to compose the final answer phrase.
To get an intuitive sense of the prediction ability of MUSST, we show a prediction example on MS MARCO v2.1 from the baseline and MUSST in Table 7. The comparison indicates that our model can effectively extract useful spans, yielding a more complete answer that can be understood independently of the question and passage context.
Related Work

Generative MRC is considered a more challenging task where answers are free-form human-generated text. Recently, we have seen an emerging wave of generative MRC tasks. MS MARCO (Bajaj et al., 2018) is a large-scale real-world reading comprehension dataset whose questions are anonymized search queries issued through Bing or Cortana. NarrativeQA (Kočiský et al., 2018) is the first large-scale question-answering dataset on full-length books and movie scripts, requiring understanding of the underlying narrative rather than relying on shallow pattern matching or salience. DuReader (He et al., 2018) is the Chinese counterpart of MS MARCO but with longer documents and answers. CoQA (Reddy et al., 2019) is a conversational MRC dataset which contains free-form answers.

Earlier approaches tried to generate the answer in a single-span extractive way (Tay et al., 2018b,a; Wang et al., 2018b; Yan et al., 2019; Ohsugi et al., 2019). Models using the single-span extractive method show effectiveness on datasets where the abstractive behavior of answers mostly involves small modifications to spans in the context (Ohsugi et al., 2019; Yatskar, 2019). For datasets with deeply abstractive answers, however, this method fails to yield promising results.

The first attempts to generate the answer in a generative way applied an RNN-based seq2seq attentional model to synthesize the answer, such as S-NET (Tan et al., 2018); seq2seq learning was first introduced by Sutskever et al. (2014) for machine translation. The most recent models adopt a hybrid neural network, the Pointer-Generator (See et al., 2017), to generate the answer, such as ConZNet (Indurthi et al., 2018), MHPGM (Bauer et al., 2018) and Masque (Nishida et al., 2019). The Pointer-Generator was first proposed for abstractive text summarization; it can copy words from the source via the pointer network while retaining the ability to produce novel words through the generator. Different from ConZNet and MHPGM, Masque adopts a Transformer-based (Vaswani et al., 2017) Pointer-Generator, while the previous ones utilize GRUs (Cho et al., 2014) or LSTMs (Hochreiter and Schmidhuber, 1997).
For each question-answer pair, a multi-passage MRC dataset contains more than one passage as the reading context; examples include SearchQA (Dunn et al., 2017), TriviaQA (Joshi et al., 2017), MS MARCO, and DuReader. Existing approaches designed specifically for multi-passage MRC can be classified into two categories: pipeline and end-to-end. Pipeline-based models (Chen et al., 2017; Wang et al., 2018a; Clark and Gardner, 2018) adopt a ranker to first rank all the passages based on their relevance to the question and then utilize a question-answering module to read the selected passages. The ranker can be based on traditional information retrieval methods (BM25 or TF-IDF) or employ a neural re-ranking model. End-to-end models (Wang et al., 2018b; Tan et al., 2018; Nishida et al., 2019) read all the provided passages at the same time and produce for each passage a candidate answer with a score, which is subsequently compared across passages to find the final answer. Passage ranking and answer prediction are usually done jointly as multi-task learning. More recently, Yan et al. (2019) proposed a cascade learning model to balance the effectiveness and efficiency of the two approaches mentioned above.
Employing pre-trained language models has become common practice for tackling MRC. The appearance of more elaborate architectures, larger corpora, and better-designed pre-training objectives has sped up the achievement of new states of the art in MRC (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020). Moreover, Glass et al. (2019) adopt span selection, an MRC task, as an auxiliary pre-training task. Another mainstream line of research attempts to drive improvements during fine-tuning, which includes integrating better verification strategies for unanswerable questions (Zhang et al., 2020), leveraging external knowledge for commonsense reasoning (Yang et al., 2019; Lin et al., 2019), or cooperating with a graph network for multi-hop reading comprehension (Qiu et al., 2019; Ding et al., 2019).
Conclusion
In this work, we present a novel solution to generative MRC, the multi-span style extraction framework (MUSST), and show it is capable of alleviating the incomplete-answer and redundancy problems that arise when generating an answer. We apply our model to a challenging abstractive MRC dataset, MS MARCO v2.1, and significantly outperform the single-span extraction baseline. This work indicates a new research line for generative MRC in addition to the two existing methods, single-span extraction and sequence generation. With the support of only a standalone ranking classifier, our proposed method still gives an overall performance approaching the state of the art, showing great potential.
References
Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv preprint arXiv:1611.09268.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for Generative Multi-Hop Question Answering Tasks. In Empirical Methods in Natural Language Processing (EMNLP), pages 4220–4230.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Association for Computational Linguistics (ACL), pages 1870–1879.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Christopher Clark and Matt Gardner. 2018. Simple and Effective Multi-Paragraph Reading Comprehension. In Association for Computational Linguistics (ACL), pages 845–855.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186.

Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019. Cognitive Graph for Multi-Hop Reading Comprehension at Scale. In Association for Computational Linguistics (ACL), pages 2694–2703.

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv preprint arXiv:1704.05179.

Michael Glass, Alfio Gliozzo, Rishav Chakravarti, Anthony Ferritto, Lin Pan, G. P. Shrivatsa Bhargav, Dinesh Garg, and Avirup Sil. 2019. Span Selection Pre-training for Question Answering. arXiv preprint arXiv:1909.04120.

Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2018. DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 37–46.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Sathish Reddy Indurthi, Seunghak Yu, Seohyun Back, and Heriberto Cuayáhuitl. 2018. Cut to the Chase: A Context Zoom-in Network for Reading Comprehension. In Empirical Methods in Natural Language Processing (EMNLP), pages 570–575.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Association for Computational Linguistics (ACL), pages 1601–1611.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics (TACL), 6:317–328.

Taku Kudo and John Richardson. 2018. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations (ICLR).

Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2829–2839.

Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Kyosuke Nishida, Itsumi Saito, Kosuke Nishida, Kazutoshi Shinoda, Atsushi Otsuka, Hisako Asano, and Junji Tomita. 2019. Multi-style Generative Reading Comprehension. In Association for Computational Linguistics (ACL), pages 2273–2284.

Yasuhito Ohsugi, Itsumi Saito, Kyosuke Nishida, Hisako Asano, and Junji Tomita. 2019. A Simple but Effective Method to Incorporate Multi-turn Context with BERT for Conversational Machine Comprehension. In Proceedings of the First Workshop on NLP for Conversational AI, pages 11–17.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Association for Computational Linguistics (ACL), pages 311–318.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NIPS), pages 8024–8035.

Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically Fused Graph Network for Multi-hop Reasoning. In Association for Computational Linguistics (ACL), pages 6140–6150.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Association for Computational Linguistics (ACL), pages 784–789.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics (TACL), 7:249–266.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In Association for Computational Linguistics (ACL), pages 1073–1083.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional Attention Flow for Machine Comprehension. In International Conference on Learning Representations (ICLR).

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.

Chuanqi Tan, Furu Wei, Nan Yang, Bowen Du, Weifeng Lv, and Ming Zhou. 2018. S-Net: From Answer Extraction to Answer Synthesis for Machine Reading Comprehension. In Association for the Advancement of Artificial Intelligence (AAAI).

Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018a. Multi-Granular Sequence Encoding via Dilated Compositional Units for Reading Comprehension. In Empirical Methods in Natural Language Processing (EMNLP), pages 2141–2151.

Yi Tay, Anh Tuan Luu, Siu Cheung Hui, and Jian Su. 2018b. Densely Connected Attention Propagation for Reading Comprehension. In Advances in Neural Information Processing Systems (NIPS), pages 4906–4917.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems (NIPS), pages 5998–6008.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018a. R3: Reinforced Ranker-Reader for Open-Domain Question Answering. In Association for the Advancement of Artificial Intelligence (AAAI).

Yizhong Wang, Kai Liu, Jing Liu, Wei He, Yajuan Lyu, Hua Wu, Sujian Li, and Haifeng Wang. 2018b. Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification. In Association for Computational Linguistics (ACL), pages 1918–1927.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2020. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.

Ming Yan, Jiangnan Xia, Chen Wu, Bin Bi, Zhongzhou Zhao, Ji Zhang, Luo Si, Rui Wang, Wei Wang, and Haiqing Chen. 2019. A Deep Cascade Model for Multi-Document Reading Comprehension. In Association for the Advancement of Artificial Intelligence (AAAI), volume 33, pages 7354–7361.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension. In Association for Computational Linguistics (ACL), pages 2346–2357.

Mark Yatskar. 2019. A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 2318–2323.

Zhuosheng Zhang, Junjie Yang, and Hai Zhao. 2020. Retrospective Reader for Machine Reading Comprehension. arXiv preprint arXiv:2001.09694.

A Appendices
A.1 Training details
We trained the passage ranker and the question-answering module of MUSST-NLG on a machine with four Tesla P40 GPUs. The question-answering module of MUSST-QA is trained with eight GeForce GTX 1080 Ti GPUs. It takes roughly 9 hours to train the passage ranker. For the question-answering modules of MUSST-NLG and MUSST-QA, the training times are about 10 hours and 17 hours, respectively. The full set of hyperparameters is listed in Table 8.
Hyperparameter | Ranker | MUSST-QA | MUSST-NLG
Learning rate | 1e-5 | 3e-5 | 3e-5
Learning rate decay | Linear | Linear | Linear
Training epochs | 3 | 3 | 5
Warmup rate | 0.1 | 0.1 | 0.1
Adam ε | – | – | –
Adam β1 | – | – | –
Adam β2 | 0.999 | 0.999 | 0.999
MSN | 256 | 256 | 256
Batch size | 128 | 32 | 32
Encoder dropout rate | 0 | 0 | 0
Classifier dropout rate | 0.1 | 0.1 | 0.1
Weight decay | 0.01 | 0.01 | 0.01

Table 8: Hyperparameters for training the different modules of our framework.