EfficientQA: a RoBERTa-Based Phrase-Indexed Question-Answering System
Sofian Chaybouti, Achraf Saghe, Aymen Shabou
DataLab Groupe, Crédit Agricole S.A.
Montrouge, France
[email protected], {achraf.saghe, aymen.shabou}@credit-agricole-sa.fr

Abstract
State-of-the-art extractive question answering models achieve superhuman performances on the SQuAD benchmark. Yet, they are unreasonably heavy and need expensive GPU computing to answer questions in a reasonable time. Thus, they cannot be used for real-world queries on hundreds of thousands of documents in the open-domain question answering paradigm. In this paper, we explore the possibility of transferring the natural language understanding of language models into dense vectors representing questions and answer candidates, in order to make the task of question answering compatible with a simple nearest-neighbor search. This new model, which we call EfficientQA, takes advantage of the pair-of-sequences input of BERT-based models (Devlin et al., 2019) to build meaningful dense representations of candidate answers, which are extracted from the context in a question-agnostic fashion. Our model achieves state-of-the-art results on Phrase-Indexed Question Answering (PIQA) (Seo et al., 2018b), beating the previous state of the art (Seo et al., 2019) by 1.3 points in exact-match and 1.4 points in F1-score. These results show that dense vectors can embed very rich semantic representations of sequences, even though they were built from language models not originally trained for this use case. Training language models better adapted to building dense representations of phrases is thus one avenue towards more resource-efficient NLP systems.
1 Introduction

Question answering is the discipline that aims to build systems that automatically answer questions posed by humans in natural language. In the extractive question answering paradigm, the answers to a question are spans of text extracted from a single document. In the well-known SQuAD benchmark (Rajpurkar et al., 2016), for instance, each answer lies in a paragraph from Wikipedia. In the open-domain setting, the answers are sought in a large collection of texts, such as the whole English Wikipedia (Chen et al., 2017).

State-of-the-art performances in standard question answering are achieved thanks to powerful and heavy pretrained language models that rely on sophisticated attention mechanisms and hundreds of millions of parameters. Attention mechanisms (Bahdanau et al., 2016) are key components of such systems, since they allow building contextualized, question-aware representations of the words in the documents and extracting the span of text that is most likely the correct answer. These models are very resource-demanding and need GPUs to be scalable. Thus, they are unsuitable for real open-domain use cases, where the model has to be applied to hundreds of thousands of documents, even with a multi-GPU server.

A first approach to this issue is to apply a filter based on a statistical method such as tf-idf vectors (Sparck Jones, 1988) or the BM25 algorithm (Robertson and Sparck Jones, 1976), and then call the heavy model on a few dozen paragraphs. This approach is still prohibitive with CPU-only resources, for instance.

(Seo et al., 2018b) introduced a new benchmark, called Phrase-Indexed Question Answering, which adds a constraint to the usual extractive question answering task: document and question encoders are forced to be independent (figure 1). First, a document is processed offline to provide a vector representation for each answer candidate. Then, in the online step, the query is processed and mapped to its own vector representation, and the answer to the query is obtained by retrieving the candidate vector nearest to the query vector. The general form of such an approach to open-domain question answering can be reformulated as follows: first, all candidate answers from all documents of the database are indexed offline; then, at inference time, the question is encoded and the best candidate is retrieved by a simple nearest-neighbor search. This improves the scalability of QA systems, since a single forward pass through the deep learning model is needed to encode the question, instead of one per document in previous settings.

In this paper, we propose a new algorithm to solve the PIQA benchmark (figure 1) and to close the gap with classic QA models. Our approach takes advantage of BERT-based models in two ways. First, it extracts the potential answer candidates in a question-agnostic fashion. Second, it takes two sequences as input to build powerful semantic representations of candidate answers. Finally, it trains a siamese network to map candidate answers and queries into the same vector space. Our model performs well, beating DENSPI (Seo et al., 2019), the previous state of the art on the PIQA benchmark, by 1.4 points in F1-score and 1.3 points in exact-match, while being less resource-demanding at both training and inference time. It requires indexing only about a hundred answer-candidate dense vectors per context and finetuning a RoBERTa-base (Liu et al., 2019) model, while DENSPI uses a BERT-large model.

Figure 1: The PIQA challenge from (Seo et al., 2018b)
2 Related Work

2.1 Extractive Question Answering

The construction of vast question answering datasets, particularly the SQuAD benchmark (Rajpurkar et al., 2016), has led to end-to-end deep learning models successfully solving this task; (Seo et al., 2018a), for instance, is one of the first end-to-end models achieving impressive performances. More recently, finetuning powerful language models like BERT (Devlin et al., 2019) made it possible to achieve superhuman performances on this benchmark. In SpanBERT (Joshi et al., 2020), the pretraining task of the language model is masked span prediction instead of masked word prediction, in order to be better adapted to the downstream task of QA, which consists in span extraction. All these models rely on the same paradigm: building query-aware vector representations of the words in the context. This fundamental idea makes them unsuitable for the open-domain setting.

2.2 Open-Domain Question Answering

(Chen et al., 2017) introduced the open-domain question answering setting, which aims to use the entire English Wikipedia as a knowledge source to answer factoid natural language questions. This setting brings the challenge of building systems able to perform machine reading comprehension at scale. Most recent work explores the following pipeline. First, the documents of the collection are indexed (or encoded) using either statistical methods like BM25 or dense representations. Then, a few dozen of them are retrieved by similarity search between documents and questions (Karpukhin et al., 2020). Finally, a deep learning model trained for machine reading comprehension is applied to find the answer. This approach has been developed in a number of papers (Chen et al., 2017; Raison et al., 2018; Min et al., 2018; Wang et al., 2017; Lee et al., 2018; Yang et al., 2019). It takes advantage of very powerful state-of-the-art language models, but has the drawback of being resource-demanding. Moreover, its performance is capped by the capabilities of the document retrieval step of the pipeline.

2.3 The PIQA Challenge

(Seo et al., 2018b) introduced the Phrase-Indexed Question Answering (PIQA) benchmark in order to make machine reading comprehension scalable. This benchmark enforces independent encoding of the question and of the document's answer candidates, in order to reduce the question answering task to a simple similarity search task. Closing the gap between such systems and the very powerful models relying on query-aware context representations would be a great step towards solving the open-domain question answering scalability challenge. The proposed baselines use LSTM encoders trained in an end-to-end fashion; while achieving encouraging results, their performances are far from those of state-of-the-art attention-based models.
DENSPI (Seo et al., 2019) is the current state of the art on the PIQA benchmark. This system uses the BERT-large language model to train a siamese network able to encode questions and indexed answer candidates independently. To represent candidate answers, DENSPI builds dense representations from the start and end positions of each indexed phrase. DENSPI is also evaluated on the SQuAD-open benchmark (Chen et al., 2017); while being significantly faster than other systems, it needs to be augmented with sparse representations of documents to be on par with them in terms of performance.
Ocean-Q (Fang et al., 2020) proposes an interesting approach to solve both the PIQA and open-domain QA benchmarks by building an ocean of question-answer pairs using question generation and query-aware QA models. When a question is asked, the most similar question in the ocean is retrieved through token similarity. This approach avoids the question-encoding step while being on par with previous models on the SQuAD-open benchmark and significantly above the baselines on the PIQA challenge.
3 Model

In this section, we develop the model and the algorithm used to solve the task.
3.1 Problem Definition

The problem tackled in this paper is Phrase-Indexed Question Answering. Vanilla question answering is the task of building systems able to answer natural language questions with spans of text lying in the documents (see figure 6 for an example). Formally, the goal is to design a function $F$ mapping a question $Q$ and a context $C$, represented by sequences of tokens $\{q_1, q_2, \dots, q_n\}$ and $\{c_1, c_2, \dots, c_m\}$ respectively, to a subsequence of $C$ as the answer $A = \{a_1, a_2, \dots, a_p\}$:

F(Q, C) = A    (1)

In PIQA, $F$ is constrained to be an argmax, over a set of answer candidates $\{A_1, A_2, \dots, A_k\}$ ($k$ subsequences of the context $C$), of a similarity product between the encoding $G(Q) \in \mathbb{R}^l$ of the question and the encoding $H(A_i) \in \mathbb{R}^l$ of each candidate, where $l$ is the encoding size:

A = \mathrm{argmax}_{A_i} \; G(Q) \cdot H(A_i)    (2)
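To make eq. 2 concrete, the sketch below shows the online inference step, under the assumption that the candidate vectors H(A_i) have already been indexed offline; the names and shapes are illustrative, not the authors' implementation.

import numpy as np

def retrieve_answer(question_vec, candidate_vecs, candidates):
    """Eq. 2: return the candidate whose encoding has the highest
    inner product with the question encoding."""
    scores = candidate_vecs @ question_vec   # (k,) similarity scores
    return candidates[int(np.argmax(scores))]

# Hypothetical usage with k = 100 candidates of encoding size l = 768:
# G_q: np.ndarray of shape (768,); H: np.ndarray of shape (100, 768)
# answer = retrieve_answer(G_q, H, candidate_phrases)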
3.2 Question-Agnostic Extraction of Answer Candidates

The first step toward building the system is to define the set of answer candidates. A naive approach would be to consider all possible spans in a given context $C$ of length $m$. This would give $m(m+1)/2$ possible candidates, i.e. about 125,000 candidates per context if we assume contexts of about 500 tokens.

Figure 2: Agnostic extraction of answer candidates
Only a limited number of all possible spans are plausible answers to any question. Thus, we reduce the set of candidates by training a question-agnostic answer candidate extraction model (figure 2). Formally, the context $C$ is mapped to the set of candidates $\{A_1, \dots, A_k\}$ by a function $f$. To do so, a RoBERTa-base (Liu et al., 2019) model is finetuned, taking the context as input and supervised by the answers provided in the SQuAD dataset. To extract the candidates, we use a beam search algorithm: the $s$ most likely candidate starts are first extracted by a dense layer; then their embeddings are concatenated to each context word embedding and fed into another dense layer to extract the $e$ most likely candidate ends associated with each start position, as shown in figure 3. We thus end up with $s \times e$ answer candidates. Ablation studies, developed in further sections, show that feeding the start position embeddings when extracting the end positions yields better answer candidates.

Figure 3: Agnostic extraction of answer candidates with beam search. Paragraph tokens are provided to the language model to produce their embeddings; a first dense layer identifies the s most likely start positions of candidates. The embeddings of the paragraph's tokens are concatenated to each start position embedding, and a second dense layer identifies the e most likely end positions associated with each start position. We end up with s × e possible spans.
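The following is a minimal PyTorch sketch of this two-stage beam decoding, assuming contextual token embeddings from the finetuned RoBERTa are given; the layer names, beam sizes, and shapes are assumptions for illustration, not the exact implementation.

import torch
import torch.nn as nn

HIDDEN = 768  # RoBERTa-base hidden size

start_head = nn.Linear(HIDDEN, 1)      # scores each token as a candidate start
end_head = nn.Linear(2 * HIDDEN, 1)    # scores each token as an end, given a start

def extract_candidates(token_emb, s=10, e=10):
    """token_emb: (m, HIDDEN) contextual embeddings of one paragraph.
    Returns s * e (start, end) candidate spans."""
    m = token_emb.size(0)
    start_scores = start_head(token_emb).squeeze(-1)     # (m,)
    top_starts = start_scores.topk(s).indices            # s most likely starts
    spans = []
    for st in top_starts.tolist():
        # concatenate the start embedding to every token embedding
        paired = torch.cat([token_emb,
                            token_emb[st].expand(m, HIDDEN)], dim=-1)
        end_scores = end_head(paired).squeeze(-1)        # (m,)
        for en in end_scores.topk(e).indices.tolist():   # e ends per start
            spans.append((st, en))
    return spans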
3.3 Encoding Questions and Answer Candidates

After defining the set over which the argmax function will be applied, we need to build the encoding functions for both the questions and the candidate answers. To this end, we finetune a RoBERTa-base as a siamese network (Koch et al., 2015), so that questions and candidates are mapped to the same Euclidean space:

G \simeq H    (3)

To build powerful answer candidate representations, we take advantage of the pair-of-sequences input of pretrained BERT-based models: the context is provided as the first sequence and the candidate as the second sequence of the encoder, as shown in figure 4. The embeddings of the tokens are then passed through a dense layer, and their final embeddings are averaged to produce the encoding of the candidate.
To build its representation, the question is passed through the same network as the context-candidate pair, and the embeddings of all tokens are averaged, as shown in figure 5.
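A sketch of the two encoding paths of figures 4 and 5, using the Hugging Face transformers API; the pooling and projection follow the description above, but treat this as an approximation under stated assumptions rather than the exact training code.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
lm = AutoModel.from_pretrained("roberta-base")
proj = torch.nn.Linear(lm.config.hidden_size, lm.config.hidden_size)

def encode_candidate(context: str, candidate: str) -> torch.Tensor:
    """H(A_i): encode the (context, candidate) pair of sequences,
    project each token embedding, and average-pool (figure 4)."""
    inputs = tok(context, candidate, return_tensors="pt", truncation=True)
    emb = lm(**inputs).last_hidden_state        # (1, seq_len, hidden)
    return proj(emb).mean(dim=1).squeeze(0)     # (hidden,)

def encode_question(question: str) -> torch.Tensor:
    """G(Q): same language model and dense layer, siamese weights (figure 5)."""
    inputs = tok(question, return_tensors="pt", truncation=True)
    emb = lm(**inputs).last_hidden_state
    return proj(emb).mean(dim=1).squeeze(0)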
3.4 Training

When training the question-agnostic candidate extraction model, we use the cross-entropy loss over the start and end positions, just like most deep neural networks trained for vanilla question answering, but without adding the question information, as described in eq. 4:

L(C; \Theta) = -\log P(s^*; \Theta) - \log P(e^*; \Theta)    (4)

Figure 4: Answer candidate dense vectors. The paragraph's tokens and the candidate's tokens, separated by the special SEP token, are provided to the language model. The final embeddings are passed through an additional dense layer and averaged to produce the candidate dense representation.

To train the siamese network that builds the questions' and candidates' representations, we use the candidates extracted previously. When the correct answer $A^*$ is among these candidates, we minimize the loss described in eq. 5, where $\Gamma$ represents the parameters of the network:

L(Q, A_i; \Gamma) = -H(A^*) \cdot G(Q) + \log \sum_i \exp(H(A_i) \cdot G(Q))    (5)

Figure 5: Question dense vectors. The question's tokens are passed to the same language model as the candidates', and their final embeddings are passed through the same dense layer. The vectors are then averaged to produce the question dense representation.
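Eq. 5 is the usual softmax cross-entropy over inner-product scores, with the extracted candidates acting as negatives for the gold answer. A minimal sketch, assuming the index of A* among the candidates is known:

import torch
import torch.nn.functional as F

def siamese_loss(q_vec, cand_vecs, gold_idx):
    """q_vec: (l,) question encoding G(Q); cand_vecs: (k, l) candidate
    encodings H(A_i); gold_idx: position of the correct answer A*.
    Eq. 5: -H(A*)·G(Q) + log sum_i exp(H(A_i)·G(Q))."""
    scores = cand_vecs @ q_vec                        # (k,) inner products
    return F.cross_entropy(scores.unsqueeze(0),
                           torch.tensor([gold_idx]))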
4 Experiments

In this section, we present our experiments and results.

4.1 Datasets

SQuAD v1.1 (figure 6) (Rajpurkar et al., 2016) is a reading comprehension dataset consisting of 100,000+ question-answer pairs built from Wikipedia paragraphs. Our model was trained on the train set (87,599 pairs) and evaluated on the development set (10,570 pairs).
Figure 6: Question-answer pairs for a passage in the SQuAD dataset (figure taken from (Rajpurkar et al., 2016))
In recent years, efforts have been made to democratize powerful NLP tools beyond the English language. To this end, new datasets in other languages have been designed. The French Question Answering Dataset, FQuAD (figure 7), is one of them (d'Hoffschmidt et al., 2020). FQuAD is a French question answering corpus built from 326 Wikipedia articles, whose train and development sets consist of 20,731 and 5,668 question-answer pairs respectively.
4.2 Training Details

To train the agnostic extraction model, we used a learning rate of 1e-4 with a batch size of 32 and the AdamW (Loshchilov and Hutter, 2019) optimization algorithm.
To build the dataset used to train and evaluate the siamese model, we use our agnostic extraction model to retrieve 60 candidates for each question-context pair. Each time the correct answer is present in the extracted set of candidates, the whole example is added to the train set. To evaluate the model, we extract 100 candidates for each question-context pair.

Figure 7: Question-answer pairs for a given paragraph in the FQuAD dataset.

The training of the siamese network took approximately one week for 5 epochs on a single 24GB NVIDIA Quadro RTX 6000 GPU. We used a learning rate of 1e-5 with the AdamW optimizer and a linear scheduler. We also used mixed-precision training (Micikevicius et al., 2018) to reduce time requirements, along with 8 steps of gradient accumulation and a batch size of 4, which is equivalent to a training batch size of 32.
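Below is a sketch of a training loop combining mixed precision with 8-step gradient accumulation, as described above; model, loader, and compute_loss are hypothetical placeholders, and the warmup setting is an assumption.

import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import get_linear_schedule_with_warmup

def train(model, loader, compute_loss, total_steps,
          accum_steps=8, lr=1e-5):
    # batch size 4 with 8 accumulation steps = effective batch size 32
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=total_steps)
    scaler = GradScaler()
    for step, batch in enumerate(loader):
        with autocast():                      # mixed-precision forward pass
            loss = compute_loss(model, batch) / accum_steps
        scaler.scale(loss).backward()         # accumulate scaled gradients
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)            # unscale and apply the update
            scaler.update()
            scheduler.step()
            optimizer.zero_grad()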
4.3 Ablation Study

In this section, we justify the architecture of the answer candidate extraction model. We might indeed have chosen a simpler architecture, where the start position and end position likelihoods are computed independently, as shown in figure 8. While the two models show equivalent results in vanilla question answering, the architecture we have chosen provides a much better set of candidates: given the same number of selected candidates, the correct answer is far more often present, as shown in table 1. The architectures are evaluated with exact-match and F1-score over all selected candidates.

Figure 8: Classic architecture for candidate extraction.

For a fair comparison, we evaluate the classic architecture using both optimal decoding and beam search. When doing beam search, we use $s$ as the beam size for start positions and $e$ as the beam size for end positions. We explain the differences in performance by the fact that the dependent computation of start and end positions provides better constituents, which are more likely to be answers to questions. Indeed, the likelihood of a candidate is better modeled in this case:

P(s, e) = P(s) \times P(e \mid s)

while in the classic architecture:

P(s, e) = P(s) \times P(e)

model                                   exact-match   f1-score
classic architecture                    65.1          80.3
classic architecture with beam search   54.1          72.4
our architecture

Table 1: Comparison between classic decoding and ours, for 100 extracted answer candidates
4.4 Results on the PIQA Benchmark

Table 2 shows the results obtained by various systems on the PIQA challenge. We observe that EfficientQA beats the previous state of the art, DENSPI (+1.3 in exact-match and +1.4 in F1-score), even though the encoding method of the latter is based on the large version of BERT (340 million parameters) while ours is based on RoBERTa-base (125 million parameters). We explain these performances, on the one hand, by the quality of the representations and, on the other hand, by the fact that agnostic extraction drastically reduces the size of the set in which we look for the right answer. Hence, it leaves less room for error.
model                                                EM     F1
1st baseline: LSTM + SA (Seo et al., 2018b)          49.0   59.8
2nd baseline: LSTM + SA + ELMo (Seo et al., 2018b)   52.7   62.7
DENSPI (Seo et al., 2019)                            73.6   81.7
Ocean-Q (Fang et al., 2020)                          63.0   70.5
EfficientQA (ours)                                   74.9   83.1
RoBERTa (vanilla QA, our run) (Liu et al., 2019)     83.0   90.4

Table 2: Results on SQuAD v1.1
4.5 Results on the FQuAD Benchmark

CamemBERT (Martin et al., 2020) is a pretrained French language model based on the RoBERTa architecture. We use it to build the dense representations of the French version of EfficientQA. Table 3 presents the results of EfficientQA on the FQuAD benchmark. (d'Hoffschmidt et al., 2020) finetuned CamemBERT to perform vanilla question answering on their dataset. The results show that the gap between EfficientQA and finetuned models is smaller in English than in French. This might be explained by the significantly lower volume of training data in French than in English.
model                                               exact-match   f1-score
EfficientQA (PIQA)                                  64.4          76.0
CamembertQA (vanilla QA, our run)                   77.6          87.3
(d'Hoffschmidt et al., 2020)

Table 3: Performances of EfficientQA on the FQuAD benchmark

5 Conclusion

In this paper, we introduced EfficientQA, a phrase-indexed approach to question answering. Our system relies on a question-agnostic extraction of candidates, which drastically reduces the set of possible answers and takes advantage of the pair-of-sequences input of a RoBERTa-base pretrained language model. EfficientQA achieves state-of-the-art performance on the PIQA benchmark and keeps closing the gap with vanilla question answering models, and there is still room for further improvement by using heavier pretrained language models to build the dense representations of questions and candidates. Future research will focus on mobilizing the resources necessary to extend EfficientQA representations to index a whole corpus, such as the entire English Wikipedia, and speed up open-domain question answering.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural machine translation by jointly learning to align and translate.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.

Martin d'Hoffschmidt, Wacim Belblidia, Tom Brendlé, Quentin Heinrich, and Maxime Vidal. 2020. FQuAD: French question answering dataset.

Yuwei Fang, Shuohang Wang, Zhe Gan, Siqi Sun, and Jingjing Liu. 2020. Accelerating real-time question answering via question generation.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans.

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering.

Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition.

Jinhyuk Lee, Seongjun Yun, Hyunjae Kim, Miyoung Ko, and Jaewoo Kang. 2018. Ranking paragraphs for improving answer recall in open-domain question answering.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed precision training.

Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and robust question answering from minimal context over documents.

Martin Raison, Pierre-Emmanuel Mazaré, Rajarshi Das, and Antoine Bordes. 2018. Weaver: Deep co-encoding of questions and documents for machine reading.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text.

S. E. Robertson and K. Sparck Jones. 1976. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129–146.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2018a. Bidirectional attention flow for machine comprehension.

Minjoon Seo, Tom Kwiatkowski, Ankur P. Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2018b. Phrase-indexed question answering: A new challenge for scalable document comprehension.

Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. Real-time open-domain question answering with dense-sparse phrase index. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4430–4441, Florence, Italy. Association for Computational Linguistics.

Karen Sparck Jones. 1988. A Statistical Interpretation of Term Specificity and Its Application in Retrieval, pages 132–142. Taylor Graham Publishing, GBR.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. 2017. R^3: Reinforced ranker-reader for open-domain question answering.

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations).