Role of Attentive History Selection in Conversational Information Seeking
Somil Gupta†
[email protected]
University of Massachusetts Amherst

Neeraj Sharma†
[email protected]
University of Massachusetts Amherst

†Both authors contributed equally to this research.

ABSTRACT
The rise of intelligent assistant systems like Siri and Alexa has led to the emergence of Conversational Search, a research track of Information Retrieval (IR) that involves interactive and iterative information-seeking user-system dialog. The recently released OR-QuAC and TREC CAsT 2019 datasets narrow the research focus to the retrieval aspect of conversational search, i.e., fetching the relevant documents (passages) from a large collection using the conversational search history. Currently proposed models for these datasets incorporate history in retrieval by appending the last $N$ turns to the current question before encoding. We propose to use another history selection approach that dynamically selects and weighs history turns using an attention mechanism for question embedding. The novelty of our approach lies in experimenting with a soft-attention-based history selection approach in an open-retrieval setting.

KEYWORDS
Conversational Information Seeking, Conversational Search, Conversational History, Attention
ACM Reference Format:
Somil Gupta and Neeraj Sharma. 2021. Role of Attentive History Selection in Conversational Information Seeking. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Conversational search is a long-standing goal of the Information Retrieval (IR) community, which envisions IR systems that can ascertain and satisfy user information needs iteratively and interactively. The emergence and subsequent popularity of intelligent assistant systems like Siri, Alexa, AliMe, Cortana, and Google Assistant has further fuelled research and investment in this domain. However, conversational search by itself encompasses numerous settings, each with its own problems to tackle, so researchers tend to narrow their focus to specific sub-problems. One such setting is "System Ask, User Respond" [15], where the system can proactively ask the user to clarify their information need. A complementary setting involves the system responding with the answer without any clarifying questions. Conversational Question Answering (ConvQA) [10] is an example of this setting, where the system answers user questions using the given evidence. CoQA [12] and QuAC [1] are common ConvQA datasets.
However, these datasets primarily focus on answer formulation given the evidence and ignore the retrieval aspect of the problem. The recently released open-retrieval datasets OR-QuAC [9] and TREC CAsT 2019 [2] focus on the retrieval aspect of the conversational information-seeking problem. The focus of our research will be on these open-retrieval conversational question answering models [9].

History modelling and selection is an important aspect of conversational search: it dictates how the previous conversational turns are incorporated into the model while the model reasons on the current query. Different modelling and selection approaches are used across ConvQA models [5]. The current open-retrieval models append the last $N$ turns of the conversation to the current question and expect the model to derive context. This assumes that the immediate turns contain all the important contextual information for the current question. However, conversational dialogs may have topic shifts (the current question is not immediately relevant to something previously discussed) or topic returns (the current question asks about a previously shifted topic) [14]. Instead, we explore another history selection technique, suggested by Qu et al. [11] for the ConvQA setting, which employs an attention mechanism to find the relative importance of each turn to the current question. Such a technique makes no assumption about which turns matter and lets the model attend over all the history turns to decide which ones to include. Since the history attention mechanism improved results in the ConvQA setting, it is worth experimenting with it in the retrieval setting to identify the relative weight of each history question while preparing the combined question embedding.

2 TASK DEFINITION
The task is to retrieve a ranked list of passages in response to a question, using the previous questions in the conversation as context. Formally, the problem can be defined as follows. Given a dialog $D$ with conversation turns (questions) $Q = \{q_1, q_2, \ldots, q_n\}$ and a collection $C$, the task is to retrieve a ranked list of passages $\{p_j\}$ from $C$ based on their relevance to each question $q_k \in Q$, using the preceding context $\{q_i\}_{i=1}^{k-1}$ weighted by $\{\alpha_i\}_{i=1}^{k-1}$, where $\alpha_i$ is the attention weight computed for turn $i$. The soft attention $\alpha_i$ in this case denotes the relative co-reference of the current question $q_k$ to the history question $q_i$. We ignore the reader and ranker components of the OR-QuAC implementation [9], as our primary focus is to modify history selection for open retrieval.

3 RELATED WORK
Conversational Information Seeking:
In open-retrieval settings, [8], [3], and [6] adopt a dual-encoder architecture to construct a learnable retriever and demonstrate that their methods are scalable to large collections, but these works are limited to single-turn QA settings. In [9], the authors present approaches for the open-retrieval question answering setting over multiple conversational turns. In an open-retrieval setting, the retrieval process is open in the sense of retrieving from a large collection instead of re-ranking a small number of passages in a closed set. Their model elaborately defines a retriever, a re-ranker, and a reader based on the Transformer model [13]. History modelling is done by concatenating the previous $N$ history questions to the current question. Our approach uses their model as the underlying implementation; however, we disable the reader and ranker modules (enabling only retrieval) and use an attention mechanism for history selection in the retriever module.

History selection:
Our attention-based history selection approach is inspired by the work of Qu et al. [11], who employ a soft attention mechanism to select and weigh history turns based on how helpful they are in answering the current question in the ConvQA setting. However, while they applied attentive history selection to ConvQA (answer formulation), we employ it for retrieval, to form a single query embedding for matching passages.
Figure 1: A high-level illustration of the Fine-grained History Attentive Retriever (HAR). Each OR-QuAC instance is converted into an instance-aware sub-batch of size equal to the number of turns in the instance, where each row of this batch contains the tokens of the current question $q_k$, a history turn $q_{i<k}$, and the CANARD-rewritten first question $q_1$. This batch is passed through an ALBERT query encoder to create a sequence-level contextualized representation ($CLS_i$) and token-level representations for each turn. The sequence-level representations of all history turns in the batch are used to generate soft attention weights $\alpha_{i<k}$ using softmax-style attention. Weighted averaging at the token level across the batch on the contextualized token representations, followed by subsequent averaging, produces the dense question vector. In a separate workflow, each passage in the collection $C$ is encoded offline to produce dense vectors. A similarity function (in our case, the dot product) is finally used to obtain relevance scores per passage for the given question $q_k$ and history $\{q_i\}_{i=1}^{k-1}$.

4 OUR APPROACH
We use the same underlying architecture and training/evaluation process as OR-ConvQA [9], using its model as our baseline and its code repository (https://github.com/prdwb/orconvqa-release) as our starting point. We add three major modifications to the baseline model implementation.
(1) We modify the baseline retriever to use a soft attention mechanism for modelling history into the current query during concurrent learning. Note that no changes are made during retriever pre-training, as it is trained on CANARD [4] rewrites instead of QuAC [1] questions and therefore has complete context in the query itself. We call this modified retriever the History Attentive Retriever (HAR).
(2) We need to compute attention using the contextualized output of our query encoder for each history turn, which requires the input batches to contain all history turns of an instance in a single batch. We therefore change the batch preprocessing for the retriever model and, following HAM [11], call this batching process instance-aware batching (a sketch follows below).
(3) Since the focus of our experiment is on the retrieval process in concurrent learning, we modify the model checkpoint selection process during the evaluation phase to use retriever recall instead of the Reader-F1 score as the selection metric.
A high-level description (and illustration) of the modified retrieval model is provided in Figure 1. The following sub-sections discuss important aspects of our approach.
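To make the batching concrete, the following is a minimal sketch of assembling one instance-aware sub-batch. It assumes the HuggingFace transformers ALBERT tokenizer and invented example questions, and approximates the [CLS] q1 [SEP] qi [SEP] qk layout with the tokenizer's standard two-segment format, so it is an illustration rather than our exact preprocessing.

```python
import torch
from transformers import AlbertTokenizer

# Invented example conversation: rewritten first question, history, current question.
first_q = "Who founded the Roman Empire?"
history = ["Who founded the Roman Empire?", "When did he rule?"]
current_q = "What happened after his death?"

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
max_len = 128  # illustrative; in practice this is the max ALBERT sequence length

# One row per history turn, pairing (first question + turn question) with the
# current question; the tokenizer adds the [CLS] and [SEP] markers itself.
rows = [
    tokenizer(first_q + " " + q_i, current_q, max_length=max_len,
              padding="max_length", truncation=True, return_tensors="pt")
    for q_i in history
]

# Stack the rows into a single instance-aware batch of shape (I, L).
input_ids = torch.cat([r["input_ids"] for r in rows], dim=0)
attention_mask = torch.cat([r["attention_mask"] for r in rows], dim=0)
print(input_ids.shape)  # torch.Size([2, 128]): one row per history turn
```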
4.1 Instance-Aware Batching
An OR-QuAC retriever instance consists of the current question and the history turns (questions, and the answers to those questions). For each instance, OR-QuAC creates a sequence of the current question prepended by history questions for a fixed window size; after tokenization, the input batch to the baseline retriever has dimensions $(N, L, B)$, where $N$ is the retriever batch size per GPU, $L$ is the max ALBERT sequence length, and $B$ is the token embedding size.

HAR, however, translates each OR-QuAC instance into a collection of sequences, one per history turn, where each sequence contains the first question (for complete context), the turn question, and the current question. Each of these sequences is an input to ALBERT, and their contextualized outputs are recombined to generate the query vector. In order to perform this computation together, we need to include all sequences related to a single instance in a single batch, called an instance-aware batch. A typical baseline batch of shape $(N, L, B)$ thus translates in HAR to $(N, I, L, B)$, where $I$ additionally is the max number of history turns in the instance. For the sake of simplicity, we do not combine examples from multiple instances in the same retriever GPU batch, as implemented in HAM [11].

4.2 History Attentive Query Encoding
We use ALBERT [7] for contextualized question encoding and the ALBERT tokenizer for tokenization. Let $M$ be the max number of history turns and $T$ the max number of tokens in a question; let $q_k$ be the current question and $q_{i<k}$ the $i$-th history question in the history $\{q_i\}_{i=1}^{M}$. Then,
$$\mathbf{I}_i = \text{Tokenizer}([CLS]\ q_1\ [SEP]\ q_i\ [SEP]\ q_k)$$
$$\mathbf{G}_i = \text{ALBERT}(\mathbf{I}_i, \text{PosSeg}[i])$$
where:
- $\mathbf{I}_i \in \mathbb{R}^{(3T+3) \times e}$ is the input embedding matrix for ALBERT for turn $i$, with $e$ the embedding size of the tokenizer. Here, $\mathbf{I}$ is the instance-aware batch.
- $\text{PosSeg}[i] \in \mathbb{R}^{(3T+3) \times e}$ is the positional segment embedding for turn $i$ (discussed in 4.3).
- $\mathbf{G}_i \in \mathbb{R}^{(3T+3) \times e}$ is the contextualized output of ALBERT for turn $i$.
- $\mathbf{s}_i = \mathbf{G}_i[CLS]$ is the contextualized sequence representation corresponding to the CLS token for turn $i$, where $\mathbf{s}_i \in \mathbb{R}^{e}$.
- $\mathbf{T}_i[t] = \mathbf{G}_i^{q_k}[t]$ is the contextualized token representation corresponding to the $t$-th token of $q_k$ w.r.t. $q_i$ and $q_1$, where $\mathbf{T}_i \in \mathbb{R}^{T \times e}$ for turn $i$.
Now, taking $\mathbf{d} \in \mathbb{R}^{e}$ as the learnable attention parameter,
$$\alpha_i = \frac{\exp(\mathbf{d}^{\top} \cdot \mathbf{s}_i)}{\sum_{i'=1}^{M} \exp(\mathbf{d}^{\top} \cdot \mathbf{s}_{i'})}, \quad \alpha_i \in [0, 1]$$
where $\alpha_i$ is the attention weight of turn $i$ over $q_k$. Once we have the attention weights over the history turns, there are two ways to generate the dense query vector $\hat{q}_k \in \mathbb{R}^{e}$; a sketch of both follows this list.
- Fine-grained history selection: as depicted in Figure 1, we apply soft attention at the token level and average the final token representations, i.e.,
$$\hat{Q}_k = \sum_{i=1}^{M} \alpha_i \mathbf{T}_i, \quad \hat{Q}_k \in \mathbb{R}^{T \times e}, \qquad \hat{q}_k = \frac{1}{T} \sum_{t=1}^{T} \hat{Q}_k[t]$$
- Coarse-grained history selection: we apply soft attention at the sequence level using the CLS representations of each turn, i.e.,
$$\hat{q}_k = \sum_{i=1}^{M} \alpha_i \mathbf{s}_i$$
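The following minimal PyTorch sketch mirrors the equations above, with random tensors standing in for the ALBERT outputs $\mathbf{s}_i$ and $\mathbf{T}_i$; it is illustrative, not our training code.

```python
import torch
import torch.nn.functional as F

M, T, e = 4, 32, 128           # history turns, question tokens, embedding size
s = torch.randn(M, e)          # s_i: CLS (sequence-level) representation per turn
T_tok = torch.randn(M, T, e)   # T_i: token representations of q_k per turn
d = torch.randn(e, requires_grad=True)  # learnable attention parameter d

# Soft attention over history turns: alpha_i = softmax_i(d . s_i)
alpha = F.softmax(s @ d, dim=0)                  # (M,), sums to 1

# Fine-grained selection: weight token representations, then average over tokens.
Q_hat = (alpha[:, None, None] * T_tok).sum(0)    # (T, e)
q_fine = Q_hat.mean(0)                           # dense query vector, (e,)

# Coarse-grained selection: weight the CLS representations directly.
q_coarse = (alpha[:, None] * s).sum(0)           # dense query vector, (e,)
```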
Finally, retriever scores are computed using the passage dense vectors $\{p_j\}_{j=1}^{J}$, where $J$ is the total number of candidate passages obtained using the Faiss index retriever, as in the baseline model [9]:
$$\text{retriever score}(\hat{q}_k, p_j) = \hat{q}_k \cdot p_j$$
(Note that the embedding dimension $e$ is generally different from the BERT hidden size $h$, which requires input/output projection linear layers mapping $\mathbb{R}^{T \times h} \Leftrightarrow \mathbb{R}^{T \times e}$; we have ignored this above for the sake of simplicity.)
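As an illustration of this scoring step, the sketch below builds an inner-product Faiss index over randomly generated stand-in vectors; in the actual pipeline the passage vectors are encoded offline, as in the baseline [9].

```python
import numpy as np
import faiss

e, J = 128, 10000  # embedding size, number of candidate passages
passages = np.random.randn(J, e).astype("float32")  # stand-in passage vectors p_j
q_hat = np.random.randn(1, e).astype("float32")     # stand-in dense query vector

index = faiss.IndexFlatIP(e)             # exact inner-product (dot-product) search
index.add(passages)
scores, ids = index.search(q_hat, 100)   # top-100 passages by retriever score
```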
4.3 Positional Segment Embeddings
Qu et al. [11] found that encoding the relative position of a history turn from the current turn during history modelling improves model performance for ConvQA. They introduce an additional positional history answer embedding to augment the BERT input embeddings. In order to test this hypothesis for open retrieval, we encode the relative turn information in the segment (token type) input embeddings that ALBERT uses to differentiate segments in a sequence [7]. We annotate the tokens of the current question $q_k$ with token type identifier 0, and the tokens of the first question $q_1$ and the turn's history question $q_i$ with token type identifier $k - i$. The token type vocabulary size in the ALBERT configuration is set to the max number of history turns $M$, which maps history turn input segments to distinct segment embedding tokens.
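A minimal sketch of this id scheme follows; the helper below is hypothetical (not from our repository) and assumes the two-part sequence layout from the earlier batching sketch, with the history part (first question plus turn question) preceding the current question.

```python
import torch

def relative_token_type_ids(history_len, current_len, k, i, max_len):
    """Hypothetical helper: assign token type id (k - i) to the tokens of the
    first-question + history-turn segment and id 0 to the current question,
    then pad with 0 up to max_len. ALBERT's token type vocabulary size is
    raised to the max number of history turns M to accommodate these ids."""
    ids = [k - i] * history_len + [0] * current_len
    ids += [0] * (max_len - len(ids))
    return torch.tensor(ids[:max_len])

# History turn i = 1 in a dialog whose current turn is k = 3 gets id 2.
print(relative_token_type_ids(history_len=10, current_len=8, k=3, i=1, max_len=32))
```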
5 EXPERIMENTS
5.1 Dataset
We use OR-QuAC [9] as our primary dataset for experiments. OR-QuAC enhances QuAC by adapting it to an open-retrieval setting. It is an aggregation of three existing datasets: (1) the QuAC dataset [1], which offers information-seeking conversations; (2) the CANARD dataset [4], which consists of context-independent rewrites of QuAC questions; and (3) the Wikipedia corpus, which serves as the knowledge source for answering questions. Data statistics of the OR-QuAC dataset are given in Table 1. The dataset can be downloaded from the CIIR page (https://ciir.cs.umass.edu/downloads/ORConvQA/).

Table 1: Data statistics of the OR-QuAC dataset [9]
Items  Train  Dev  Test
5.2 Baseline
We compare our results against the OR-ConvQA system defined in the OR-QuAC paper [9]. The model was run using the hyperparameters defined in Table 2. Additionally, a fixed history window size was used, and only history questions were prepended along with the first question. The checkpoint with the best Reader-F1 metric during evaluation was selected as the best model, and the corresponding test results are reported in Table 3.

5.3 Implementation Details
We implemented our approach for HAR as defined in Section 4, using the OR-QuAC repository (https://github.com/prdwb/orconvqa-release) as our underlying implementation. The retriever checkpoint for pre-training and the passage representations were reused from the OR-ConvQA repository. Both the fine-grained and coarse-grained versions of HAR were implemented and run with positional segment embeddings enabled, using the hyperparameters defined in Table 2 (the same as the baseline, for comparison). Models are trained with 2 NVIDIA TITAN X GPUs: one for training and another for Faiss (https://github.com/facebookresearch/faiss) indexing. Retriever recall is used as the model selection metric during evaluation, and the best model checkpoint is used for testing. Our implementation is located on GitHub (https://github.com/somiltg/orconvqa-release).
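The change to checkpoint selection is small; a sketch with hypothetical per-checkpoint dev metrics (the names and numbers below are invented for illustration):

```python
# Hypothetical dev-set metrics collected during the evaluation phase.
checkpoints = {
    "checkpoint-10000": {"reader_f1": 0.21, "retriever_recall": 0.26},
    "checkpoint-20000": {"reader_f1": 0.24, "retriever_recall": 0.25},
    "checkpoint-30000": {"reader_f1": 0.22, "retriever_recall": 0.29},
}

# Baseline selects on Reader-F1; HAR selects on retriever recall instead.
best_baseline = max(checkpoints, key=lambda c: checkpoints[c]["reader_f1"])
best_har = max(checkpoints, key=lambda c: checkpoints[c]["retriever_recall"])
print(best_baseline, best_har)  # checkpoint-20000 checkpoint-30000
```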
Table 2: Hyperparameter values for the HAR model's concurrent learning and evaluation.
Hyperparameter                           Value
Max (seq, ques, passage, ans) length     512, 125, 384, 40
Train, eval batch sizes per GPU          1, 1
Learning rate                            5e-5
Number of epochs                         3.0
Top k results for (retriever, reader)    100, 5
Max history turns                        11
Max training iterations                  90000
5.4 Evaluation Metrics
We use Mean Reciprocal Rank (MRR) and Recall to evaluate the retrieval performance of the proposed History Attentive Retriever (HAR). The reciprocal rank of a query is the inverse of the rank of the first relevant passage in the retrieved list; MRR is the mean of the reciprocal ranks of all queries. Recall is the fraction of the total number of relevant passages that are retrieved.
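Both metrics are simple to compute; a minimal sketch over two hypothetical queries:

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Inverse rank of the first relevant passage; 0 if none is retrieved."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall(ranked_ids, relevant_ids):
    """Fraction of all relevant passages that appear in the ranked list."""
    return len(set(ranked_ids) & set(relevant_ids)) / len(relevant_ids)

# Hypothetical ranked lists with gold passages {p3} and {p1, p9}.
runs = [(["p7", "p3", "p5"], {"p3"}), (["p1", "p2", "p4"], {"p1", "p9"})]
mrr = sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)  # (1/2 + 1) / 2 = 0.75
avg_recall = sum(recall(r, g) for r, g in runs) / len(runs)    # (1 + 1/2) / 2 = 0.75
```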
Table 3 provides our results on the OR-QuAC test data for the best-performing model checkpoints of the HAR variants and the baseline implementation.

Table 3: Test results of the History Attentive Retriever (HAR) implementations along with the baseline.
Model                            MRR     Recall
OR-ConvQA baseline [9]           0.2166  0.3045
HAR w/ fine-grained attention    0.1995  0.2742
HAR w/ coarse-grained attention  0.1966  0.2812
6 RESULTS AND ANALYSIS
We present an analysis of the test results presented in Table 3.
• Our results show that HAR with coarse-grained attention performs better on Recall than fine-grained attention. A possible hypothesis is that the ALBERT CLS token captures abridged latent information about the complete query better than a token-based representation. The fine-grained representation may additionally capture low-level, unessential details of the query that are not relevant for query-passage interaction. Also, we use the mean to obtain a single dense query vector from the token-level query representation (obtained via soft-attention-weighted averaging of ALBERT outputs across conversation turns). The mean, however, might wash out important latent information or amplify unimportant information in its effort to grant equal weight to each token; a more elaborate aggregation mechanism, such as a linear layer, might be more suitable (a sketch follows this list). Our conclusion that coarse-grained attention performs better is also corroborated by the OR-QuAC [9] results, which likewise use a CLS-based representation for the query vector.
• While there is a noticeable difference in Recall scores, the results on MRR are almost the same for both variants.
This suggests that the position of the topmost relevant document in the results is largely the same for both variants.
• Our HAR approach performs comparably to the baseline but is not able to beat its performance with either variant. One possible reason may be an inappropriate choice of aggregation function for fine-grained attention; a better aggregation might have boosted performance. Another reason may be our reuse of the pre-trained retriever model from the baseline instead of pre-training our own with the attention approach, which might have benefited our method. Finally, the reader model may not be fully compatible with this retriever approach and may need to be fine-tuned differently. However, we have bucketed the retriever and reader models together (and train them with the same backpropagation), which might cause variations in one (say, the reader) to overly penalize the other (say, the retriever).
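The sketch below illustrates the aggregation alternative raised in the first point above: replacing the uniform token mean with a small learned pooling layer. The linear-layer pooling is our assumption for illustration; it is not part of the evaluated models.

```python
import torch
import torch.nn as nn

T, e = 32, 128
Q_hat = torch.randn(T, e)  # attention-weighted token representations (fine-grained)

# Current HAR aggregation: uniform mean, so every token weighs equally.
q_mean = Q_hat.mean(0)

# Hypothetical alternative: learn per-token pooling weights instead.
pool = nn.Linear(e, 1)                                # scores each token
weights = torch.softmax(pool(Q_hat).squeeze(-1), 0)   # (T,), sums to 1
q_learned = (weights[:, None] * Q_hat).sum(0)         # dense query vector, (e,)
```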
7 ABLATION STUDIES
We conduct ablation studies on HAR to assess the importance of each sub-component of our approach. Table 4 provides results for different HAR variations, ablating the positional segment embedding and the soft attention (setting $\alpha_i = 1$) for both the fine- and coarse-grained models.

Table 4: Ablation studies for HAR
Model                                             MRR     Recall
HAR w/ fine-grained attention
HAR w/ coarse-grained attention ($\alpha_i = 1$)  0.1948  0.2773

• Addition of the positional segment embedding improves performance on both metrics for fine-grained HAR, but not for coarse-grained attention. A possible reason is that fine-grained attention works at the token level, where encoding positional turn information may help differentiate tokens across conversational turns during soft-attention weight computation and weighted averaging. For fine-grained attention this difference is amplified because it is computed at the token level, while for coarse-grained attention the CLS token is concerned with capturing an overall sentence representation, so token-level differentiation may not help.
• Soft attention over history turns improves performance on both metrics.
This is clearly shown by the results of fine-grained HAR w/o soft attention, which sets the soft-attention scores to 1 for all conversational turns (see the sketch below). This supports our claim that different history turns contribute differently to the current question and that modelling this can improve retrieval. One can also refer to the analysis in HAM [11], which provides heat maps of topic shift, topic return, and drill-down, important conversational phenomena that are effectively captured by soft-attention-based history modelling.
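For reference, this ablation amounts to a one-line change in the aggregation sketch from Section 4: the softmax weights are replaced with a constant 1 for every turn.

```python
import torch

M, T, e = 4, 32, 128
T_tok = torch.randn(M, T, e)  # stand-in token representations per history turn

# Ablation (alpha_i = 1): every history turn contributes equally, removing the
# learned preference over turns that soft attention provides.
alpha = torch.ones(M)
Q_hat = (alpha[:, None, None] * T_tok).sum(0)  # (T, e)
q_k = Q_hat.mean(0)                            # dense query vector, (e,)
```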
8 CONCLUSION
In this work, we proposed a history attentive retriever mechanism for open-retrieval question answering settings. We used the history attention mechanism introduced in [11] to calculate a dense representation of the query using the history queries asked in the same session. We show that our model can effectively capture the utility of history turns. We conducted extensive experimental evaluations to demonstrate the effectiveness of the history attentive retriever with positional segment embeddings. Possible future work in this direction includes exploring performance with a history-attention-based reader model, modifying retriever pre-training to also train on conversational history, and trying different aggregation approaches with fine-grained attention.
REFERENCES
[1] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question Answering in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (Eds.). Association for Computational Linguistics, 2174–2184. https://doi.org/10.18653/v1/d18-1241
[2] Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2020. TREC CAsT 2019: The Conversational Assistance Track Overview. CoRR abs/2003.13624 (2020). arXiv:2003.13624 https://arxiv.org/abs/2003.13624
[3] Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2019. Multi-step Retriever-Reader Interaction for Scalable Open-domain Question Answering. In 7th International Conference on Learning Representations, ICLR 2019. OpenReview.net. https://openreview.net/forum?id=HkfPSh05K7
[4] Ahmed Elgohary, Denis Peskov, and Jordan Boyd-Graber. 2019. Can You Unpack That? Learning to Rewrite Questions-in-Context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 5918–5924. https://doi.org/10.18653/v1/D19-1605
[5] Somil Gupta, Bhanu Pratap Singh Rawat, and Hong Yu. 2020. Conversational Machine Comprehension: a Literature Review. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, Donia Scott, Núria Bel, and Chengqing Zong (Eds.). International Committee on Computational Linguistics, 2739–2753. https://doi.org/10.18653/v1/2020.coling-main.247
[6] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, 6769–6781. https://doi.org/10.18653/v1/2020.emnlp-main.550
[7] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net. https://openreview.net/forum?id=H1eA7AEtvS
[8] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, Anna Korhonen, David R. Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, 6086–6096. https://doi.org/10.18653/v1/p19-1612
[9] Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W. Bruce Croft, and Mohit Iyyer. 2020. Open-Retrieval Conversational Question Answering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Jul 2020). https://doi.org/10.1145/3397271.3401110
[10] Chen Qu, Liu Yang, Minghui Qiu, W. Bruce Croft, Yongfeng Zhang, and Mohit Iyyer. 2019. BERT with History Answer Embedding for Conversational Question Answering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019, Benjamin Piwowarski, Max Chevalier, Éric Gaussier, Yoelle Maarek, Jian-Yun Nie, and Falk Scholer (Eds.). ACM, 1133–1136. https://doi.org/10.1145/3331184.3331341
[11] Chen Qu, Liu Yang, Minghui Qiu, Yongfeng Zhang, Cen Chen, W. Bruce Croft, and Mohit Iyyer. 2019. Attentive History Selection for Conversational Question Answering. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Nov 2019). https://doi.org/10.1145/3357384.3357905
[12] Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A Conversational Question Answering Challenge. Trans. Assoc. Comput. Linguistics 7 (2019), 249–266.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need
[14] Mark Yatskar. 2019. A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 2318–2323. https://doi.org/10.18653/v1/n19-1241
[15] Yongfeng Zhang, Xu Chen, Qingyao Ai, Liu Yang, and W. Bruce Croft. 2018. Towards Conversational Search and Recommendation: System Ask, User Respond. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (Torino, Italy) (CIKM '18). ACM, New York, NY, USA.