Few-Shot Generative Conversational Query Rewriting
Shi Yu*, Jiahua Liu*, Jingqin Yang, Chenyan Xiong, Paul Bennett, Jianfeng Gao, and Zhiyuan Liu
Tsinghua University; Microsoft Research AI
{yus17, yang-jq17}@mails.tsinghua.edu.cn; [email protected]; {chenyan.xiong, Paul.N.Bennett, jfgao}@microsoft.com; [email protected]
* The first two authors contributed equally.

ABSTRACT
Conversational query rewriting aims to reformulate a concise conversational query into a fully specified, context-independent query that can be effectively handled by existing information retrieval systems. This paper presents a few-shot generative approach to conversational query rewriting. We develop two methods, based on rules and self-supervised learning, to generate weak supervision data using large amounts of ad hoc search sessions, and to fine-tune GPT-2 to rewrite conversational queries. On the TREC Conversational Assistance Track, our weakly supervised GPT-2 rewriter improves the state-of-the-art ranking accuracy by 12%, using only very limited amounts of manual query rewrites. In the zero-shot learning setting, the rewriter still gives comparable results to previous state-of-the-art systems. Our analyses reveal that GPT-2 effectively picks up the task syntax and learns to capture context dependencies, even for hard cases that involve group references and long-turn dependencies.
KEYWORDS
Conversational Search; Query Rewriting; Few-Shot Learning
ACM Reference Format:
Shi Yu, Jiahua Liu, Jingqin Yang, Chenyan Xiong, Paul Bennett, Jianfeng Gao, and Zhiyuan Liu. 2020. Few-Shot Generative Conversational Query Rewriting. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20), July 25–30, 2020, Virtual Event, China.
ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3397271.3401323
1 INTRODUCTION
Recent advances in deep learning and text understanding facilitate the transition of information retrieval systems from keyword-based queries and “ten blue links” to more conversational experiences. Widely viewed as a next-generation IR direction, conversational IR is favored for its ability to satisfy users’ complex information needs with multi-round interactions, while also providing convenient and precise information access through conversational interfaces and portable devices.
Table 1: A Conversational Search Example in TREC CAsT.
Description: The Bronze Age collapse and the transition into a dark age.

Turn  Conversational Queries
1     Q1: Tell me about the Bronze Age collapse.
2     Q2: What is the evidence for it?
3     Q3: What are some of the possible causes?

Turn  Manual Query Rewrites
2     Q2*: What is the evidence for the Bronze Age collapse?
3     Q3*: ... the possible causes of the Bronze Age collapse?

A signature of conversational IR is its multi-round interaction with the user, an opportunity to understand and assist with more complex tasks and a challenge for query understanding. Natural conversations are concise and context dependent: statements refer to previous discussions, omit already-mentioned concepts, and assume implicit context during the conversation. Table 1 shows one such example from TREC Conversational Assistance Track (CAsT) 2019. The user begins with a fully specified query (Q1), but quickly starts to use references (Q2) and omissions (Q3), which is very different from typical keyword-based search sessions.

A natural direction to tackle this challenge is to rewrite the conversational queries into de-contextualized queries that include all necessary information. The manually rewritten queries (Q2* and Q3* in Table 1) can be much better handled by existing ad hoc ranking systems. In TREC CAsT 2019, various approaches were developed for this conversational query rewriting task, including IR-style query expansion/term reweighting, NLP-style coreference resolution, and neural query rewriting. Still, conversational query rewriting is a challenging task: there is a 30%+ NDCG drop for systems that use automatic query rewriting/reformulation, compared with their counterparts using manual rewrites [1].

One top-performing conversational query rewriting system in TREC CAsT is ATeam’s GPT-2 generative query rewriter (a later version can be found in [5]). They feed the previous and current queries in the session (e.g., Q1, Q2, and Q3) into a pre-trained transformer language model [4], and fine-tune the model to generate the fully de-contextualized query rewrite (Q3*). The effectiveness and simplicity of this generative model make it a promising solution for conversational search. However, their GPT-2 was trained using a large quantity of manual query rewrites on their own conversational search queries. It is not clear whether the transformer language model can still be effectively learned without large amounts of manual query rewrite labels, which are expensive to collect and are not always available in many domains [1].

This work studies learning with GPT-2 for conversational query rewriting using few or even zero manual rewriting labels. We propose two approaches that generate weak supervision signals for this task using the ad hoc search sessions abundant in search logs. The first is a rule-based approach that uses two simple rules to omit or co-refer repeated noun phrases in search sessions. The second is a self-supervised learning approach that uses a handful of manually created query rewrites and conversational queries to train a GPT-2 model, as a simplifier, to convert ad hoc search sessions into more context-dependent, conversation-like queries. These approaches provide large amounts of weak supervision data for the GPT-2 rewriter to learn the context dependencies in concise conversational search queries.

In the few-shot setting, where only TREC CAsT’s manual query rewrites of 50 conversational sessions are used, ranking with our query rewrites outperforms the best automatic runs in CAsT 2019 by 12% NDCG@3.
In the zero-shot setting, where no manual query rewrites are used, our weakly supervised GPT-2 still gives comparable results to the previous best automatic run in CAsT.

We further explore the capability of GPT-2 in few-shot learning by fine-tuning it on only a handful of manual query rewrites. We observe that, surprisingly, the pre-trained transformer is able to pick up this task with as few as three conversational sessions. We find that GPT-2 quickly and effectively learns the task syntax: to generate questions instead of stories, and to resolve the context dependencies using previous turns. We also observe that the model accurately deals with hard cases, such as ones containing long-term and multiple coreferences.

2 QUERY REWRITING WITH GPT-2
This section describes the application of GPT-2 to the conversational query rewriting task.
Conversational Query Rewriting.
Conversational search systems aim to find relevant documents for queries in a conversational search session S = {Q_1, ..., Q_k, ..., Q_N} [1]. The conversational queries are often concise, and their information needs are often partially presented in the previous queries.

The conversational query rewriting task is to rewrite a context-dependent query Q_k into a fully de-contextualized query Q'_k, with the help of the previous queries Q_{<k}:

    Q'_k = QueryRewriter(Q_k; Q_{<k}),    (1)

which better reflects the user intent and is easier for ad hoc search.

We use GPT-2 [1, 4] to directly generate the query words {w'_1, ..., w'_i, ..., w'_M} of Q'_k one by one as:

    w'_i = f(w'_{<i}; Q_k, Q_{<k}),    (2)

where f is a transformer decoder and the input is in the format:

    Q_1 ∘ [SEP] ∘ ... ∘ [SEP] ∘ Q_k ∘ [BOS] ∘ [w'_1, ..., w'_{i-1}].    (3)

Both training and inference use standard GPT-2 [4], which is adapted to our task to generate queries instead of plain text [1]. Our code, data, and analysis results are publicly available at https://github.com/thunlp/ConversationQueryRewriter.
In training, the target query Q*_k = {w*_1, ..., w*_m}, either ground-truth labels or weak supervision labels, is used to train the model.
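To make the sequence format of Eq. (3) and the word-by-word decoding of Eq. (2) concrete, here is a minimal sketch using the Hugging Face transformers library. The special-token registration and the greedy generate() settings are our assumptions for illustration; the released code repository has the exact configuration.

```python
# Minimal sketch of the GPT-2 rewriter input format (Eq. 3) and decoding.
# Special-token choices are illustrative, not the paper's exact setup.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
# Register the separator and begin-of-rewrite tokens used in Eq. (3).
tokenizer.add_special_tokens(
    {"sep_token": "[SEP]", "bos_token": "[BOS]", "pad_token": "[PAD]"})
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.resize_token_embeddings(len(tokenizer))  # account for new tokens

def rewrite(previous_queries, current_query, max_new_tokens=50):
    """Generate Q'_k from Q_k and its context Q_{<k} (greedy decoding)."""
    source = " [SEP] ".join(previous_queries + [current_query]) + " [BOS] "
    input_ids = tokenizer.encode(source, return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_length=input_ids.shape[1] + max_new_tokens,
        pad_token_id=tokenizer.pad_token_id,
    )
    # Keep only the tokens generated after [BOS].
    return tokenizer.decode(output_ids[0, input_ids.shape[1]:],
                            skip_special_tokens=True)

print(rewrite(["Tell me about the Bronze Age collapse."],
              "What is the evidence for it?"))
```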
Ranking with Query Rewrites.
With the de-contextualized query rewrite Q'_k, standard ad hoc ranking can be used to complete the conversational search task. We use standard BM25 to retrieve 100 documents and a BERT ranker to rerank them [2, 3].
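As an illustration of this two-stage pipeline, the sketch below uses Pyserini for BM25 retrieval and a transformers sequence classifier as the BERT reranker. The index path and reranker checkpoint are hypothetical placeholders, and the two-class relevance head is an assumption on our part, not the paper's released artifacts.

```python
# Sketch: BM25 retrieves 100 candidates for the rewritten query, then a
# BERT ranker re-scores them. Paths below are placeholders.
import torch
from pyserini.search.lucene import LuceneSearcher
from transformers import AutoModelForSequenceClassification, AutoTokenizer

searcher = LuceneSearcher("path/to/marco_cast_index")  # hypothetical index
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
reranker = AutoModelForSequenceClassification.from_pretrained(
    "path/to/bert-marco-reranker")  # BERT (base) fine-tuned on MS MARCO

def search(rewritten_query, k=100):
    hits = searcher.search(rewritten_query, k=k)
    passages = [searcher.doc(h.docid).raw() for h in hits]
    with torch.no_grad():
        inputs = tokenizer([rewritten_query] * len(passages), passages,
                           truncation=True, padding=True, return_tensors="pt")
        # Assumes a two-class head; take the relevance logit.
        scores = reranker(**inputs).logits[:, 1]
    order = scores.argsort(descending=True)
    return [(hits[i].docid, scores[i].item()) for i in order]
```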
3 WEAK SUPERVISION
One concern with generative query rewriting is that gold query rewrites Q* are expensive to obtain. This section describes how we leverage the ad hoc search sessions, available in search logs, to construct weak supervision data that mimics conversational search sessions with target query rewrites.

As current search engines are still moving towards conversational experiences, a typical ad hoc session is unlikely to include many coreferences or omissions. Users may not expect search engines to resolve context dependencies and tend to write fully specified queries. These fully specified queries, on the other hand, can be used as Q* in the conversational query rewriting task.

We treat ad hoc search sessions as pseudo target query rewrites, S̃* = {Q̃*_1, ..., Q̃*_i, ..., Q̃*_N}, and convert them into conversation-like sessions S̃ = {Q̃_1, ..., Q̃_i, ..., Q̃_N}. The (S̃, S̃*) pairs can then serve as weak supervision to approximate real conversational queries S and manual query rewrites S*. To perform this conversion (S̃* → S̃), we propose two approaches, based on rules and self-supervised learning, respectively.

Rule-Based.
The first approach uses two simple rules to mimic two discourse phenomena in conversations, omission and coreference. We perform the following operations on search sessions (a minimal sketch follows the list):
• Omission. A noun phrase is omitted if it occurs after a preposition and appears in previous queries.
• Coreference. Otherwise, previously appeared singular and plural noun phrases are replaced with "it" (96%), "he" (2%), or "she" (2%), and with "they" (75%) or "them" (25%), respectively.
Both operations can be performed efficiently on vast amounts of sessions.
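Here is a minimal sketch of the two rules, assuming spaCy for noun-phrase and plurality detection; how "occurs after a preposition" is detected is our reading of the rule, not the authors' exact implementation.

```python
# Illustrative implementation of the Omission and Coreference rules.
import random
import spacy

nlp = spacy.load("en_core_web_sm")
SINGULAR = ["it"] * 96 + ["he"] * 2 + ["she"] * 2  # 96%/2%/2%
PLURAL = ["they"] * 75 + ["them"] * 25             # 75%/25%

def simplify_session(session):
    """Convert fully specified queries into conversation-like ones."""
    seen, simplified = set(), []
    for query in session:
        doc, out = nlp(query), query
        for np in doc.noun_chunks:
            if np.text.lower() not in seen:
                continue  # only noun phrases repeated from earlier turns
            if np.start > 0 and doc[np.start - 1].pos_ == "ADP":
                out = out.replace(np.text, "")       # Omission rule
            else:                                    # Coreference rule
                plural = np.root.tag_ in ("NNS", "NNPS")
                out = out.replace(
                    np.text, random.choice(PLURAL if plural else SINGULAR))
        for np in doc.noun_chunks:
            seen.add(np.text.lower())
        simplified.append(" ".join(out.split()))
    return simplified
```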
Self-Learn.
The second approach uses self-supervised learning and trains a GPT-2 model, the query simplifier, to generate the conversation-like sessions S̃ from S̃*. Unlike query rewriting, which aims to “put the context back” into the query, the query simplifier learns to generate contextual queries that contain little of the information already presented in previous queries of the same session. The query simplifier uses a handful of manual query rewrites and learns to simplify the fully specified query into a contextual query:

    Q_k = QuerySimplifier(Q*_k; Q_{<k}).    (4)

Except for reversing the source and target (S* → S), the same GPT-2 setup described in the previous section is used, as sketched below. The query simplifier, trained with a few manual query rewrites, is then applied to the ad hoc search sessions (MS MARCO) to generate more conversation-like sessions (S̃* → S̃).
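Since the simplifier is the rewriter with source and target swapped, building training pairs for the two directions differs only in which string follows [BOS]. A sketch, with helper names of our own choosing:

```python
# Training-pair construction for rewriter vs. simplifier (Eq. 1 vs. Eq. 4).
# Helper names are ours, not from the paper's code.
def make_example(context, source, target):
    """Serialize one training pair in the Eq. (3) sequence format."""
    return " [SEP] ".join(context + [source]) + " [BOS] " + target

def rewriter_example(session, manual_rewrites, k):
    # Learn Q_k -> Q*_k given Q_{<k}: put the context back in.
    return make_example(session[:k], session[k], manual_rewrites[k])

def simplifier_example(session, manual_rewrites, k):
    # Learn Q*_k -> Q_k given Q_{<k}: strip context already in the session.
    return make_example(session[:k], manual_rewrites[k], session[k])
```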
4 EXPERIMENTAL METHODOLOGY
Our experiments use the TREC CAsT 2019 benchmark for evaluation and the ad hoc sessions from MS MARCO for weak supervision.

TREC CAsT Conversational Search Benchmark.
The dataset consists of 50 conversational search sessions S, each containing around ten conversational queries. The task is to retrieve and rank relevant passages for each query in S from the MS MARCO passage collection and the TREC Complex Answer Retrieval corpus. Standard TREC relevance judgments are provided. CAsT provides official manually rewritten queries for the 50 conversational topics [1]. We also manually label answer text for the TREC CAsT questions and evaluate question answering results (the answer labels are available at https://github.com/thunlp/ConversationQueryRewriter).

Table 2: Overall Results on the TREC CAsT 2019 Conversational Search Task. * marks scores from [1]. All our runs use the same ranking model. BLEU-2 is computed against the Oracle queries. QA-ROUGE evaluates the answer quality.

Method                      BLEU-2   NDCG@3   QA-ROUGE
TREC CAsT Auto Runs
  clacBase*                 –        0.360    –
  pgbert*                   –        0.413    –
  CFDA_CLIP_RUN7*           –                 –
CAsT Queries
  Original
  AllenNLP Coref w/o sw     –        0.314    –
  AllenNLP Coref w/ sw
  Oracle                             0.544
Zero-Shot Rewriter
  GPT-2 Raw
  MARCO Raw
  Rule-Based
Few-Shot Rewriter
  Rule-Based + CV w/o PLM
  Self-Learn
  CV
  Rule-Based + CV
  Self-Learn + CV

Evaluation Metrics.
The main metric in CAsT is NDCG@3 averaged over all turns. We also evaluate the similarity between automatic rewrites and the ground truth using BLEU-2, and the question answering results using ROUGE-L.
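For reference, BLEU-2 between an automatic rewrite and its oracle can be computed as below; whitespace tokenization and the smoothing choice are our simplifications, not necessarily the paper's evaluation script.

```python
# BLEU-2 between an automatic rewrite and the oracle rewrite.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def bleu2(oracle, rewrite):
    return sentence_bleu([oracle.lower().split()], rewrite.lower().split(),
                         weights=(0.5, 0.5),  # bigram BLEU
                         smoothing_function=SmoothingFunction().method1)

print(bleu2("What is the evidence for the Bronze Age collapse?",
            "What is the evidence for the Bronze Age collapse?"))  # 1.0
```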
Weak Supervision Dataset and Preprocessing.
The ad hoc search sessions are collected from MS MARCO (https://github.com/microsoft/MSMARCO-Conversational-Search). It includes 152K artificial sessions, with MS MARCO queries automatically aligned to Bing search sessions. We process the DEV sessions to contain more question-like queries by retaining only those with question words, and then convert them into the weak supervision data (Sec. 3).
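The question-word filter can be as simple as the following sketch; the exact word list is our assumption, as the paper does not enumerate it.

```python
# Retain only question-like queries in the ad hoc sessions. The word list
# is our assumption; the paper does not enumerate it.
QUESTION_WORDS = {"what", "when", "where", "which", "who", "whom", "whose",
                  "why", "how", "do", "does", "did", "is", "are", "can",
                  "could", "should", "would"}

def is_question_like(query):
    tokens = query.strip().lower().split()
    return bool(tokens) and tokens[0] in QUESTION_WORDS

def filter_session(session):
    return [q for q in session if is_question_like(q)]
```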
Baselines.
We compare with the following query reformulation baselines. They all use the same ad hoc ranking as ours.
Original uses the original queries from TREC CAsT.
AllenNLP Coref uses the query reformulations (with or without stopwords) provided by CAsT, where AllenNLP is used to resolve coreferences in search sessions.
GPT-2 Raw directly applies the pre-trained GPT-2 for query rewriting without fine-tuning.
MARCO Raw fine-tunes GPT-2 on MS MARCO sessions with a language modeling task instead of the rewriting task.

Figure 1: Performance in Different Scenarios. (a) Different Rules; (b) Conversational Depth. The x-axis in (b) shows turn depths and the y-axis is NDCG@3.
Oracle uses the ground-truth query rewrites provided by CAsT. This is the oracle run and falls into the manual category of CAsT [1].

We also include three automatic runs from CAsT: clacBase, an expert query reformulation system; pgbert, a GPT-2 rewriter trained with external manual labels; and CFDA_CLIP_RUN7, a BERT-based query expansion system. The last two systems achieve the highest ranking accuracy among all automatic runs in CAsT 2019 [1].
Implementation Details.
The query rewriter is initialized with the pre-trained GPT-2 (medium) in Pytorch-Transformers. In the zero-shot setting, only the weak supervision data from the converted MARCO sessions are used to fine-tune GPT-2. We include for comparison the two Raw baselines and our Rule-Based method. In the few-shot setting, we also fine-tune on the manual rewrites via five-fold cross-validation (CV). We split the folds by session, and no testing fold is revealed to model training. Our methods in this setting include Rule-Based + CV w/o PLM, Self-Learn, CV, Rule-Based + CV, and Self-Learn + CV. We refer readers to our code repository for details.

Our GPT-2 uses batch size 2, learning rate 5e-5, and maximum sequence length 150. Fine-tuning on the weak supervision data converges after one epoch. Cross-validation runs until convergence.

The ad hoc ranking uses Anserini BM25 with INQUERY stopword removal. The BERT ranker fine-tunes BERT (base) using only MS MARCO passage ranking labels; the CAsT relevance labels are used only in testing, so our results are directly comparable with the CAsT runs.
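The stated hyperparameters map onto a standard language-model fine-tuning loop, sketched below. Data plumbing is elided, and whether the loss is restricted to the target tokens after [BOS] is a detail we do not know, so this sketch trains on the full sequence.

```python
# Fine-tuning with the stated hyperparameters: batch size 2, lr 5e-5,
# max sequence length 150. Assumes a [PAD] token was registered as in
# the earlier sketch; trains on the full sequence (loss masking to the
# target only is a detail the paper does not specify).
import torch
from torch.utils.data import DataLoader

def fine_tune(model, tokenizer, examples, epochs=1, device="cuda"):
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    loader = DataLoader(examples, batch_size=2, shuffle=True)
    for _ in range(epochs):  # weak supervision converges after one epoch
        for batch in loader:  # batch: list of serialized training strings
            enc = tokenizer(list(batch), truncation=True, max_length=150,
                            padding=True, return_tensors="pt").to(device)
            # Standard LM objective; padding positions ignored via -100.
            labels = enc["input_ids"].masked_fill(
                enc["attention_mask"] == 0, -100)
            loss = model(**enc, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```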
5 EVALUATION RESULTS
This section evaluates the effectiveness of our query rewriter in conversational search and analyzes the behavior of GPT-2.
The overall results on TREC CAsT are presented in Table 2. As expected, the concise and context-dependent nature of conversational search challenges existing ad hoc ranking and coreference resolution systems: there is a significant gap between the Original or AllenNLP Coref queries and the manual Oracle queries. However, the gap is substantially narrowed by our GPT-2 query rewriter.

In the few-shot setting, GPT-2 trained with CV already outperforms the best CAsT auto runs, pgbert and CFDA. Together with weak supervision data, Rule-Based + CV or Self-Learn + CV improves the state of the art by 10+%. The improvement is mainly attributed to better query rewriting: our simple BERT (base) ranker, when using the Oracle queries, is less effective than the pgbert and CFDA teams’ manual runs; they obtained 0.57+ NDCG@3, compared to our 0.544 [1]. The BLEU scores correlate well with NDCG: better query rewriting leads to better search accuracy. Our query rewriter also maintains stable accuracy in later turns, as shown in Fig. 1b, which indicates that it effectively captures the multi-turn context as the conversation proceeds.

Surprisingly, GPT-2 (CV) provides effective rewrites when only cross-validated on the 50 CAsT sessions; Rule-Based, in the zero-shot setting, is on par with the best TREC CAsT automatic runs (Fig. 1a shows the individual effectiveness of the rules). In comparison, directly applying GPT-2 (GPT-2 Raw) or fine-tuning only on ad hoc sessions (MARCO Raw) yields sub-par results. It is impressive that the pre-trained transformer can learn conversational query rewriting, a challenging task for previous techniques, in such a data-efficient manner.

Figure 2: Performance of GPT-2 with different amounts of fine-tuning: conversational sessions with manual rewrites (a) and fine-tuning steps (b). The y-axes show the corresponding metrics.
This section further investigates GPT-2’s generalization capability.
How Few Shot?
Fig. 2a shows GPT-2 fine-tuned with fewer sessions, with or without weak supervision. Remarkably, GPT-2 learns to generate reasonable query rewrites with only three conversational sessions, or 30 manual labels; it matches the best CAsT auto runs with as few as 10 sessions.
What is Learned?
It is unlikely that GPT-2 learns the discourse phenomena from just three sessions. They are more likely captured during pre-training, since the non-pre-trained GPT-2 does not substantially outperform random guessing, as shown in Table 2.

We hypothesize that GPT-2 only needs to learn the “syntax” of the rewriting task during fine-tuning: to generate questions, and to replace pronouns with, or add, concepts mentioned in previous turns. Fig. 2b plots the fraction of questions (QueFrac) in the GPT-2 (CV) rewrites, indicated by question words, and the percentage of new words copied from previous queries (CopyFrac), at different fine-tuning steps. GPT-2 adapts to query rewriting very quickly, with very little fine-tuning. Our effectiveness perhaps comes more from properly “unleashing” the language understanding power already present in the pre-trained language model.
Table 3: GPT-2 Query Rewrites on CAsT Topics 31 and 64.

Q1: What causes throat cancer?
Q2: What is the first sign of it?
Q3: Is it the same as esophageal cancer?
Q4: What’s the difference in their symptoms?
Oracle: What’s the difference in throat cancer and esophageal cancer’s symptoms?
Output: What’s the difference between throat cancer and esophageal cancer?

Q1: What are the types of pork ribs?
Q2: What are baby backs?
Q3: What are the differences with spareribs?
Q4: What are ways to cook them?
Q5: How about on the bbq?
Oracle: How do you cook pork ribs on the bbq?
Output: How about on the bbq?

Table 3 provides two examples from GPT-2 (Rule-Based + CV). We found it surprising that, in the first case, GPT-2 accurately resolves the group coreference from “their” to the two cancer types, one of which appeared three turns earlier. The second example shows a common error made by our rewriter: it fails to add the proper context, perhaps because it is not clear what context the term “about” refers to. In our manual analyses, we found that GPT-2’s errors are more often due to missing complete contexts than to adding false information.
6 CONCLUSION
This work demonstrates the effectiveness of GPT-2 for conversational query rewriting. Fine-tuned using weak supervision data generated by rules or from a handful of manual rewriting labels, our GPT-2 query rewriter creates a new state of the art on the TREC CAsT conversational search benchmark, outperforming previous methods including query expansion, contextual ranking, and coreference resolution, many of which use large-scale pre-trained models and deep neural networks.
ACKNOWLEDGEMENTS
This work is supported by the National Key Research and Development Program of China (No. 2018YFB1004503) and the National Natural Science Foundation of China (NSFC No. 61732008, 61532010).
REFERENCES
[1] Jeff Dalton, Chenyan Xiong, and Jamie Callan. 2019. CAsT 2019: The Conversational Assistance Track Overview. In TREC 2019. NIST.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL 2019.
[3] Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. ArXiv abs/1901.04085 (2019).
[4] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019).
[5] Svitlana Vakulenko, Shayne Longpre, Zhucheng Tu, and Raviteja Anantha. 2020. Question Rewriting for Conversational Question Answering.