Conversational Question Answering over Passages by Leveraging Word Proximity Networks
Magdalena Kaiser
Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, [email protected]
Rishiraj Saha Roy
Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, [email protected]
Gerhard Weikum
Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, [email protected]
ABSTRACT
Question answering (QA) over text passages is a problem of long-standing interest in information retrieval. Recently, the conversational setting has attracted attention, where a user asks a sequence of questions to satisfy her information needs around a topic. While this setup is a natural one and similar to humans conversing with each other, it introduces two key research challenges: understanding the context left implicit by the user in follow-up questions, and dealing with ad hoc question formulations. In this work, we demonstrate Crown (Conversational passage ranking by Reasoning Over Word Networks): an unsupervised yet effective system for conversational QA with passage responses, that supports several modes of context propagation over multiple turns. To this end, Crown first builds a word proximity network (WPN) from large corpora to store statistically significant term co-occurrences. At answering time, passages are ranked by a combination of their similarity to the question, and coherence of query terms within: these factors are measured by reading off node and edge weights from the WPN. Crown provides an interface that is both intuitive for end-users, and insightful for experts for reconfiguration to individual setups. Crown was evaluated on TREC CAsT data, where it achieved above-median performance in a pool of neural methods.

CCS CONCEPTS

• Information systems → Question answering.

KEYWORDS
Conversational Search, Conversational Question Answering, Passage Ranking, Word Networks
ACM Reference Format:
Magdalena Kaiser, Rishiraj Saha Roy, and Gerhard Weikum. 2020. Conversational Question Answering over Passages by Leveraging Word Proximity Networks. In
SIGIR ’20: July 25–30, 2020, Xi’an, China. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SIGIR ’20, July 25–30, 2020, Xi’an, China
© 2020 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
Motivation.
Question answering (QA) systems [13] return direct answers to natural language queries, in contrast to the standard practice of document responses. These crisp answers are aimed at reducing users’ effort in searching for relevant information, and may be in the form of short text passages [5], sentences [19], phrases [3], or entities from a knowledge graph [10]. In this work, we deal with passages: such passage retrieval [17] has long been an area of focus for research in information retrieval (IR), and is tightly coupled with traditional text-based QA [18]. Passages are one of the most flexible answering modes, being able to satisfy both objective (factoid) and subjective (non-factoid) information needs succinctly.

Of late, the rise of voice-based personal assistants [6] like Siri, Cortana, Alexa, or the Google Assistant has drawn attention to the scenario of conversational question answering (ConvQA) [4, 15]. Here, the user, instead of a one-off query, fires a series of questions to the system on a topic of interest. Effective passage retrieval often holds the key to satisfying such responses, as short passages or paragraphs are often the most that can be spoken out loud, or displayed in limited screen real estate, without sacrificing coherence. The main research challenge brought about by this shift to the conversational paradigm is to resolve the unspoken context of follow-up questions. Consider our running example conversation below, which is a mix of factoid (turns 1, 2 and 3) and non-factoid questions (turns 4 and 5). Answers shown are excerpts from top paragraphs retrieved by an ideal passage-based ConvQA system.
Question (Turn 1): when did nolan make his batman movies?
Answer:
Nolan launched one of the Dark Knight’s most successful eras with Batman Begins in 2005, The Dark Knight in 2008, and the final part of the trilogy, The Dark Knight Rises, in 2012.
Question (Turn 2): who played the role of alfred?
Answer: ... a returning cast: Michael Caine as Alfred Pennyworth...
Question (Turn 3): and what about harvey dent?
Answer:
The Dark Knight featured Aaron Eckhart as Harvey Dent.
Question (Turn 4): how was the box office reception?
Answer:
The Dark Knight earned $… million in North America and $… million in other territories, for a worldwide total of $… billion.

Question (Turn 5): compared to Batman v Superman?
Answer:
Outside of Christopher Nolan’s two Dark Knight movies, Batman v Superman is the highest-grossing property in DC’s bullpen.
This canonical conversation illustrates implicit context in follow-up questions. In turn 2, the role of alfred refers to the one in Nolan’s Batman trilogy; in turn 3, what about refers to the actor playing the role of Harvey Dent (the Batman movies remain as additional context all through turn 5); in turn 5, compared to alludes to the box office reception as the point of comparison. Thus, ConvQA is far more than coreference resolution and question completion [12].
Relevance.
Conversational QA lies under the general umbrella of conversational search, which is of notable contemporary interest in the IR community. This is evident through recent forums like the TREC Conversational Assistance Track (CAsT) [7], the Dagstuhl Seminar on Conversational Search [1], and the ConvERSe workshop at WSDM 2020 on Conversational Systems for E-Commerce. Our proposal Crown was originally a submission to TREC CAsT, where it achieved above-median performance on the track’s evaluation data and outperformed several neural methods.

Approach and contribution.
Motivated by the lack of a substantial volume of training data for this novel task, and the goal of devising a lightweight and efficient system, we developed our unsupervised method Crown (Conversational passage ranking by Reasoning Over Word Networks) that relies on the flexibility of weighted graph-based models. Crown first builds a backbone graph referred to as a Word Proximity Network (WPN) that stores word association scores estimated from large passage corpora like MS MARCO [14] or TREC CAR [8]. Passages from a baseline model like Indri are then re-ranked according to their similarity weights to question terms (represented as node weights in the WPN), while preferring those passages that contain term pairs deemed significant from the WPN, close by. Such coherence is determined by using edge weights from the WPN.
Context is propagated by various models of (decayed) weighting of words from previous turns. Crown enables conversational QA over passage corpora in a clean and intuitive UI, with interactive response times. As far as we know, this is the first public and open-source demo for ConvQA over passages. All our material is publicly available at:
• Online demo: https://crown.mpi-inf.mpg.de/
• Walkthrough video: http://qa.mpi-inf.mpg.de/crownvideo.mp4
• Code: https://github.com/magkai/CROWN
Word co-occurrence networks built from large corpora have been widely studied [2, 9], and applied in many areas, like query intent analysis [16]. In such networks, nodes are distinct words, and edges represent significant co-occurrences between words in the same sentence. In Crown, we use proximity within a context window, and not simple co-occurrence: hence the term Word Proximity Network (WPN). The intuition behind the WPN construction is to measure the coherence of a passage w.r.t. a question, where we define coherence by words appearing in close proximity, computed in pairs. We want to limit such word pairs to only those that matter, i.e., have been observed significantly many times in large corpora. This is the information stored in the WPN.

Figure 1: Sample conversation and word proximity network.

Here, we use NPMI (normalized Pointwise Mutual Information) for word association:

npmi(x, y) = log[ p(x, y) / (p(x) · p(y)) ] / (− log p(x, y))

where p(x, y) is the joint probability distribution and p(x), p(y) are the individual unigram distributions of words x and y (no stopwords considered). The NPMI value is used as the edge weight between nodes that are similar to conversational query tokens (Sec. 2.2). Node weights measure the similarity between conversational query tokens and WPN nodes appearing in the passage.

Fig. 1 shows the first three turns of our running example, together with the associated fragment (possibly disconnected, as irrelevant edges are not shown) from the WPN. Matching colors indicate which of the query words is closest to that in the corresponding passage. For example, nolan has a direct match in the first turn, giving it a node weight (compared using word2vec cosine similarity) of 1.0. If this similarity is below a threshold, then the corresponding node will not be considered further (caine or financial, greyed out). Edge weights are shown as edge labels, considered only if they exceed a certain threshold: for instance, the pairs (batman, movie) and (harvey, dent), with NPMI ≥ 0.7, qualify here. These edges are highlighted in orange. Edges like (financial, success) with weight above the threshold are not considered, as they are irrelevant to the input question (low node weights), even though they appear in the given passages.

To propagate context, Crown expands the query at a given turn T using three possible strategies to form a conversational query cq. cq is constructed from the previous query turns q_t (possibly weighted with w_t) seen so far:
• Strategy cq1 simply concatenates the current query q_T and q_1. No weights are used.
• cq2 concatenates q_T, q_{T−1} and q_1, where each component has a weight: w_1 = 1.0, w_T = 1.0, w_{T−1} = (T−1)/T.
• cq3 concatenates all previous turns with decaying weights (w_t = t/T), except for the first and the current turns (w_1 = 1.0, w_T = 1.0).

This cq is first passed through Indri to retrieve a set of candidate passages, and then used for re-ranking these candidates (Sec. 2.3). The final score of a passage P_i consists of several components that will be described in the following text.
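The WPN construction above can be sketched in a few lines. This is a toy version that stores the network in plain dictionaries and estimates probabilities from a handful of passages; the real system uses NetworkX and large corpora like MS MARCO, and the whitespace tokenization and window size here are simplifications:

```python
import math
from collections import Counter

def build_wpn(passages, window=3):
    """Toy Word Proximity Network: nodes are words; an edge carries the
    NPMI of a word pair co-occurring within the context window."""
    unigrams, pairs = Counter(), Counter()
    total_tokens, total_pairs = 0, 0
    for passage in passages:
        tokens = passage.lower().split()
        total_tokens += len(tokens)
        for i, w in enumerate(tokens):
            unigrams[w] += 1
            for v in tokens[i + 1 : i + 1 + window]:
                pairs[tuple(sorted((w, v)))] += 1
                total_pairs += 1
    wpn = {}
    for (x, y), n_xy in pairs.items():
        p_xy = n_xy / total_pairs
        p_x, p_y = unigrams[x] / total_tokens, unigrams[y] / total_tokens
        # NPMI = PMI / (-log p(x, y)), normalized into [-1, 1]
        wpn[(x, y)] = math.log(p_xy / (p_x * p_y)) / (-math.log(p_xy))
    return wpn
```

Edges whose NPMI exceeds the threshold β then mark significant pairs; pairs that co-occur only by chance get NPMI near zero.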
Estimating similarity.

Similarity is computed using node weights:

score_node(P_i) = Σ_{j=1}^{n} 1_{C(p_ij)} · max_{cq_k ∈ q_t, q_t ∈ cq} ( sim(vec(p_ij), vec(cq_k)) · w_t )

where 1_{C(p_ij)} is 1 if the condition C(p_ij) is satisfied, else 0 (see below for a definition of C). vec(p_ij) is the word2vec vector of the j-th token in the i-th passage; vec(cq_k) is the corresponding vector of the k-th token in the conversational query cq, and w_t is the weight of the turn in which the k-th token appeared; sim denotes the cosine similarity between the passage token and the query token embeddings. C(p_ij) is defined as

C(p_ij) := ∃ cq_k ∈ cq : sim(vec(p_ij), vec(cq_k)) > α

which means that condition C is only fulfilled if the similarity between a query and a passage word is above a threshold α.
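A minimal sketch of the node score, with tiny hand-made vectors standing in for the word2vec embeddings (the vectors, tokens, and threshold value below are illustrative only):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def node_score(passage_tokens, conv_query, embeddings, alpha=0.7):
    """score_node: a passage token contributes its best turn-weighted
    similarity, but only if some query token matches it above alpha."""
    score = 0.0
    for p_tok in passage_tokens:
        if p_tok not in embeddings:
            continue
        sims = [(cosine(embeddings[p_tok], embeddings[q_tok]), w_t)
                for q_tok, w_t in conv_query if q_tok in embeddings]
        if any(sim > alpha for sim, _ in sims):   # condition C(p_ij)
            score += max(sim * w_t for sim, w_t in sims)
    return score
```

Here conv_query is a list of (token, turn-weight) pairs as produced by one of the expansion strategies cq1–cq3.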
Estimating coherence.

Coherence is calculated using edge weights:

score_edge(P_i) = Σ_{j=1}^{n} Σ_{k=j+1}^{j+W} 1_{C1(p_ij, p_ik), C2(p_ij, p_ik)} · NPMI(p_ij, p_ik)

C1(p_ij, p_ik) := hasEdge(p_ij, p_ik) ∧ NPMI(p_ij, p_ik) > β

C2(p_ij, p_ik) := ∃ cq_r, cq_s ∈ cq : sim(vec(p_ij), vec(cq_r)) > α ∧ sim(vec(p_ik), vec(cq_s)) > α ∧ cq_r ≠ cq_s ∧ ∄ cq_r', cq_s' ∈ cq : sim(vec(p_ij), vec(cq_r')) > sim(vec(p_ij), vec(cq_r)) ∨ sim(vec(p_ik), vec(cq_s')) > sim(vec(p_ik), vec(cq_s))

C1 ensures that there is an edge between the two tokens in the WPN, with edge weight > β. C2 states that there are two distinct words in cq, where one is the word most similar to p_ij, and the other is the word most similar to p_ik. The context window size W is set to three.
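The coherence score can be sketched similarly; conditions C1 and C2 appear as the two guards inside the loop. As before, the embeddings and thresholds are toy stand-ins, not the system's actual values:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def edge_score(passage_tokens, query_tokens, wpn, embeddings,
               alpha=0.7, beta=0.1, window=3):
    """score_edge: sum NPMI weights of passage token pairs that form a
    significant WPN edge (C1) and whose best query matches differ (C2)."""
    def best_match(p_tok):
        if p_tok not in embeddings:
            return 0.0, None
        cands = [(cosine(embeddings[p_tok], embeddings[q]), q)
                 for q in query_tokens if q in embeddings]
        return max(cands) if cands else (0.0, None)

    score = 0.0
    for j in range(len(passage_tokens)):
        for k in range(j + 1, min(j + 1 + window, len(passage_tokens))):
            pair = tuple(sorted((passage_tokens[j], passage_tokens[k])))
            npmi = wpn.get(pair, 0.0)
            if npmi <= beta:                                   # condition C1
                continue
            (sim_j, q_j), (sim_k, q_k) = best_match(pair[0]), best_match(pair[1])
            if sim_j > alpha and sim_k > alpha and q_j != q_k:  # condition C2
                score += npmi
    return score
```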
Estimating positions.

Passages with relevant sentences appearing earlier should be preferred. The position score of a passage is defined as:

score_pos(P_i) = max_{s_j ∈ P_i} ( (1/j) · (score_node(P_i)[s_j] + score_edge(P_i)[s_j]) )

where s_j is the j-th sentence in passage P_i and score_node(P_i)[s_j] is the node score for the sentence s_j in P_i.
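The position score reduces to a one-liner, assuming a 1/j damping on the j-th sentence, consistent with the stated preference for earlier sentences:

```python
def position_score(sentence_scores):
    """score_pos: max over sentences of (node + edge score restricted to
    that sentence), damped by its 1-indexed position j."""
    return max(s / j for j, s in enumerate(sentence_scores, start=1))
```

For example, a passage whose per-sentence scores are [1.0, 3.0, 3.0] is credited for its strong second sentence (3.0/2) rather than its weak opener.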
Estimating priors.

We also consider the original ranking from Indri, which can often be very useful:

score_indri(P_i) = 1 / rank(P_i)

where rank(P_i) is the rank that the passage P_i received from Indri.
Putting it together.

The final score for a passage P_i consists of a weighted sum of these four individual scores:

score(P_i) = h1 · score_indri(P_i) + h2 · score_node(P_i) + h3 · score_edge(P_i) + h4 · score_pos(P_i)

where h1, h2, h3 and h4 are hyperparameters tuned on TREC CAsT data. The detailed method and the evaluation results of Crown are available in our TREC report [11]. General information about CAsT can be found in the TREC overview report [7].
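Combining the four signals is then a plain weighted sum; the hyperparameter values below are placeholders for illustration, not the tuned TREC CAsT settings:

```python
def final_score(indri_rank, s_node, s_edge, s_pos,
                h=(0.25, 0.35, 0.25, 0.15)):   # placeholder weights h1..h4
    """Combine the Indri prior (reciprocal rank) with the node, edge,
    and position scores into one passage score."""
    h1, h2, h3, h4 = h
    return h1 * (1.0 / indri_rank) + h2 * s_node + h3 * s_edge + h4 * s_pos
```

Candidate passages from Indri are simply sorted by this value in descending order to produce the final ranking.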
Figure 2: Overview of the Crown architecture.

An overview of our system architecture is shown in Fig. 2. The demo consists of a frontend and a backend, connected via a RESTful API.
Frontend.
The frontend has been created using the Javascript library React. There are four main panels: the search panel, the panel containing the sample conversation, the results panel, and the advanced options panel. Once the user presses the answer button, their current question, along with the conversation history accumulated so far, and the set of parameters, are sent to the backend. A detailed walkthrough of the UI will be presented in Sec. 4.
Backend.
The answering request is sent via JSON to a Python Flask app, which works in a multi-threaded way to be able to serve multiple users. It forwards the request to a new Crown instance, which computes the results as described in Sec. 2. The Flask app sends the result back to the frontend via JSON, where it is displayed on the results panel.
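The request/response contract can be sketched as a plain function over JSON strings; the field names below are illustrative assumptions, not the actual Crown API:

```python
import json

def handle_answer_request(payload_json, rank_fn):
    """Backend sketch: parse the posted question, conversation history,
    and parameter settings, delegate ranking to rank_fn, and serialize
    the ranked passages back to JSON. Field names are hypothetical."""
    payload = json.loads(payload_json)
    question = payload["question"]
    history = payload.get("history", [])
    params = payload.get("parameters", {})
    return json.dumps({"passages": rank_fn(question, history, params)})
```

In the actual demo, a Flask route would wrap a function like this, spawning one Crown instance per request.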
Implementation Details.
The demo requires ≃170 GB of disk space and ≃20 GB of memory. The frontend is in Javascript, and the backend is in Python. We used pre-trained word2vec embeddings that were obtained via the Python library gensim (https://radimrehurek.com/gensim/). The Python library spaCy (https://spacy.io/) has been used for tokenization and stopword removal. As previously mentioned, Indri has been used for candidate passage retrieval. For graph processing, we used the Python library NetworkX (https://networkx.github.io/).
Answering questions.

We will guide the reader through our demo using our running example conversation from Sec. 1 (Fig. 3).
Figure 3: Conversation serving as our running example.
The demo is available at https://crown.mpi-inf.mpg.de. One can start by typing a new question into the search bar and pressing Answer, or by clicking Answer Sample to quickly get the system responses for the running example.
Figure 4: Search bar and rank-1 answer snippet at turn 1.
Fig. 4 shows an excerpt from the top-ranked passage for this first question (when did nolan make his batman movies?), which clearly satisfies the information need posed in this turn. For quick navigation to pertinent parts of large passages, we highlight up to three sentences from the passage (the number is determined by passage length) that have the highest relevance (again, a combination of similarity, coherence, and position) to the conversational query. In addition, important keywords are in bold: these are the top-scoring nodes from the WPN at this turn.

The search results are displayed in the answer panel below the search bar. In the default setting, the top-3 passages for a query are displayed. Let us go ahead and type the next question (who played the role of alfred?) into the input box, and explore the results (Fig. 5). Again, we find that the relevant nugget of information (... Michael Caine as Alfred Pennyworth...) is present in the very first passage. We can understand the implicit context in ConvQA from this turn, as the user does not need to specify that the role sought after is from Nolan's Batman movies. The top nodes and edges from the WPN are shown just after the passage id from the corpus: nodes like batman and nolan, and edges like (batman, role). These contribute to the interpretability of the system for the end-user, and help in debugging for the developer. We now move on to the third turn: and what about harvey dent?, as shown in Fig. 6. Here, the context is even more implicit, and the complete intent of role in nolan's batman movies is left unspecified. The answer is located at rank three now (see video).

Figure 5: Top-1 answer passage for question in turn 2.
Figure 6: Answer at the 3rd-ranked passage for turn 3.
Similarly, we can proceed with the next two turns. The result for the current question is always shown on top, while answers for previous turns do not get replaced but are shifted further down for easy reference. In this way, a stream of (question, answer) passages is created. Passages are displayed along with their id and the top nodes and edges found by Crown. In the example from Fig. 5, not only alfred and role but also batman and nolan, which have been mentioned in the previous turn, are among the top nodes.
Clearing the buffer.
If users now want to initiate a new conversation, they can press the Clear All button. This will remove all displayed answers and clear the conversation history. In case users just want to delete their previous question (and the response), they can use the Clear Last button. This is especially helpful when exploring the effect of the configurable parameters on responses at a given turn.
Figure 7: Advanced options for an expert user.

Advanced options.
An expert user can change several Crown parameters, as illustrated in Fig. 7. The first two are straightforward: the number of top passages to display, and the number to fetch from the underlying Indri model. The node weight threshold α (Sec. 2.3) can be tuned depending on the level of approximate matching desired: the higher the threshold, the more exact matches are preferred. The edge weight threshold β is connected to the level of statistical significance of the word association measure used: the higher the threshold, the more significant the term pair is constrained to be. Tuning these thresholds is constrained to fixed ranges, so as to preclude accidentally introducing a large amount of noise into the system.

The conversational query model should be selected depending upon the nature of the conversation. If all questions are on the same topic of interest indicated by the first question, then the intermediate turns are not so important (select current+first turns). On the other hand, if the user keeps drifting from concept to concept through the course of the conversation, then the current and the previous turn should be preferred (select current+previous+first turns). If the actual scenario is a mix of the two, select all turns with proportionate weights. The first two settings may be referred to as star and chain conversations, respectively [4]. Finally, the relative weights (hyperparameters) of the four ranking criteria can be configured freely within fixed ranges. Such changes in options are reflected immediately when a new question is asked. Default values have been tuned on TREC CAsT 2019 training samples. Restore Defaults will reset values back to their defaults. A brief description summarizes our contribution (Fig. 8).

Figure 8: A summarizing description of Crown.
We demonstrated Crown, one of the first prototypes for unsupervised conversational question answering over text passages. Crown resolves implicit context in follow-up questions by expanding the current query with keywords from previous turns, and uses this new conversational query for scoring passages using a weighted combination of similarity, coherence, and positions of approximate matches of query terms. In terms of empirical performance, Crown scored above the median in the Conversational Assistance Track at TREC 2019, being comparable to several neural methods. The presented demo is lightweight and efficient, as evident in its interactive response rates. The clean UI design makes it easily accessible for first-time users, but contains enough configurable parameters so that experts can tune Crown to their own setups.

A very promising extension is to incorporate answer passages as additional context to expand follow-up questions, as users often formulate their next questions by picking up cues from the responses shown to them. Future work would also incorporate fine-tuned BERT embeddings and corpora with more information coverage.
REFERENCES
[1] Avishek Anand, Lawrence Cavedon, Hideo Joho, Mark Sanderson, and Benno Stein. 2020. Conversational Search (Dagstuhl Seminar 19461). Dagstuhl Reports 9, 11 (2020).
[2] Ramon Ferrer i Cancho and Richard V. Solé. 2001. The small world of human language. Proceedings of the Royal Society of London. Series B: Biological Sciences.
[3] […]. In EMNLP.
[4] Philipp Christmann, Rishiraj Saha Roy, Abdalghani Abujabal, Jyotsna Singh, and Gerhard Weikum. 2019. Look before you Hop: Conversational Question Answering over Knowledge Graphs Using Judicious Context Expansion. In CIKM.
[5] Daniel Cohen, Liu Yang, and W. Bruce Croft. 2018. WikiPassageQA: A benchmark collection for research on non-factoid answer passage retrieval. In SIGIR.
[6] Paul A. Crook, Alex Marin, Vipul Agarwal, Samantha Anderson, Ohyoung Jang, Aliasgar Lanewala, Karthik Tangirala, and Imed Zitouni. 2018. Conversational semantic search: Looking beyond Web search, Q&A and dialog systems. In WSDM.
[7] Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2019. CAsT 2019: The Conversational Assistance Track Overview. In TREC.
[8] Laura Dietz, Manisha Verma, Filip Radlinski, and Nick Craswell. 2017. TREC Complex Answer Retrieval Overview. In TREC.
[9] Sergey N. Dorogovtsev and José Fernando F. Mendes. 2001. Language as an evolving word web. Proceedings of the Royal Society of London. Series B: Biological Sciences.
[10] […]. In WSDM.
[11] Magdalena Kaiser, Rishiraj Saha Roy, and Gerhard Weikum. 2019. CROWN: Conversational Passage Ranking by Reasoning over Word Networks. In TREC.
[12] Vineet Kumar and Sachindra Joshi. 2017. Incomplete follow-up question resolution using retrieval based sequence to sequence learning. In SIGIR.
[13] Xiaolu Lu, Soumajit Pramanik, Rishiraj Saha Roy, Abdalghani Abujabal, Yafang Wang, and Gerhard Weikum. 2019. Answering Complex Questions by Joining Multi-Document Evidence with Quasi Knowledge Graphs. In SIGIR.
[14] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human-Generated MAchine Reading COmprehension Dataset. In NeurIPS.
[15] Chen Qu, Liu Yang, Minghui Qiu, Yongfeng Zhang, Cen Chen, W. Bruce Croft, and Mohit Iyyer. 2019. Attentive History Selection for Conversational Question Answering. In CIKM.
[16] Rishiraj Saha Roy, Niloy Ganguly, Monojit Choudhury, and Naveen Kumar Singh. 2011. Complex network analysis reveals kernel-periphery structure in Web search queries. In QRU (SIGIR Workshop).
[17] Gerard Salton, James Allan, and Chris Buckley. 1993. Approaches to passage retrieval in full text information systems. In SIGIR.
[18] Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. 2003. Quantitative evaluation of passage retrieval algorithms for question answering. In SIGIR.
[19] Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In EMNLP.