Supporting search engines with knowledge and context
Supporting Search Engines with Knowledge and Context

Nikos Voskarides

Academic dissertation (Academisch Proefschrift) to obtain the degree of doctor at the Universiteit van Amsterdam, on the authority of the Rector Magnificus, prof. dr. ir. K.I.J. Maex, before a committee appointed by the College voor Promoties, to be defended in public in the Agnietenkapel on Friday, 5 February 2021, at 16:00, by Nikos Voskarides, born in Lefkosia.

Doctoral committee (Promotiecommissie)

Promotor: prof. dr. Maarten de Rijke, Universiteit van Amsterdam
Copromotor: dr. Edgar J. Meij, Bloomberg
Other members: prof. dr. Paul Groth, Universiteit van Amsterdam; prof. dr. Evangelos Kanoulas, Universiteit van Amsterdam; dr. Maarten Marx, Universiteit van Amsterdam; prof. dr. Simone Teufel, University of Cambridge; dr. Suzan Verberne, Universiteit Leiden
Faculteit der Natuurwetenschappen, Wiskunde en Informatica

The research was supported by the Netherlands Organisation for Scientific Research (NWO) under project number CI-14-25.
Copyright © 2021 Nikos Voskarides, Amsterdam, The Netherlands
Cover by Evros Voskarides
Printed by Ipskamp Printing, Amsterdam
ISBN: 978-94-6421-207-5

Acknowledgements
Maarten, your supervision has been transformative. Thank you for caring and for pushing me to become better.

Edgar, you have been a source of energy. Thank you for believing in me.

ILPS is a fantastic research group, mainly because of Maarten's vision and consistency. Mano, thank you for sharing your enthusiasm and for making sure I'd find my way. Petra, thank you for going above and beyond for the group. Ridho, for being my knowledge graph buddy. Katya, for having an alternative viewpoint. Marzieh, for reminding me of my roots. Anne, Daan, for encouraging me. Vangeli, for being unconventional. Harrie, for being a good listener. Ana, for speaking up. Ke, for sharing your passion. Rolf, for being punctual. Dan, for introducing me to the Guqin. Christophe, for setting the bar high. Sabrina, for seeing my work through a different lens. Bob, Chang, David, Hamid, Hosein, Ilya, Isaac, Maartje, Marlies, Mostafa, Pengjie, Tom, Wouter, Zhaochun, Ziming, and all the other ILPSers I worked with over the years, thank you for the fun memories.

I spent most of my PhD studies in Amsterdam. Giorgo, your help has been unmeasurable; thank you for sharing the burdens and multiplying the joys. Mario, thank you for not compromising. You both have been there for me in happy and rough times. Achillea, Eleni, Fee, Gianni, Ioanna, Kyriaco, Pari, Sofia, Stathi, you brought me joy every time I was around you. Thank you.

During my PhD studies I did two internships, both of which were incredibly rewarding experiences. In the summer of 2017, I interned at Bloomberg in London. Thank you Edgar, Ridho, Abhinav, Anju, Malvina, Miles, Diego for being great hosts. Nicoletta, Thiago, thank you for looking out for me. Iakove, Dora, Leo, Marilena, for all the fun times. In the summer of 2018, I interned at Amazon in Barcelona. Roi, Hugo, Lluis, Vassili, Marc, Jordi, thank you for making my stay memorable.

Thank you Maarten, Paul, Simone, Suzan and Vangeli for agreeing to serve in my committee. Giorgo, Katya, thank you for being my paranymphs. Harrie, thank you for translating my thesis summary to Dutch.

Next, my family. Papa, Mamma, your unconditional love and the example you have set have taught me the importance of knowledge and context. Thank you for enduring my absence all these years. This thesis is dedicated to you. Christofore, Evro, thank you for the (silent) support. Pappou, for the inspiration.

Iris, thank you. You are my only home.

Nikos Voskarides
Lefkosia, January 2021

Contents
I   Making Structured Knowledge more Accessible to the User
II  Improving Interactive Knowledge Gathering
III Supporting Knowledge Exploration for Narrative Creation
Bibliography
Summary
Samenvatting
Introduction
Search engines leverage large repositories of knowledge to improve information access [115, 128, 147]. These repositories may store unstructured knowledge such as textual documents or social media posts, or structured knowledge such as attributes of and relationships between real-world objects and topics.

In order to effectively leverage knowledge, search engines should account for context, i.e., additional information about the user and the query [7, 18, 165, 194]. In this thesis, we study how to support search engines in leveraging knowledge while accounting for different types of context: (1) context that the search engine proactively provides to enrich search results (e.g., information on tourist attractions when searching for a city), (2) context that stems from the interactions between the user and the search engine in a conversational search session, or (3) context provided by the user to specify a broad query.

Search engine result pages (SERPs) present information that is meant to be relevant to the user's query [13, 78, 168]. Apart from the traditional "ten blue links", modern SERPs are increasingly being enriched with additional context that often comes from structured knowledge sources to enhance the user experience [47]. Knowledge graphs (KGs), which store world knowledge in the form of facts, are the most prominent structured knowledge source for search engines [25, 106, 147]. This is natural since the majority of queries issued to search engines contain entities stored in KGs [70]. KGs are used to support different components of modern SERPs, such as direct answers to user queries and KG panels [208], which present facts about the entity identified in the query and other, related entities to support exploratory search (see Figure 1.1) [21, 25, 75, 120]. A challenge that arises when presenting such structured knowledge in a SERP is that it is stored in a formal form, not directly suitable for presentation to the user. Tackling this challenge is the focus of the first research theme of this thesis, where we study how to make structured knowledge more accessible to the search engine user.

Figure 1.1: Part of a SERP KG panel in response to the query "Bill Gates" (split in two parts).

Users interact with the search results presented to them in multiple ways and they provide signals that may be used by search engines to improve the user experience, for instance by continuously learning better ranking functions [38, 79, 80, 84, 88]. Recent advances in natural language processing and deep learning have enabled the wide-spread use of interactive systems in real-world applications [64], which, in turn, has fueled a resurgence of research in conversational search [6, 45, 46, 155, 210]. In conversational search, the user interacts with the search engine during relatively short sessions to gather knowledge over large unstructured knowledge repositories [16, 43]. A prominent challenge in conversational search is that the search engine has to keep track of the evolving context during the conversation so as to enable more natural interactions. Addressing this challenge is the focus of the second research theme of this thesis, where we study how to identify relevant context from the conversation history in order to improve interactive knowledge gathering.

Search engines facilitate knowledge gathering for different types of users. A large portion of research in information retrieval has focused on how to answer information needs of users in web search [27, 117, 124, 211]. In contrast to web search, in professional search, users express their information needs in a different way and aim to access and explore domain-specific knowledge [91, 157, 183]. Writers are a type of professional users who heavily rely on search engines [74, 173]. For instance, writers in the scientific domain use search engines to find relevant references to include in their articles [19, 76]. Another prominent example of professional search engine users are writers in the news domain [44, 51]. Such writers create narratives around specific events and use search engines to support them in this process [39, 93, 161]. In the third research theme of this thesis, we study how to support writers in exploring unstructured knowledge about past events given an incomplete narrative that specifies a main event and a context.
Research Outline and Questions

This thesis focuses on three research themes aimed at supporting search engines with knowledge and context: (1) making structured knowledge more accessible to the user by describing and contextualizing KG facts (Chapters 2, 3 and 4), (2) improving interactive knowledge gathering by identifying relevant context in conversational search (Chapter 5), and (3) supporting knowledge exploration for narrative creation by retrieving event-focused news articles in context (Chapter 6).

Below, we describe the main research questions for each chapter. In each chapter we describe more fine-grained subquestions that we ask to answer each main research question.
SERPs often include structured knowledge for queries that mention real-world entities in the form of KG facts. Facts are stored in KGs in a formal form (e.g., ⟨Bill Gates, founderOf, Microsoft⟩). When presenting a KG fact to the user, however, it is more natural to use human-readable descriptions that verbalize and contextualize the fact [66]. For instance, a possible description of the KG fact ⟨Bill Gates, founderOf, Microsoft⟩ is: Bill Gates is an American business magnate and the principal founder of Microsoft Corporation.
In our first study (Chapter 2), we cast the problem of finding such descriptions as a retrieval task:

RQ1 Given a KG fact and a text corpus, can we retrieve textual descriptions of the fact from the text corpus?

We propose a method that first extracts and enriches candidate sentences that may be referring to the entities of the fact from a text corpus, and then ranks those sentences. Our results show that we can reliably retrieve sentences that accurately describe a given fact, under the condition that a relevant sentence exists in the underlying text corpus.

However, it is likely that this condition does not hold in cases where a given fact is not explicitly described in the text corpus at hand. This limits the applicability of our proposed method in real-world scenarios. In order to address this limitation, in our second study (Chapter 3), we consider a text generation task:
RQ2 Given a KG fact, can we automatically generate a textual description of the fact in the absence of an existing description?

We propose to first create sentence templates for each relationship in the KG using existing fact descriptions. Then, given a KG fact that expresses a specific relationship, we select a relevant template and fill it using additional information from the KG (other facts), if needed. We find that our method can generate contextually rich descriptions and is robust against KG incompleteness.

KG fact descriptions often contain mentions of other, related facts that provide additional context and thus increase the user's understanding of the fact as a whole (e.g., Bill Gates founded Microsoft with Paul Allen). Given the large size of KGs, many facts could potentially be relevant to the fact of interest, thus we need to automate the task of finding those other facts. This is the focus of our next study (Chapter 4):
RQ3 Can we contextualize a KG query fact by retrieving other, related KG facts?

We propose a method that first enumerates other candidate facts in the neighborhood of the query fact and then ranks those facts with respect to their relevance to the query fact. We propose the neural fact contextualization method (NFCM), a neural ranking model that combines automatically learned and hand-crafted features. In addition, we propose to use a distant supervision method to automatically gather training data for NFCM. We find that NFCM outperforms several baseline methods and that distant supervision is effective for this task.
The ultimate goal of conversational AI is interactive knowledge gathering [64]. Search engines can play a crucial role towards achieving that goal. An interactive search engine should support conversational search, where a user aims to interactively find information stored in large unstructured knowledge repositories [45].

In our next study (Chapter 5), we focus on multi-turn passage retrieval as an instance of conversational search [46]. Here, the query at the current turn may be underspecified. Thus, we need to identify relevant context from the conversation history to arrive at a better expression of the query. We answer the following research question:
RQ4 Can we use query resolution to identify relevant context and thereby improve retrieval in conversational search?

Here, query resolution refers to the task of adding missing context from the conversation history to the current turn query, if needed. We propose to model query resolution as a term classification task. We design query resolution by term classification (QuReTeC), a neural query resolution model based on bidirectional transformers. Since obtaining human-curated training data specifically for query resolution may be cumbersome, we propose a distant supervision method that automatically generates supervision data for QuReTeC using query-passage relevance pairs. We find that when integrating QuReTeC in a multi-stage ranking architecture we can significantly outperform baseline models. In addition, we find that the distant supervision method we propose can substantially reduce the amount of human-curated training data required to train QuReTeC.
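To make the term-classification framing concrete, the sketch below assembles a resolved query once each conversation-history term has been labeled as relevant or not. The `classify_term` stub and the stop word list are illustrative assumptions; in QuReTeC this decision is made by a trained bidirectional transformer, which is not reproduced here.

```python
from typing import Callable, List, Set

STOPWORDS = {"the", "a", "an", "of", "is", "was", "what", "who", "when", "where", "how", "did"}


def resolve_query(history: List[str], current: str,
                  classify_term: Callable[[str, List[str], str], bool]) -> str:
    """Expand the current-turn query with history terms labeled as relevant.

    `classify_term` stands in for a trained term classifier: it only has to
    answer whether a given history term should be added to the current query.
    """
    seen: Set[str] = set(current.lower().split())
    added = []
    for turn in history:
        for term in turn.lower().split():
            term = term.strip(".,?!")
            if not term or term in STOPWORDS or term in seen:
                continue
            if classify_term(term, history, current):
                added.append(term)
                seen.add(term)
    # The resolved query is the current query plus the selected history terms.
    return current + " " + " ".join(added) if added else current


# Toy usage with a keyword-lookup "classifier" in place of the neural model.
history = ["Tell me about Bill Gates.", "When did he found Microsoft?"]
current = "Who was his co-founder?"
print(resolve_query(history, current,
                    classify_term=lambda t, h, c: t in {"bill", "gates", "microsoft"}))
# Who was his co-founder? bill gates microsoft
```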
Writers such as journalists often use search engines to find relevant material to include in event-oriented narratives [51, 83, 140]. Such material can provide background knowledge on the event itself or connections to other events that can help writers generate new angles on the narrative and thus better engage the reader [39, 93]. Previous work has focused on exploring knowledge for narrative creation from different sources, such as social media [44, 52, 213], or from sources with a more narrow scope, such as political speeches [113].

In our next study (Chapter 6), we focus on supporting knowledge exploration from a corpus of event-centric news articles for narrative creation. More specifically, we study a real-world scenario where the writer has already generated an incomplete narrative that specifies a main event and a context, and aims to retrieve relevant news articles that discuss other events from the past. We answer the following research question:
RQ5 Can we support knowledge exploration for event-centric narrative creation by performing news article retrieval in context?

We formally define this task and propose a retrieval dataset construction procedure that relies on existing news articles to simulate incomplete narratives and relevant articles. We conduct experiments on two datasets derived from this procedure and find that state-of-the-art lexical and semantic rankers are not sufficient for this task. We find that combining those rankers with one that ranks articles by reverse chronological order outperforms those rankers alone. We also perform an in-depth quantitative and qualitative analysis of the results along different dimensions to acquire insights into the characteristics of this task.
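Chapter 6 specifies how the rankers are combined; as a hedged illustration of the general idea of fusing a relevance ranker with a reverse-chronological one, the sketch below uses reciprocal rank fusion (RRF). RRF is an assumption made here for illustration only, not necessarily the combination method used in the chapter.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several rankings of article ids; a higher fused score ranks higher."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Toy example: a lexical ranking fused with a reverse-chronological ranking.
lexical = ["a12", "a07", "a33", "a21"]      # by BM25-style relevance
by_recency = ["a21", "a12", "a33", "a07"]   # newest article first
print(reciprocal_rank_fusion([lexical, by_recency]))
```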
Main Contributions

In this section, we summarize the main contributions of this thesis.
Theoretical contributions
1. We formalize the task of retrieving knowledge graph fact descriptions stored in a text corpus (Chapter 2).
2. We formalize the task of generating knowledge graph fact descriptions (Chapter 3).
3. We formalize the task of knowledge graph fact contextualization (Chapter 4).
4. We formulate the task of query resolution for conversational search as term classification (Chapter 5).
5. We formalize the task of news article retrieval in context for event-centric narrative creation (Chapter 6).
Algorithmic contributions
6. A learning to rank method that combines a rich set of features for retrieving knowledge graph fact descriptions (Chapter 2).
7. A method for generating knowledge graph fact descriptions by template construction and filling (Chapter 3).
8. Neural fact contextualization method (NFCM), a method for contextualizing knowledge graph facts, and a distant supervision method for gathering training data automatically (Chapter 4).
9. Query resolution by term classification (QuReTeC), a method for query resolution for multi-turn passage ranking, and a distant supervision method for gathering training data automatically (Chapter 5).
10. A retrieval dataset construction procedure for the task of news article retrieval in context for event-centric narrative creation (Chapter 6).
Empirical contributions
11. Retrieving knowledge graph fact descriptions (Chapter 2)
    (a) Empirical comparison of our proposed learning to rank model and other sentence retrieval methods.
    (b) Empirical comparison of relationship-dependent models against an independent model.
    (c) Analysis of how different feature types contribute to the performance of our model and an error analysis of common errors made by our model.
12. Generating knowledge graph fact descriptions (Chapter 3)
    (a) Empirical comparison of different methods by automatic and manual evaluation.
    (b) Analysis of specific cases where our method succeeds or fails.
13. Contextualizing knowledge graph facts (Chapter 4)
    (a) Empirical comparison of NFCM and heuristic baselines.
    (b) We show that learning ranking functions using distant supervision is beneficial.
    (c) Analysis of the effect of handcrafted and automatically learned features on retrieval effectiveness.
14. Query resolution for conversational search (Chapter 5)
    (a) Empirical comparison of QuReTeC and multiple baselines of different nature.
    (b) We show that distant supervision can substantially reduce the amount of gold standard training data needed to train QuReTeC.
    (c) Qualitative analysis of specific cases where our method succeeds or fails.
15. News article retrieval in context (Chapter 6)
    (a) Empirical comparison of state-of-the-art lexical rankers on this task.
    (b) We show that a combination of lexical and semantic rankers with one that ranks articles by reverse chronological order outperforms those rankers alone.
    (c) An in-depth quantitative and qualitative analysis of the performance of the rankers under comparison among different dimensions.
Resources
16. A manually annotated dataset for knowledge graph fact description retrieval.
17. An automatically extracted dataset for knowledge graph fact description generation.
18. A manually annotated dataset for knowledge graph fact contextualization.
19. An open source implementation of QuReTeC.
20. An automatically extracted dataset for news article retrieval in context.
Thesis Overview

The thesis is organized in three parts.

In the first part we study how to make KG facts more accessible to users in search applications. Specifically, given a specific KG fact, we study how to retrieve textual descriptions of the fact (Chapter 2), how to generate a textual description of the fact in the absence of an existing description (Chapter 3), and how to retrieve other KG facts to contextualize the fact (Chapter 4).

In the second part we study how to improve interactive knowledge gathering by performing query resolution for multi-turn passage retrieval (Chapter 5).

In the third part we study how to support narrative creation by performing news article retrieval in context (Chapter 6).

In Chapter 7 we conclude the thesis and discuss directions for future work.
Below we list which publication is the origin of each chapter.
Chapter 2 is based on the conference paper: N. Voskarides, E. Meij, M. Tsagkias, M. de Rijke, and W. Weerkamp. Learning to explain entity relationships in knowledge graphs. In ACL-IJCNLP. ACL, 2015 [185].
NV designed the method and ran the experiments. EM helped with algorithmic design. All authors contributed to the text, NV did most of the writing.
Chapter 3 is based on the conference paper: N. Voskarides, E. Meij, and M. de Rijke. Generating descriptions of entity relationships. In ECIR. Springer, 2017 [186].
NV designed the method and ran the experiments. All authors contributed to the text, NV did most of the writing.
Chapter 4 is based on the conference paper: N. Voskarides, E. Meij, R. Reinanda, A. Khaitan, M. Osborne, G. Stefanoni, K. Prabhanjan, and M. de Rijke. Weakly-supervised contextualization of knowledge graph facts. In SIGIR. ACM, 2018 [187].
NV designed the method and ran the experiments. EM, RR contributed to the experimental design. AK helped with the infrastructure. All authors contributed to the text, NV did most of the writing.
Chapter 5 is based on the conference paper: N. Voskarides, D. Li, P. Ren, E. Kanoulas, and M. de Rijke. Query resolution for conversational search with limited supervision. In SIGIR. ACM, 2020 [189].
NV designed the method and ran the experiments. DL contributed to the experimental design and ran baseline models. All authors contributed to the text, NV did most of the writing.
Chapter 6 is based on the conference paper: N. Voskarides, E. Meij, S. Sauer, and M. de Rijke. News article retrieval in context for event-centric narrative creation. Under submission, 2020 [190].
NV designed the method and ran the experiments. All authors contributed to the text, NV did most of the writing.

The thesis also indirectly benefited from insights gained from the following publications:
• N. Voskarides, D. Odijk, M. Tsagkias, W. Weerkamp, and M. de Rijke. Query-dependent contextualization of streaming data. In ECIR. Springer, 2014 [184].
• N. Voskarides, D. Li, A. Panteli, and P. Ren. ILPS at TREC 2019 Conversational Assistant Track. TREC, NIST, 2019 [188].
• G. Sidiropoulos, N. Voskarides, and E. Kanoulas. Knowledge graph simple question answering for unseen domains. In AKBC, 2020 [166].
• F. Sarvi, N. Voskarides, L. Mooiman, S. Schelter, and M. de Rijke. A comparison of supervised learning to match methods for product search. In eCOM 2020: The 2020 SIGIR Workshop on eCommerce. ACM, 2020 [160].
• A. M. Krasakis, M. Aliannejadi, N. Voskarides, and E. Kanoulas. Analysing the effect of clarifying questions on document ranking in conversational search. In ICTIR. ACM, 2020 [95].

Part I
Making Structured Knowledge more Accessible to the User

Retrieving Knowledge Graph Fact Descriptions
In the first part of this thesis, we study how to make structured knowledge more accessible to the user. In this chapter, we aim to answer RQ1: Given a KG fact and a text corpus, can we retrieve textual descriptions of the fact from the text corpus?

Knowledge graph (KG) facts express entity relationships in a formal form. In the scope of this chapter we use the term "explaining entity relationships" as an alias for "retrieving KG fact descriptions".
Knowledge graphs are a powerful tool for supporting a large spectrum of search applications including ranking, recommendation, exploratory search, and web search [56]. A knowledge graph aggregates information around entities across multiple content sources and links these entities together, while at the same time providing entity-specific properties (such as age or employer) and types (such as actor or movie).

Although there is a growing interest in automatically constructing knowledge graphs, e.g., from unstructured web data [42, 60, 193], the problem of providing evidence on why two entities are related in a knowledge graph remains largely unaddressed. Extracting and presenting evidence for linking two entities, however, is an important aspect of knowledge graphs, as it can enforce trust between the user and a search engine, which in turn can improve long-term user engagement, e.g., in the context of related entity recommendation [21]. Although knowledge graphs exist that provide this functionality to a certain degree (e.g., when hovering over Google's suggested entities, see Figure 2.1), to the best of our knowledge there is no previously published research on methods for entity relationship explanation.

Figure 2.1: Part of Google's search result page for the query "barack obama". When hovering over the related entity "Michelle Obama", an explanation of the relationship between her and "Barack Obama" is shown.

In this chapter we propose a method for explaining the relationship between two entities, which we evaluate on a newly constructed annotated dataset that we make publicly available. In particular, we consider the task of explaining relationships between pairs of Wikipedia entities. We aim to infer a human-readable description for an entity pair given a relationship between the two entities. Since Wikipedia does not explicitly define relationships between entities we use a knowledge graph to obtain these relations. We cast our task as a sentence ranking problem: we automatically extract sentences from a corpus and rank them according to how well they describe a given relationship between a pair of entities. For ranking purposes, we extract a rich set of features and use learning to rank to effectively combine them. Our feature set includes both traditional information retrieval and natural language processing features that we augment with entity-dependent features. These features leverage information from the structure of the knowledge graph. On top of this, we use features that capture the presence in a sentence of the relationship of interest. For our evaluation we focus on "people" entities and we use a large, manually annotated dataset of sentences.

This chapter was published as [185].
We break down RQ1 into three research sub-questions. First, we ask what the effectiveness of state-of-the-art sentence retrieval models is for explaining a relationship between two entities (RQ1.1). Second, we consider whether we can improve over sentence retrieval models by casting the task in a learning to rank framework (RQ1.2). Third, we examine whether we can further improve performance by using relationship-dependent models instead of a relationship-independent one (RQ1.3). We complement these research questions with an error and feature analysis.

Our main contributions are a robust and effective method for explaining entity relationships, detailed insights into the performance of our method and features, and a manually annotated dataset.
Related Work

We combine ideas from sentence retrieval, learning to rank, and question answering to address the task of explaining relationships between entities.

Previous work that is closest to the task we address in this chapter is that of Blanco and Zaragoza [20] and Fang et al. [61]. First, Blanco and Zaragoza [20] focus on finding and ranking sentences that explain the relationship between an entity and a query. Our work is different in that we want to explain the relationship between two entities, rather than a query. Fang et al. [61] explore the generation of a ranked list of knowledge base relationships for an entity pair. Instead, we try to select sentences that describe a particular relationship, assuming that this is given.

Our approach builds on sentence retrieval, where one retrieves sentences rather than documents that answer an information need. Document retrieval models such as tf-idf, BM25, and language modeling [11] have been extended to tackle sentence retrieval. Three of the most successful sentence retrieval methods are TFISF [8], which is a variant of the vector space model with tf-idf weighting, language modeling with local context [62, 126], and a recursive version of TFISF that accounts for local context [54]. TFISF is very competitive compared to document retrieval models tuned specifically for sentence retrieval (e.g., BM25 and language modeling [110]) and we therefore include it as a baseline.

Sentences that are suitable for explaining relationships can have attributes that are important for ranking but cannot be captured by term-based retrieval models. One way to combine a wide range of ranking features is learning to rank (LTR). Recent years have witnessed a rapid increase in the work on learning to rank, and it has proven to be a very successful method for combining large numbers of ranking features, for web search, but also other information retrieval applications [3, 28, 171]. We use learning to rank and represent each sentence with a set of features that aim to capture different dimensions of the sentence.

Question answering (QA) is the task of providing direct and concise answers to questions formed in natural language [77]. QA can be regarded as a similar task to ours, assuming that the combination of entity pair and relationship form the "question" and that the "answer" is the sentence describing the relationship of interest. Even though we do not follow the QA paradigm in this chapter, some of the features we use are inspired by QA systems. In addition, we employ learning to rank to combine the devised features, which has recently been successfully applied for QA [3, 171].
We address the problem of explaining relationships between pairs of entities in a knowledge graph. We operationalize the problem as a problem of ranking sentences from documents in a corpus that is related to the knowledge graph. More specifically, given two entities e_i and e_j that form an entity pair ⟨e_i, e_j⟩, and a relation r between them, the task is to extract a set of candidate sentences S_ij = {s_ij^1, ..., s_ij^k} that refer to ⟨e_i, e_j⟩ and to impose a ranking on the sentences in S_ij. The relation r has the general form ⟨type(e_i), terms(r), type(e_j)⟩, where type(e) is the type of the entity e (e.g., Person or Actor) and terms(r) are the terms of the relation (e.g., CoCastsWith or IsSpouseOf).

We are left with two specific tasks: (1) extracting candidate sentences S_ij, and (2) ranking S_ij, where the goal is to have sentences that provide a perfect explanation of the relationship at the top position of the ranking. The next section describes our methods for both tasks.

We follow a two-step approach for automatically explaining relationships between entity pairs. First, in Section 2.4.1, we extract and enrich sentences that refer to an entity pair ⟨e_i, e_j⟩ from a corpus in order to construct a set of candidate sentences. Second, in Section 2.4.2, we extract a rich set of features describing the entities' relationship r and use supervised machine learning in order to rank the sentences in S_ij according to how well they describe the relationship r.

To create a set of candidate sentences for a given entity pair and relationship, we require a corpus of documents that is pertinent to the entities at hand. Although any kind of document collection can be used, we focus on Wikipedia in this chapter, as it provides good coverage for the majority of entities in our knowledge graph.

First, we extract surface forms for the given entities: the title of the entity's Wikipedia article (e.g., "Barack Obama"), the titles of all redirect pages linking to that article (e.g., "Obama"), and all anchor text associated with hyperlinks to the article within Wikipedia (e.g., "president obama"). We then split all Wikipedia articles into sentences and consider a sentence as a candidate if (i) the sentence is part of either entity's Wikipedia article and contains a surface form of, or a link to, the other entity; or (ii) the sentence contains surface forms of, or links to, both entities in the entity pair.

Next, we apply two sentence enrichment steps for (i) making sentences self-contained and readable outside the context of the source document and (ii) linking the sentences to entities. For (i), we replace pronouns in candidate sentences with the title of the entity. We apply a simple heuristic for the people entities, inspired by [197]: we count the frequency of the terms "he" and "she" in the article for determining the gender of the entity, and we replace the first appearance of "he" or "she" in each sentence with the entity's title. (We experimented with the Stanford co-reference resolution system [101] and Apache OpenNLP and found that they were not able to consistently achieve the level of effectiveness that we require.) We skip this step if any surface form of the entity occurs in the sentence.

For (ii), we apply entity linking to provide links from the sentence to additional entities [121].
This need arises from the fact that not every sentence in an article contains explicit links to the entities it mentions, as Wikipedia guidelines only allow one link to another article in the article's text (see http://en.Wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking). The algorithm takes a sentence as input and iterates over n-grams that are not yet linked to an entity. If an n-gram matches a surface form of an entity, we establish a link between the n-gram and the entity. We restrict our search space to entities that are linked from within the source article of the sentence and from within articles to which the source article links. This way, our entity linking method achieves high precision as almost no disambiguation is necessary.

As an example, consider the sentence "He gave critically acclaimed performances in the crime thriller Seven. . . " on the Wikipedia page for Brad Pitt. After applying our enrichment steps, we obtain "Brad Pitt gave critically acclaimed performances in the crime thriller Seven. . . ", making the sentence human readable and linking it to the entities Brad Pitt and Seven (1995 film).
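A minimal sketch of the two enrichment steps described above is given below, assuming that surface-form dictionaries for the relevant entities are already available. The function names and the simple whitespace tokenization are illustrative assumptions, not the published implementation.

```python
import re
from typing import Dict, Set


def guess_gender(article_text: str) -> str:
    """Heuristic from the text: compare the frequency of "he" and "she" in the article."""
    he = len(re.findall(r"\bhe\b", article_text, flags=re.IGNORECASE))
    she = len(re.findall(r"\bshe\b", article_text, flags=re.IGNORECASE))
    return "male" if he >= she else "female"


def replace_first_pronoun(sentence: str, entity_title: str, gender: str,
                          surface_forms: Set[str]) -> str:
    """Replace the first "he"/"she" with the entity title, unless a surface form
    of the entity already occurs in the sentence (the skip condition above)."""
    if any(sf.lower() in sentence.lower() for sf in surface_forms):
        return sentence
    pronoun = "he" if gender == "male" else "she"
    return re.sub(rf"\b{pronoun}\b", entity_title, sentence, count=1, flags=re.IGNORECASE)


def link_entities(sentence: str, candidate_entities: Dict[str, str],
                  max_ngram: int = 4) -> Dict[str, str]:
    """Link n-grams to entities whose surface forms they match exactly.

    `candidate_entities` maps a lowercased surface form to an entity id and is
    restricted to entities linked from the source article or its neighbours,
    which keeps disambiguation to a minimum.
    """
    tokens = sentence.split()
    links: Dict[str, str] = {}
    for n in range(max_ngram, 0, -1):          # prefer longer matches first
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            entity = candidate_entities.get(ngram.lower())
            if entity and ngram not in links:
                links[ngram] = entity
    return links


sentence = "He gave critically acclaimed performances in the crime thriller Seven"
sentence = replace_first_pronoun(sentence, "Brad Pitt", "male", {"Brad Pitt", "Pitt"})
print(sentence)
print(link_entities(sentence, {"brad pitt": "Brad_Pitt", "seven": "Seven_(1995_film)"}))
```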
After extracting candidate sentences, we rank them by how well they describe the relationship of interest r between entities e_i and e_j. There are many signals beyond simple term statistics that can indicate relevance. Automatically constructing a ranking model using supervised machine learning techniques is therefore an obvious choice. For ranking we use learning to rank (LTR) and represent each sentence with a rich set of features. Tables 2.1 and 2.2 list the features we use. Below we provide a brief description of the more complex ones.
Text features. This feature type regards the importance of the sentence s at the term level. We compute the density of s (feature 4) as:

density(s) = \frac{1}{K \cdot (K + 1)} \sum_{j=1}^{n} \frac{idf(t_j) \cdot idf(t_{j+1})}{distance(t_j, t_{j+1})},    (2.1)

where K is the number of keyword terms in s and distance(t_j, t_{j+1}) is the number of non-keyword terms between keyword terms t_j and t_{j+1}. We treat stop words and numbers in s as non-keywords and the remaining terms as keywords. Features 5–8 capture the distribution of part-of-speech tags in the sentence.
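As a concrete illustration of Equation 2.1, the sketch below computes the density feature from precomputed IDF values. The keyword test (stop words and numbers are non-keywords) follows the description above; the stop word list, the helper names, and the +1 guard for adjacent keywords are assumptions made for this sketch.

```python
from typing import Dict, List

STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "was", "with"}


def is_keyword(term: str) -> bool:
    return term.lower() not in STOPWORDS and not term.isdigit()


def sentence_density(terms: List[str], idf: Dict[str, float]) -> float:
    """Density of a sentence (Equation 2.1) given per-term IDF values."""
    positions = [i for i, t in enumerate(terms) if is_keyword(t)]
    keywords = [terms[i] for i in positions]
    k = len(keywords)
    if k < 2:
        return 0.0
    total = 0.0
    for (p1, t1), (p2, t2) in zip(zip(positions, keywords),
                                  zip(positions[1:], keywords[1:])):
        gap = p2 - p1 - 1        # non-keyword terms between consecutive keywords
        # gap + 1 avoids division by zero for adjacent keywords; this guard is an
        # implementation choice, not specified in the text.
        total += idf.get(t1, 0.0) * idf.get(t2, 0.0) / (gap + 1)
    return total / (k * (k + 1))


idf = {"Pitt": 5.1, "crime": 3.2, "thriller": 4.0, "Seven": 4.7}
print(sentence_density("Brad Pitt starred in the crime thriller Seven".split(), idf))
```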
Entity features. These features partly build on [118, 177] and describe the entities and are dependent on the knowledge graph. Whether e_i or e_j is the first appearing entity in a sentence might be an indicator of importance (feature 13). The spread of e_i and e_j in the sentence (feature 14) might be an indicator of their centrality in the sentence [20]. Features 15–22 capture the distribution of part-of-speech tags in a window of four words around e_i or e_j in s [122], complemented by the number of entities between, to the left of, and to the right of the entity pair (features 23–25).

We assume that two articles that have many common articles that point to them are strongly related [196]. We hypothesize that, if a sentence contains common inlinks from e_i and e_j, the sentence might contain important information about their relationship. Hence, we add whether the sentence contains a common link (feature 26) and the number of common links (feature 27) as features. We score a common link l between e_i and e_j using:

score(l, e_i, e_j) = sim(l, e_i) \cdot sim(l, e_j),    (2.2)

where sim(·, ·) is defined as the similarity between two Wikipedia articles, computed using a variant of Normalized Google Distance [196]. Feature 28 then measures the sum of the scores of the common links.

Previous research shows that using surrounding sentences is beneficial for sentence retrieval [54]. We therefore consider the number of common links in the previous and next sentence (features 29–30).

Table 2.1: Text and entity features used for sentence ranking.

Text features
1      Sentence length              Length of s in words
2      Sum of idf                   Sum of IDF of terms of s in Wikipedia
3      Average idf                  Average IDF of terms of s in Wikipedia
4      Sentence density             Lexical density of s, see Equation 2.1 [100]
5–8    POS fractions                Fraction of verbs, nouns, adjectives, others in s [122]

Entity features
9      Number of entities           Number of entities in s
10     Link to e_i                  Whether s contains a link to the entity e_i
11     Link to e_j                  Whether s contains a link to the entity e_j
12     Links to e_i and e_j         Whether s contains links to both entities e_i and e_j
13     Entity first                 Is e_i or e_j the first entity in the sentence?
14     Spread of e_i, e_j           Distance between the last match of e_i and e_j in s [20]
15–22  POS fractions left/right     Fraction of verbs, nouns, adjectives, others in the left/right window of e_i and e_j in s [122]
23–25  Entities between/left/right  Number of entities between, to the left of, and to the right of e_i and e_j in s
26     Common links e_i, e_j        Whether s contains any common link of e_i and e_j
27     Number of common links       Number of common links of e_i and e_j in s
28     Score common links e_i, e_j  Sum of the scores of the common links of e_i and e_j in s
29–30  Common links prev/next       Number of common links of e_i and e_j in the previous/next sentence of s
Table 2.2: Relationship and source features used for sentence ranking.

Relationship features
31     Match terms(r)?              Whether s contains any term in terms(r)
32     Match wordnet(r)?            Whether s contains any phrase in wordnet(r)
33     Match word2vec(r)?           Whether s contains any phrase in word2vec(r)
39     Average word2vec(r)          Average cosine similarity of phrases in word2vec(r) that are matched in s
40     Maximum word2vec(r)          Maximum cosine similarity of phrases in word2vec(r) that are matched in s
41     Sum word2vec(r)              Sum of cosine similarity of phrases in word2vec(r) that are matched in s
42     Score LC                     Lucene score of s with titles(e_i, e_j), terms(r), wordnet(r), word2vec(r) as query
43     Score R-TFISF                R-TFISF score of s with queries constructed as above

Source features
44     Sentence position            Position of s in the document from which it originates
45     From e_i or e_j?             Does s originate from the Wikipedia article of e_i or e_j?
46     Occurrences of e_i or e_j    Number of occurrences of e_i or e_j in the document from which s originates, inspired by document smoothing for sentence retrieval [125]

Relationship features. Feature 31 indicates whether any of the relationship-specific terms occurs in the sentence. Only matching the terms in the relationship may have low coverage since terms such as "spouse" may have many synonyms and/or highly related terms, e.g., "husband" or "married". Therefore, we use WordNet to find synonym phrases of r (feature 32); we refer to this method as wordnet(r).

Alternatively, we use word embeddings to find such similar phrases [119]. Such embeddings take a text corpus as input and learn vector representations of words and phrases consisting of real numbers. Given the set V_r consisting of the vector representations of all the relationship terms and the set V which consists of the vector representations of all the candidate phrases in the data, we calculate the distance between a candidate phrase represented by a vector v_i ∈ V and the vectors in V_r as:

distance(v_i, V_r) = \cos\Big(v_i, \sum_{v_j \in V_r} v_j\Big),    (2.3)

where \sum_{v_j \in V_r} v_j is the element-wise sum of the vectors in V_r and the distance between two vectors is measured using cosine similarity. The candidate phrases in V are then ranked using Equation 2.3 and the top-m phrases are selected, resulting in features 33, 39, 40, and 41; we refer to the ranked set of phrases that are selected using this procedure as word2vec(r).

In addition, we employ state-of-the-art retrieval functions and include the scores for queries that are constructed using the entities e_i and e_j, the relation r, wordnet(r), and word2vec(r). We use the titles of the entity articles titles(e) to represent the entities in the query and two ranking functions, Recursive TFISF (R-TFISF) and LC (features 42–43). (In preliminary experiments R-TFISF and LC were the best performing among a pool of sentence retrieval methods.) TFISF is a sentence retrieval model that determines the level of relevance of a sentence s given a query q as:

R(s, q) = \sum_{t \in q} \log(tf_{t,q} + 1) \cdot \log(tf_{t,s} + 1) \cdot \log\Big(\frac{n + 1}{0.5 + sf_t}\Big),    (2.4)

where tf_{t,q} and tf_{t,s} are the number of occurrences of term t in the query q and the sentence s respectively, sf_t is the number of sentences in which t appears, and n is the number of sentences in the collection. R-TFISF is an improved extension of the TFISF method [54], which incorporates context from neighboring sentences in the ranking function:

R_c(s, q) = (1 - \mu) R(s, q) + \mu [R_c(s_{prev}(s), q) + R_c(s_{next}(s), q)],

where \mu is a free parameter and s_{prev}(s) and s_{next}(s) indicate functions to retrieve the previous and next sentence, respectively. We use a maximum of three recursive calls.
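The sketch below illustrates how candidate phrases could be ranked against the relationship terms with Equation 2.3, assuming phrase vectors are already available (e.g., from a trained word2vec model). The interface and the toy vectors are illustrative, not the published implementation.

```python
import numpy as np
from typing import Dict, List, Tuple


def expand_relation(relation_terms: List[str], phrase_vectors: Dict[str, np.ndarray],
                    m: int = 30) -> List[Tuple[str, float]]:
    """Rank candidate phrases by cosine similarity to the summed relation vector
    (Equation 2.3) and keep the top-m phrases as word2vec(r)."""
    # Element-wise sum of the vectors of the relationship terms.
    v_r = np.sum([phrase_vectors[t] for t in relation_terms if t in phrase_vectors],
                 axis=0)
    v_r_norm = np.linalg.norm(v_r)
    scored = []
    for phrase, vec in phrase_vectors.items():
        if phrase in relation_terms:
            continue
        denom = np.linalg.norm(vec) * v_r_norm
        if denom == 0:
            continue
        scored.append((phrase, float(np.dot(vec, v_r) / denom)))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:m]


# Toy vectors; in the chapter the phrase vectors come from word2vec trained on Wikipedia.
rng = np.random.default_rng(0)
vocab = {p: rng.normal(size=8) for p in ["spouse", "husband", "married", "film"]}
print(expand_relation(["spouse"], vocab, m=2))
```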
Source features. Here, we refer to features that are dependent on the source document of the sentences. We have three such features.
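To make features 42–43 and the retrieval baselines of Section 2.6 concrete, here is a small sketch of TFISF (Equation 2.4) and its recursive variant over a list of sentences. The recursion-depth handling is an assumption based on the "maximum of three recursive calls" mentioned above, and the value of mu is arbitrary since the chapter's setting is not reproduced here.

```python
import math
from collections import Counter
from typing import List


def tfisf(sentence: List[str], query: List[str], sent_freq: Counter, n: int) -> float:
    """TFISF score of a sentence for a query (Equation 2.4)."""
    tf_s, tf_q = Counter(sentence), Counter(query)
    return sum(math.log(tf_q[t] + 1) * math.log(tf_s[t] + 1)
               * math.log((n + 1) / (0.5 + sent_freq[t]))
               for t in tf_q)


def r_tfisf(sentences: List[List[str]], i: int, query: List[str],
            sent_freq: Counter, mu: float = 0.25, depth: int = 3) -> float:
    """Recursive TFISF: mix in the scores of the neighbouring sentences.

    mu is a free parameter; 0.25 is an arbitrary choice for this sketch.
    """
    n = len(sentences)
    base = tfisf(sentences[i], query, sent_freq, n)
    if depth == 0:
        return base
    neigh = 0.0
    if i > 0:
        neigh += r_tfisf(sentences, i - 1, query, sent_freq, mu, depth - 1)
    if i < n - 1:
        neigh += r_tfisf(sentences, i + 1, query, sent_freq, mu, depth - 1)
    return (1 - mu) * base + mu * neigh


sentences = [s.split() for s in ["brad pitt starred in seven",
                                 "the film was directed by david fincher",
                                 "pitt received critical acclaim"]]
# sentence frequency: in how many sentences each term occurs
sent_freq = Counter(t for s in sentences for t in set(s))
print(r_tfisf(sentences, 0, ["brad", "pitt", "film"], sent_freq))
```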
Experimental Setup

In this section we describe the dataset, manual annotations, learning to rank algorithm, and evaluation metrics that we use to answer our research questions.

We draw entities and their relationships from a proprietary knowledge graph that is created from Wikipedia, Freebase, IMDB, and other sources, and that is used by the Yahoo web search engine. We focus on "people" entities and relationships between them. (Note that, except for the co-reference resolution step described in Section 2.4.1, our method does not depend on this restriction.) For our experiments we need to select a manageable set of entities, which we obtain as follows. We consider a year of query logs from a large commercial search engine, count the number of times a user clicks on a Wikipedia article of an entity in the results page and perform stratified sampling of entities according to this distribution. As we are bounded by limited resources for our manual assessments, we sample 1476 entity pairs that together with nine unique relationship types form our experimental dataset.

We use an English Wikipedia dump dated July 8, 2013, containing approximately 4M articles, of which 50638 belong to "people" entities that are also in our knowledge graph. We extract sentences using the approach described in Section 2.4.1, resulting in 36823 candidate sentences for our entities. On average we have 24.94 sentences per entity pair (maximum 423 and minimum 0). Because of the large variance, it is not feasible to obtain exhaustive annotations for all sentences. We rank the sentences using R-TFISF and keep the top-10 sentences per entity pair for annotation. This results in a total of 5689 sentences.

Five human annotators provided relevance judgments, manually judging sentences based on how well they describe the relationship for an entity pair, for which we use a five-level graded relevance scale (perfect, excellent, good, fair, bad). Of all relevance grades 8.1% is perfect, 15.69% excellent, 19.98% good, 8.05% fair, and 48.15% bad. Out of 1476 entity pairs, 1093 have at least one sentence annotated as fair. As is common in information retrieval evaluation, we discard entity pairs that have only "bad" sentences. We examine the difficulty of the task for human annotators by measuring inter-annotator agreement on a subset of 105 sentences that are judged by 3 annotators. Fleiss' kappa is k = 0., which is considered to be moderate agreement. The dataset is available at https://github.com/nickvosk/acl2015-dataset-learning-to-explain-entity-relationships.

For ranking sentences we use a Random Forest (RF) classifier [26]. We set the number of iterations to 300 and the sampling rate to 0.3. (In preliminary experiments, we contrasted RF with gradient boosted regression trees and LambdaMART and found that RF consistently outperformed other methods.) Experiments with varying these two parameters did not show any significant differences. We also tried several feature normalization methods, none of them being able to significantly outperform the runs without feature normalization.

We obtain POS tags using the Stanford part-of-speech tagger and filter out a standard list of 33 English stopwords. For the word embeddings we use word2vec and train our model on all text in Wikipedia using negative sampling and the continuous bag of words architecture. We set the size of the phrase vectors to 500 and m = 30.
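As a sketch of the pointwise ranking setup described above, the snippet below trains a scikit-learn random forest on per-sentence feature vectors and ranks candidate sentences by predicted relevance. Mapping "number of iterations" to n_estimators and "sampling rate" to max_samples is my assumption, and a regressor over the graded labels is used here for simplicity even though the chapter describes an RF classifier; the feature data is random stand-in data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# X: one row of features (Tables 2.1-2.2) per candidate sentence;
# y: graded relevance labels (0=bad ... 4=perfect). Random data as a stand-in.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 46)), rng.integers(0, 5, size=500)

# "300 iterations" -> n_estimators, "sampling rate 0.3" -> max_samples (assumed mapping).
model = RandomForestRegressor(n_estimators=300, max_samples=0.3, bootstrap=True,
                              random_state=0)
model.fit(X_train, y_train)

# Rank the candidate sentences of one entity pair by predicted relevance.
X_candidates = rng.normal(size=(10, 46))
ranking = np.argsort(-model.predict(X_candidates))
print(ranking)
```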
Table 2.3: Results for five baseline variants. See text for their description and significant differences.

Baseline   NDCG@1   NDCG@10   ERR@1    ERR@10
B1         0.7508   0.8961    0.3577   0.4531
B2         0.7511   0.8958    0.3584   0.4530
B3         0.7595   0.8997    0.3696   0.4600
B4         0.7767   0.9070    0.3774   0.4672
B5
We employ two main evaluation metrics in our experiments, NDCG [85] and ERR [33]. The former measures the total accumulated gain from the top of the ranking that is discounted at lower ranks and is normalized by the ideal cumulative gain. The latter models user behavior and measures the expected reciprocal rank at which a user will stop her search. We consider these ranking-based graded evaluation metrics at two cut-off points: position 1, corresponding to showing a single sentence to a user, and 10, which accounts for users who might look at more results. We report on NDCG@1, NDCG@10, ERR@1, ERR@10, and Exc@1, which indicates whether we have an "excellent" or "perfect" sentence at the top of the ranking. Likewise, Per@1 indicates whether we have a "perfect" sentence at the top of the ranking (not all entity pairs have an excellent or a perfect sentence).

We perform 5-fold cross validation and test for statistical significance using a paired two-tailed t-test. We depict a significant difference in performance for p < 0.01 with ▲ (gain) and ▼ (loss) and for p < 0.05 with △ (gain) and ▽ (loss). Boldface indicates the best score for a metric.
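For reference, a small sketch of NDCG@k and ERR@k over graded relevance labels is given below. The gain and stopping-probability definitions follow the standard formulations of [85] and [33] with exponential gain, which is an assumption about the exact variants used in this chapter.

```python
import math
from typing import List


def dcg(gains: List[int], k: int) -> float:
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains: List[int], k: int) -> float:
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


def err_at_k(gains: List[int], k: int, g_max: int = 4) -> float:
    """Expected reciprocal rank: expected 1/rank at which the user stops."""
    err, p_continue = 0.0, 1.0
    for i, g in enumerate(gains[:k], start=1):
        r = (2 ** g - 1) / 2 ** g_max      # probability the user is satisfied at rank i
        err += p_continue * r / i
        p_continue *= 1 - r
    return err


# Relevance grades of a ranked list (4=perfect ... 0=bad).
ranking = [3, 4, 0, 2, 0, 1, 0, 0, 0, 0]
print(ndcg_at_k(ranking, 10), err_at_k(ranking, 10))
```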
Results and Analysis

We compare the performance of typical document retrieval models and state-of-the-art sentence retrieval models in order to answer RQ1.1. We consider five sentence retrieval models: Lucene ranking (LC), language modeling with Dirichlet smoothing (LM), BM25, TFISF, and Recursive TF-ISF (R-TFISF). We follow related work to set µ for R-TFISF and use k = 1 and b = 0 for BM25 and µ = 250 for LM [62].

In our experiments, a query q is constructed using various combinations of surface forms of the two entities e_i and e_j and the relationship r. Each entity in the entity pair can be represented by its title, the titles of any redirect pages pointing to the entity's article, the n-grams used as anchors in Wikipedia to link to the article of the entity, or the union of them all. The relationship r can be represented by the terms in the relationship, synonyms in wordnet(r), or by phrases in word2vec(r).

First, we fix the way we represent r. Baseline B1 does not include any representation of r in the query. B2 includes the relationship terms of r, while B3 includes the relationship terms of r and the synonyms in wordnet(r). B4 includes the terms of r and the phrases in word2vec(r), and B5 includes the relationship terms of r, the synonyms in wordnet(r) and the phrases in word2vec(r). Combining these variations with the entity representations, we find that all combinations that use the titles as representation and R-TFISF as the retrieval function outperform all other combinations. This can be explained by the fact that titles are least ambiguous, thus reducing the possibility of accidentally referring to other entities. BM25 and LC perform almost as well as R-TFISF, with only insignificant differences in performance.

Table 2.3 shows the best performing combination of each baseline, i.e., varying the representation of r and using titles and R-TFISF. B4 and B5 are the best performing baselines, suggesting that word2vec(r) and wordnet(r) are beneficial. B5 significantly outperforms all baselines except B4.

We also experiment with a supervised combination of the baseline rankers using LTR. Here, we consider each baseline ranker as a separate feature and train a ranking model. The trained model is not able to outperform the best individual baseline, however.
Next, we provide the results of our method using the features described in Section 2.4.2, exploring whether our learning to rank (LTR) approach improves over sentence retrieval models (RQ1.2). We compare an LTR model using the features in Tables 2.1 and 2.2 against the best baseline (B5).

Table 2.4 shows the results. Each group in the table contains the results for the entity pairs that have at least one candidate sentence of that relevance grade for B5 and LTR. We find that LTR significantly outperforms B5 by a large margin. The absolute performance difference between LTR and B5 becomes larger for all metrics as we move from "fair" to "perfect," which shows that LTR is more robust than the baseline for entity pairs that have at least one high quality candidate sentence. LTR ranks the best possible sentence at the top of the ranking for ~83% of the cases for entity pairs that contain an "excellent" sentence and for ~72% of the cases for entity pairs that contain a "perfect" sentence.

Note that, as indicated in Section 2.5.1, we discard entity pairs that have only "bad" sentences in our experiments. For the sake of completeness, we report on the results for all entity pairs in our dataset, including those without any relevant sentences, in Table 2.5.

Table 2.4: Results for the best baseline (B5) and the learning to rank method (LTR).

Table 2.5: Results for the best baseline (B5) and the learning to rank method (LTR), using all entity pairs in the dataset, including those without any relevant sentences.
Relevant sentences may have different properties for different relationship types. For example, a sentence describing two entities being partners would have a different form than one describing that two entities costar in a movie. A similar idea was investigated in the context of QA for associating question and answer types [205]. To answer RQ1.3 we examine whether learning a relationship-dependent model improves over learning a single model for all types. We split our dataset per relationship type and train a model per type using 5-fold cross-validation within each. Table 2.6 shows the results. Our method is robust across different relationships in terms of NDCG. However, we observe some variation in ERR, as this metric is more sensitive to the distribution of relevant items than NDCG: the distribution over relevance grades varies per relationship type. For example, it is much more likely to find candidate sentences that have a high relevance grade for ⟨Person, isSpouseOf, Person⟩ than for ⟨Athlete, PlaysSameSportTeamAs, Athlete⟩ in our dataset. We plan to address this issue by exploring other corpora in the future.

The second-to-last row in Table 2.6 shows the averaged results over the different relationship types, which is a significant improvement over LTR for all metrics. This method ranks the best possible sentence at the top of the ranking for ~85% of the cases for entity pairs that contain an "excellent" sentence (~2% absolute improvement over LTR) and for ~75% of the cases for entity pairs that contain a "perfect" sentence (~3% absolute improvement over LTR).

Table 2.6: Results for relationship-dependent models. Similar relationships are grouped together. Relationship groups: ⟨MovieActor, CoCastsWith, MovieActor⟩; ⟨TvActor, CoCastsWith, TvActor⟩; ⟨MovieActor, IsDirectedBy, MovieDirector⟩ / ⟨MovieDirector, Directs, MovieActor⟩; ⟨Person, isChildOf, Person⟩ / ⟨Person, isParentOf, Person⟩; ⟨Person, isPartnerOf, Person⟩ / ⟨Person, isSpouseOf, Person⟩; ⟨Athlete, PlaysSameSportTeamAs, Athlete⟩.

Table 2.7: Results using relationship-dependent models, removing individual feature types.

Features         NDCG@1    NDCG@10   ERR@1    ERR@10
All
All \ text       0.8620    0.9372    0.4606   0.5274
All \ source     0.8598    0.9372    0.4582   0.5261
All \ entity     0.8421▽
All \ relation   0.8183▼
Next, we analyze the impact of the feature types. Table 2.7 shows how performance varies when removing one feature type at a time from the full feature set. Relationship type features are the most important, although entity type features are important as well. This indicates that introducing features based on entities identified in the sentences and the relationship is beneficial for this task. Furthermore, the limited dependency on the source feature type indicates that our method might be able to generalize in other domains. Finally, text type features do contribute to retrieval effectiveness, although not significantly. Note that calculating the sentence features is straightforward, as none of our features requires heavy linguistic analysis.
When looking at errors made by the system, we find that some are due to the fact that entity pairs might have more than one relationship (e.g., actors that costar in movies also being partners) but the selected sentence covers only one of the relationships. (The annotators marked sentences that do not refer to the relationship of interest as "bad" but indicated whether they describe another relationship or not. We plan to account for such cases in future work.) For example, Liza Minnelli is the daughter of Judy Garland, but they have also costarred in a movie, which is the relationship of interest. The model ranks the sentence "Liza Minnelli is the daughter of singer and actress Judy Garland. . . " at the top, while the most relevant sentence is: "Judy Garland performed at the London Palladium with her then 18-year-old daughter Liza Minnelli in November 1964."

Sentences that contain the relationship in which we are interested, but for which this cannot be directly inferred, are another source of error. Consider, for example, the following sentence, which explains that director Christopher Nolan directed actor Christian Bale: "Jackman starred in the 2006 film The Prestige, directed by Christopher Nolan and costarring Christian Bale, Michael Caine, and Scarlett Johansson". Even though the sentence contains the relationship of interest, it focuses on actor Hugh Jackman. The sentence "In 2004, after completing filming for The Machinist, Bale won the coveted role of Batman and his alter ego Bruce Wayne in Christopher Nolan's Batman Begins. . . ", in contrast, refers to the two entities and the relationship of interest directly, resulting in a higher relevance grade. Our method, however, ranks the first sentence on top, as it contains more phrases that refer to the relationship.
We have presented a method for explaining relationships between knowledge graph entities with human-readable descriptions. We first extract and enrich sentences that refer to an entity pair and then rank the sentences according to how well they describe the relationship. For ranking, we use learning to rank with a diverse set of features. Evaluation on a manually annotated dataset of "people" entities shows that our method significantly outperforms state-of-the-art sentence retrieval models for this task. Experimental results also show that using relationship-dependent models is beneficial.

In future work we aim to evaluate how our method performs on entities and relationships of any type and popularity, including tail entities and miscellaneous relationships. We also want to investigate moving beyond Wikipedia and extract candidate sentences from documents that are not related to the knowledge graph, such as web pages or news articles. Employing such documents also implies an investigation into more advanced co-reference resolution methods.

Our analysis showed that sentences may cover different relationships between entities or different aspects of a single relationship; we aim to account for such cases in follow-up work. Furthermore, sentences may contain unnecessary information for explaining the relation of interest between two entities. Especially when we want to show the obtained results to end users, we may need to apply further processing of the sentences to improve their quality and readability. We would like to explore sentence compression techniques to address this. Finally, relationships between entities have an inherent temporal nature and we aim to explore ways to explain entity relationships and their changes over time.

In this chapter, we studied the task of retrieving existing KG fact descriptions (explaining entity relationships). In the next chapter, we study how to generate such descriptions instead of retrieving existing ones.
Generating Knowledge Graph Fact Descriptions
In the previous chapter, we studied how to retrieve existing KG fact descriptions. However, a scenario where a description for a KG fact does not exist in the underlying text corpus is not unlikely. Therefore, in this chapter, we aim to answer RQ2: Given a KG fact, can we automatically generate a textual description of the fact in the absence of an existing description? As in the previous chapter, we use the term "entity relationship" to refer to a KG fact.

This chapter was published as [186].
Results displayed on a modern search engine result page (SERP) are sourced from multiple, heterogeneous sources. For so-called organic results it has been known for a long time that result snippets, i.e., brief descriptions explaining the result item and its relation to the query, positively influence the user experience [176]. In this chapter, we focus on generating descriptions for results sourced from another important ingredient of modern SERPs: knowledge graphs. Knowledge graphs (KGs) contain information about entities and their relationships. A large and diverse set of search applications utilize KGs to improve the user experience. For instance, web search engines try to identify KG entities in queries and augment their result pages with knowledge graph panels that provide contextual entity information [22, 106]. Such panels usually focus on a single entity and may include attributes of the entity and other, related entities. Entities can be connected with more than one relationship in a KG, however. For example, two actors might have appeared in the same film, be born in the same country and also be partners. Recent work has focused on finding relationships between a pair of entities and ranking the relationships by a predefined relevance criterion [61]. When using relationships in real-world search applications, with SERPs being the prime example, a crucial problem is that they are typically represented in a formal manner that is not suitable to present to an end user. Instead, human-readable descriptions that verbalize and provide context about entity relationships are more natural to use [66].
They can be used, e.g., for entity recommendations [21] or for KG-based timeline generation [9].

Descriptions of KG relationships themselves are usually not included in large-scale knowledge graphs, and previous work on automatically generating such descriptions has either relied on hand-crafted templates [9] or on external text corpora [185]. The main limitations of the former are that manually creating these templates is expensive and not generalizable, and thus it does not scale well. The latter approach is limited as the underlying text corpus may not contain descriptions for all instances of a certain relationship; it will not produce meaningful results for instances that do not appear in the text corpus. We propose a method that overcomes these limitations by automatically generating descriptions of KG entity relationships. Since there exist textual descriptions of a certain relationship for some relationship instances, we aim to use these descriptions to learn how the relationship is generally expressed in text and use this information to generate descriptions for other instances of the same relationship. Existing relationship descriptions are usually complex and tailored to the entities they discuss. Also, it is likely that the KG does not contain all the information included in a description. For example, the KG might not contain any information about the second part of the following sentence: "Catherine Zeta-Jones starred in the romantic comedy The Rebound, in which she played a 40-year-old mother of two . . . ". Nevertheless, descriptions of the same relationship share patterns that are specific to that relationship. Therefore, we first create sentence templates for a certain relationship and then, for a new relationship instance, we select appropriate templates, which we formulate as a ranking problem, and fill them with the appropriate entities to generate a description.

We propose a method that generates descriptions of entity relationships for a relationship instance, given a knowledge graph and a set of relationship instances coupled with their descriptions; we evaluate this method using both an automatic and a manual evaluation, and release the datasets used to the community (available at https://github.com/nickvosk/ecir2017-gder-dataset/). We show that we generate contextually rich relationship descriptions that are meant to be valid under the KG closed-world assumption. Moreover, our template-based method is naturally robust against KG incompleteness: in the absence of contextual information about the relationship instance, it can still generate a basic description.
Web search engine result pages (SERPs) can be augmented with information about the query and the documents from KGs in order to improve the user experience [106]. Also, SERPs can be augmented with textual descriptions and/or summaries, with a prominent example being snippet generation for web search [176, 178]. Closest to our setting, relationship descriptions have been studied in the context of providing evidence for entity recommendation for web search [185] and timeline generation for knowledge base entities [9]. Our task, generating a description of a relationship instance given a KG, is similar to event headline generation, where the task is to generate a short sentence that summarizes a specific event. Similar to our templates, the headline patterns constructed in [138] consist of words and entity slots. Our method differs, however, since relationships are more general than events and we thus have to deal with ambiguity at generation time when selecting which template matches a relationship instance.

Our task is also similar to concept-to-text generation, where the task is to generate a textual description given a set of database records [148]. In this context, our task is most closely related to [99, 159]. Saldanha et al. [159] use a template-based approach for generating company descriptions from Freebase. They construct sentence templates by replacing the entities in existing sentences by the Freebase relation of the entity to the company (e.g., ⟨company⟩ was founded by ⟨founder⟩). They add a preprocessing step where they remove phrases from the sentence that contain entities that are not connected to the company directly. At generation time, the authors replace the entity slots with the appropriate entities. Lebret et al. [99] propose a neural model to generate the first sentence of a person's biography in Wikipedia conditioned on Wikipedia infoboxes. Our setting is different from these papers since our generated descriptions are neither restricted to having entities that are directly connected to the subject entity in a KG, nor need they be contained in a Wikipedia infobox.

In this section we formally define the task of generating descriptions of entity relationships. Table 3.1 lists the main notation we use in this chapter.

Table 3.1: Glossary.
  K: knowledge graph
  E: set of entities
  P: set of predicates
  ⟨s, p, o⟩: knowledge graph triple with s, o ∈ E and p ∈ P
  v: word in vocabulary V
  a: sentence
  r_i: relationship instance of relationship r
  T_r: set of templates t ∈ T_r for relationship r
  R_t: set of relationship instances that support the template t
  X: set of pairs ⟨r_i', y'⟩, where y' is a textual description (a single sentence)
  C: mapping from an entity to an entity cluster
  K: entity dependency graph of a sentence
  G: compression graph
  P: set of paths in G
Let E be a set of entities and P a set of predicates. A knowledge graph K is a set of triples ⟨s, p, o⟩, where s, o ∈ E and p ∈ P. We follow the closed-world assumption for K and use Freebase as our knowledge graph [23, 128]. A sentence a is a sequence of words [v_1, . . . , v_n], where each v_i ∈ a is also in V. Non-overlapping sub-sequences of a might refer to a single entity e ∈ E.

A relationship r is a logical form in λ-calculus that consists of two lambda variables (x and y), at least one predicate, and zero or one existential variables [208]. Lambda variables can be substituted with Freebase entities, excluding compound value type (CVT) entities. (CVT entities are special entities in Freebase that are used to model attributes of relationships, e.g., date of marriage.) Existential variables, on the other hand, can be substituted with Freebase entities, including CVT entities. For example, the logical form of the relationship starsInFilm is λx.λy.∃z. actor_film(x, z) ∧ film_starring(z, y). Figure 3.1 shows the equivalent graphical representation of this relationship.

Figure 3.1: Graphical representation of the logical form of the starsInFilm relationship. Lambda variables are shown in circles and existential variables in rectangles.

A pair r_i = r⟨s, o⟩ is a relationship instance of r for entities s, o ∈ E if, by substituting x = s and y = o in r and by executing the resulting logical form in the knowledge graph K, we get at least one result. For example, starsInFilm(Brad Pitt, Troy) is a relationship instance of the starsInFilm relationship.

We assume that a relationship instance r_i can be expressed with a human-readable description (such as a single sentence) that contains mentions of both s and o and possibly other entities which may provide contextual information for the relationship r or the entities s and o. The task we address in this chapter is to generate such a textual description y of the relationship instance r_i given the KG. For this we leverage a set of pairs X, where each x ∈ X is a pair of r_i' and y', and y' is the description of r_i'. We describe how we obtain this set in Section 3.5.

We aim to generate descriptions that are valid (expressing a relationship that can be found in the knowledge graph under the closed-world assumption), natural (grammatically correct), and informative, i.e., not just replicating the formal relationship but providing additional contextual information where possible.

We conclude our task definition with an example. Assume that we are given the relationship instance starsInFilm(Brad Pitt, Troy). A possible description of this relationship instance is the following: "Brad Pitt appeared in the American epic adventure film Troy." This description not only contains mentions of the entities of the relationship instance and a verbalization of the relationship ("appeared in"), but also mentions of other entities that provide additional context. In particular, it contains mentions of Troy's type (Film), its genres (Epic, Adventure), and its country of origin.
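To make the notion of a relationship instance concrete, the following Python sketch executes the starsInFilm logical form against a toy knowledge graph stored as a set of triples. The KG contents and predicate names are illustrative assumptions, not the thesis implementation.

# A minimal sketch: a KG as a set of triples and a check that executes
#   lambda x. lambda y. exists z. actor_film(x, z) AND film_starring(z, y)
# against it. Entity and predicate names are illustrative only.
KG = {
    ("Brad Pitt", "actor_film", "cvt_1"),
    ("cvt_1", "film_starring", "Troy"),
    ("Brad Pitt", "actor_film", "cvt_2"),
    ("cvt_2", "film_starring", "12 Years a Slave"),
}

def objects(kg, subject, predicate):
    """All o such that <subject, predicate, o> is in the KG."""
    return {o for s, p, o in kg if s == subject and p == predicate}

def is_stars_in_film_instance(kg, x, y):
    """True iff some existential z satisfies actor_film(x, z) and film_starring(z, y)."""
    return any(y in objects(kg, z, "film_starring") for z in objects(kg, x, "actor_film"))

print(is_stars_in_film_instance(KG, "Brad Pitt", "Troy"))               # True
print(is_stars_in_film_instance(KG, "Brad Pitt", "Moonrise Kingdom"))   # False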
Table 3.2: Additional surface forms per entity type.
  Person: "he" or "she", person's surname
  Film: "the film"
  Music album: "the album"
  Music composition: "the song", "the track"
In this section we detail our method, which consists of three main steps. First, we enrich the description y' for each pair ⟨r_i', y'⟩ ∈ X with additional entities from the KG (Section 3.4.1). Second, we use K and the set X to create a set of sentence templates T_r for the relationship r (Section 3.4.2). Third, given a new relationship instance, we use T_r and K to generate a description (Section 3.4.3).

In this step we perform entity linking to enrich the description y' for each pair ⟨r_i', y'⟩ ∈ X with additional entities from the KG. This is done in order to facilitate the template creation step (Section 3.4.2). Each y' is a sentence that is about an entity e ∈ E, and in the context of this chapter we obtain these sentences from Wikipedia, as our KG provides explicit links to Wikipedia articles. Although Wikipedia articles already contain explicit links to other articles and thus entities, these links are quite sparse. Therefore, we apply an algorithm for entity linking similar to [185].

Since y' originates from a Wikipedia article that is about a specific entity, we restrict the candidate entities (i.e., the entities that we consider adding to enrich y') to e itself, the in-links and out-links of the article of e in the Wikipedia structure, and the one-hop and two-hop neighbors of e in the KG. We infer the surface forms of each entity using the Wikipedia link structure, as is common in entity linking [118], and we also use the aliases of each entity provided by the KG. In order to increase coverage for e, we enhance the set of surface forms of entity e using the rules in Table 3.2.

We iterate over the n-grams of the sentence that are not yet linked to an entity, in decreasing order of length; if an n-gram matches a surface form of a candidate entity, we link the n-gram to the entity. (We tag the sentences with POS tags and ignore unigram surface forms that are verbs.) If multiple entity candidates exist for a surface form, we rank the candidate entities by the number of entity neighbors they have in the sentence and select the top-ranked entity. Because of the very restricted set of candidate entities, the linking is usually unambiguous (with only one entity candidate per surface form). A manual evaluation of this algorithm on a held-out, random sample of 100 sentences in our dataset revealed an average of 93% precision and 85% recall per sentence.
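The greedy longest-match-first linking loop can be sketched as follows. The surface-form dictionary and the neighbor-count scorer are placeholders for the Wikipedia- and KG-derived resources described above; this is an illustrative sketch, not the thesis code.

# Sketch of the greedy n-gram entity linking described above (illustrative only).
def link_entities(tokens, surface_forms, neighbor_count):
    linked = {}                       # span (start, end) -> entity
    covered = set()                   # token positions already linked
    max_len = max(len(sf.split()) for sfs in surface_forms.values() for sf in sfs)
    for n in range(max_len, 0, -1):   # longest n-grams first
        for i in range(len(tokens) - n + 1):
            span = set(range(i, i + n))
            if covered & span:
                continue
            ngram = " ".join(tokens[i:i + n])
            candidates = [e for e, sfs in surface_forms.items() if ngram in sfs]
            if not candidates:
                continue
            best = max(candidates, key=neighbor_count)   # disambiguate by KG neighbors
            linked[(i, i + n)] = best
            covered |= span
    return linked

tokens = "Brad Pitt appeared in the drama film 12 Years a Slave".split()
surface_forms = {"Brad Pitt": {"Brad Pitt", "Pitt"},
                 "12 Years a Slave": {"12 Years a Slave", "the film"}}
print(link_entities(tokens, surface_forms, neighbor_count=lambda e: 1))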
Algorithm 1 Template creation
Require: A set X, the knowledge graph K
Ensure: A set of templates T_r
 1: X' ← []
 2: for ⟨r_i', y'⟩ ∈ X do
 3:   K ← BuildEntityDependencyGraph(y', K)
 4:   X'.append(⟨r_i', y', K⟩)
 5: C ← ClusterEntities(X')
 6: G ← BuildCompressionGraph(X', C)
 7: P ← FindValidPaths(G)
 8: T_r ← {}
 9: for p ∈ P do
10:   t ← ConstructTemplate(p, G, X')
11:   if t ≠ NULL then
12:     T_r.add(t)

In this step, we create a set of templates T_r for a relationship r using the KG and the set of ⟨r_i', y'⟩ pairs. The templates in T_r will be used in the next step to generate a novel description for the relationship instance r_i.

A sentence template t is a tuple (k, l, R_t), where (i) k = [u_1 u_2 . . . u_n] is a sequence such that ∀ u_i ∈ k: u_i ∈ V ∪ E_t, (ii) l is a logical form in λ-calculus that consists of all the lambda variables in E_t, at least one predicate and zero or more existential variables, and (iii) R_t is a set of relationship instances that support t.

The procedure we follow is outlined in Algorithm 1. First, we augment each ⟨r_i', y'⟩ pair with an entity dependency graph K in order to capture dependencies between entities in a sentence (lines 1–4). Next, we build a mapping C that maps each entity in each sentence to a single cluster id (line 5). This is done in order to facilitate the detection of useful patterns in the sentences, since each sentence describes a relationship for a particular entity pair. Then, we build a compression graph G (line 6) and use it to find valid paths P (line 7). Finally, for each path p ∈ P, we construct a template t and add it to the set of templates (lines 8–12). We now describe each procedure in Algorithm 1.

BuildEntityDependencyGraph(.)
In order to build the graph K for a sentence y', we retrieve all paths between each pair of entities mentioned in y' from the KG and add them to K. We only consider 1-hop paths and 2-hop paths that pass through a CVT entity. Figure 3.2 shows the entity dependency graph for an example sentence.

Figure 3.2: Entity dependency graph for the sentence "Brad Pitt appeared in the drama film 12 Years a Slave". Nodes represent entities and edge labels represent predicates (med is a CVT entity).

ClusterEntities(.)
In order to obtain C, we consider all x' = ⟨r_i', y', K⟩ ∈ X' and map two entities to the same cluster if they share at least one incoming or outgoing edge label in their corresponding entity dependency graphs K. For example, for the starsInFilm relationship, this procedure will create separate clusters for persons, films, dates and CVT entities.

BuildCompressionGraph(.)
In this step, we build a compression graph G = (V, E) using the sentence y' of each ⟨r_i', y', K⟩ ∈ X'. V is a set of nodes and E is a set of edges. We follow a similar procedure to [63], in which each node holds a list of ⟨sid, pid⟩ pairs, where sid is a sentence id and pid is the index of the word/entity in the sentence. In our case a node can be a word or an entity cluster. We map two words onto the same node if they have the same lowercase form and the same POS tag. We map two entities onto the same node if they have the same cluster id.
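A compression-graph construction along these lines can be sketched as follows. The sentences are assumed to be pre-tokenized, with POS tags and entity cluster ids already assigned; all names are illustrative, not the thesis implementation.

# Sketch: build compression-graph nodes over a set of annotated sentences.
# A token is either ("word", lowercase_form, pos_tag) or ("entity", cluster_id).
# Words sharing (lowercase form, POS) map to one node; entities sharing a cluster id
# map to one node. Each node stores <sentence id, position> pairs.
from collections import defaultdict

def build_compression_graph(sentences):
    nodes = defaultdict(list)          # node key -> list of (sid, pid)
    edges = set()                      # (node key, node key) for adjacent tokens
    for sid, tokens in enumerate(sentences):
        prev_key = None
        for pid, tok in enumerate(tokens):
            key = tok                  # the token tuples above already act as merge keys
            nodes[key].append((sid, pid))
            if prev_key is not None:
                edges.add((prev_key, key))
            prev_key = key
    return nodes, edges

sent1 = [("entity", "c1"), ("word", "appeared", "VBD"), ("word", "in", "IN"), ("entity", "c2")]
sent2 = [("entity", "c1"), ("word", "appeared", "VBD"), ("word", "in", "IN"),
         ("word", "the", "DT"), ("entity", "c3"), ("entity", "c4"), ("entity", "c2")]
nodes, edges = build_compression_graph([sent1, sent2])
print(len(nodes), len(edges))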
In order to find valid paths in the graph G, we set all the entity cluster nodes as valid start/end nodes and traverse G to find a set of paths P from a start to an end node. In order to build templates that are natural, we enforce the following constraints on the paths in P: (i) the path must contain a verb and (ii) the path must have been seen as a complete sentence at least once in the input sentences. For example, given the following sentences (the corresponding cluster id per entity is listed in brackets):
• y'_1: "Bruce Willis[c_1] appeared in Moonrise Kingdom[c_2]"
• y'_2: "Liam Neeson[c_1] appeared in the action[c_3] film[c_4] Taken[c_2]"
• y'_3: "Brad Pitt[c_1] appeared in the drama[c_3] film[c_4] 12 Years a Slave[c_2]"
we obtain the following valid paths by traversing the graph:
• p_1: "c_1 appeared in c_2"
• p_2: "c_1 appeared in the c_3 c_4 c_2"

ConstructTemplate(.)
Algorithm 2 outlines the procedure for constructing a template t from a path p.

Algorithm 2 ConstructTemplate(.)
Require: A path p, the compression graph G, a set X', parameters α, β
Ensure: A template t
  D_g ← []                               ▷ entity dependency graphs
  R_t ← []                               ▷ relationship instances that support the template
  for ⟨r_i', y', K⟩ ∈ X' do
    if IsSubsequence(p, y', G) then
      h ← GetSubsequence(p, y', G)       ▷ get the actual subsequence
      ⟨s, o⟩ ← r_i'                       ▷ subject/object of the relationship instance
      if ContainsLink(h, s) and ContainsLink(h, o) then
        D_g.append(K)
        R_t.append(r_i')
  if |R_t| < α then                       ▷ too few relationship instances
    return NULL
  l ← BuildLogicalForm(D_g, β)            ▷ aggregate the entity dependency graphs
  k ← ReplaceClusterIdsWithVariables(p)
  t = (k, l, R_t)

First, for each ⟨r_i', y', K⟩ ∈ X', we check whether the path p occurs as a (possibly non-contiguous) subsequence h in y', using the positional information of each node in p from G (for example, the path p_1 above is a subsequence of y'_2). If it does, we check whether h contains links to both the subject and the object of the relationship instance r_i'. If it does, we store the entity dependency graph and the relationship instance. Next, if the number of supporting instances is less than a parameter α, we consider the template to be invalid. Subsequently, we build the logical form l by aggregating the entity dependency graphs D_g. Entity nodes that were part of the path p become lambda variables (nodes constructed from subject and object entities have special identifiers). Entity nodes that were not part of the path p (CVT entities) become existential variables. We ignore edges appearing in fewer than |D_g| · β entity dependency graphs. Lastly, we replace the cluster ids in p with the corresponding lambda variables to obtain a sequence k.

Figure 3.3 shows the logical form of a template constructed using the example sentences y'_1, y'_2 and y'_3 and their corresponding instances in graphical form (β = 0.). Note that the edge "producer.film" has been eliminated, since it only appears in one out of the three instances.

Figure 3.3: Logical form of the template constructed using p_2 and y'_1, y'_2, y'_3 (with their corresponding relationship instances); k = "x_subj appeared in the x_1 x_2 x_obj". Lambda variables are shown in circles and existential variables in rectangles.

In this step we generate a novel description for a relationship instance r_i using the set of templates T_r and the knowledge graph K. This comes down to selecting the template from T_r that best describes the relationship instance r_i and filling it with the appropriate entities.

The procedure is as follows. First, we rank the templates in T_r for the relationship instance using a scoring function f(r_i, t). Subsequently, for each template t = (k, l, R_t) we replace the subject and object lambda variables in l to obtain l' = l[x_subj = s, x_obj = o]. We then query the knowledge graph K using l' and, if at least one instantiation of l' exists, we randomly pick one and replace all the entity variables in k with the entity names to generate the description y; otherwise we proceed to the next template. As an example, assume we are given the instance r_i = starsInFilm(Ryan Reynolds, Deadpool) and we consider the template shown in Figure 3.3. A possible instantiation of the template for this relationship instance results in the description "Ryan Reynolds appeared in the comedy film Deadpool". Note that there might be multiple instantiations (e.g., Deadpool is also a science fiction film) and selecting the optimal one depends on the application; we leave this for future work.
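The generation step can be sketched as follows: templates are scored, and the first one whose partially substituted logical form has an instantiation in the KG is filled with entity names. The scoring function, logical-form executor and surface-name lookup below are stand-ins for the components described above, so this is a sketch under those assumptions rather than the actual implementation.

# Sketch of description generation from ranked templates (illustrative only).
import random

def generate_description(r_i, templates, score, instantiations, name):
    """r_i: (subject, object); templates: list of (k, l, R_t);
    score(r_i, t): template scoring function f;
    instantiations(l, s, o): bindings of the remaining variables of l in the KG;
    name(entity): surface form used to fill a slot."""
    s, o = r_i
    for t in sorted(templates, key=lambda t: score(r_i, t), reverse=True):
        k, l, _ = t
        bindings = instantiations(l, s, o)       # execute l[x_subj = s, x_obj = o] on the KG
        if not bindings:
            continue                             # no instantiation: try the next template
        chosen = dict(random.choice(bindings), x_subj=s, x_obj=o)
        return " ".join(name(chosen[u]) if u in chosen else u for u in k)
    return None

template = (["x_subj", "appeared", "in", "the", "x_1", "x_2", "x_obj"], "l", None)
print(generate_description(("Ryan Reynolds", "Deadpool"), [template],
                           score=lambda r, t: 1.0,
                           instantiations=lambda l, s, o: [{"x_1": "comedy", "x_2": "film"}],
                           name=lambda e: e))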
The template scoring function f(r_i, t) returns a score for a relationship instance r_i and template t. As we want to generate descriptions that are valid under the closed-world assumption of the KG, we promote templates that are semantically closest to the relationship instance. For a new relationship instance r_i we extract binary features for each entity in r_i. Recall that r_i has two or more entities (subject s, object o and possibly a CVT entity z). For each entity e of r_i, we extract all triples ⟨e, p, e'⟩ from the KG K. We restrict the feature space by discriminating between entity attributes and entity relations depending on the predicate p, as in [107]. If the predicate p is an attribute (e.g., "gender"), we use the complete triple as a feature (e.g., ⟨s, gender, female⟩). If the predicate p is a relation (e.g., "date of death"), we only keep the subject and the predicate of the triple as a feature (e.g., ⟨e, person.date_of_death⟩). We also add a count feature for the relation predicates (e.g., ⟨s, person.children, 2⟩, i.e., a person has two children). We denote the resulting binary vector for r_i as vec(r_i). We obtain a vector vec(t) for template t by summing the vectors of all the instances R_t of t. We also compute a vector vec_tfidf(t), which is a TF.IDF-weighted version of vec(t), where IDF is calculated at the template level. Based on these ingredients, we define two scoring functions:
• Cosine: calculates the cosine similarity between the vectors vec(r_i) and vec_tfidf(t).
• Supervised: learns a scoring function using a supervised learning to rank algorithm. We treat r_i as a "query" and t as a "document."

We create training data for the supervised algorithm as follows. Recall that each r_i is coupled with a description y'. For each r_i, we assign a relevance label of 3 to templates that best match y (measured by the number of entities) and a relevance label of 2 to the rest of the templates that match y. In order to create "negative" training data, we sample templates that are dissimilar to the ones that match y in the following way. First, we calculate the average vector of all the templates that match y and build a distribution over templates based on the cosine distance from the average vector to each of the templates in T_r (excluding the ones that match y). Lastly, we sample at most the number of matching templates from the resulting distribution and assign them a relevance label of 1 (we ignore templates that have a cosine similarity to the average vector greater than 0.). For the supervised model we use the following features: each element/value pair in vec(r_i), the cosine similarity between the vectors vec(r_i) and vec_tfidf(t), the words in t, the number of entities in t and the size of R_t. We use LambdaMART [199] as the learning algorithm and optimize for NDCG@1.
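The Cosine scoring function boils down to a binary KG-derived feature vector for the relationship instance, a TF-IDF-weighted aggregate vector per template, and their cosine similarity. The sketch below reduces feature extraction to attribute/relation features over an in-memory set of triples; the predicate categorization and all names are illustrative assumptions.

# Sketch of the Cosine template-scoring function (illustrative, not the thesis code).
import math
from collections import Counter

ATTRIBUTE_PREDICATES = {"gender"}     # assumption: which predicates count as attributes

def instance_features(kg_triples, entities):
    """Binary features for a relationship instance: full triples for attributes,
    <entity, predicate> plus a count feature for relations."""
    feats = set()
    for e in entities:
        pred_counts = Counter()
        for s, p, o in kg_triples:
            if s != e:
                continue
            if p in ATTRIBUTE_PREDICATES:
                feats.add((s, p, o))
            else:
                feats.add((s, p))
                pred_counts[p] += 1
        feats.update((e, p, n) for p, n in pred_counts.items())   # count features
    return feats

def cosine_score(instance_feats, template_counts, idf):
    """Cosine between the binary instance vector and the TF-IDF template vector."""
    weighted = {f: c * idf.get(f, 0.0) for f, c in template_counts.items()}
    dot = sum(weighted.get(f, 0.0) for f in instance_feats)
    norm_i = math.sqrt(len(instance_feats))
    norm_t = math.sqrt(sum(w * w for w in weighted.values()))
    return dot / (norm_i * norm_t) if norm_i and norm_t else 0.0

kg = {("Brad Pitt", "gender", "male"), ("Brad Pitt", "children", "Maddox")}
inst = instance_features(kg, {"Brad Pitt"})
print(cosine_score(inst, template_counts=Counter(inst), idf={f: 1.0 for f in inst}))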
In this section we describe the experimental setup we designed to answer RQ2.

We use an English Wikipedia dump dated 5 February 2015 as our document corpus. We perform sentence splitting and POS tagging using the Stanford CoreNLP toolkit. We use a subset of the last version of Freebase as our KG [23]: all the triples in the people, film and music domains, as these are well represented in Freebase.

In order to create an evaluation dataset for our task, we first need a set of KG relationships. We rank the predicates in each domain by the number of instances and keep the 10 top-ranked predicates. We exclude trivial predicates such as "dateOfDeath". We then use the predicates to manually construct the logical forms of the relationships (see Figure 3.1 for an example). Second, we need a set of ⟨r_i', y'⟩ pairs for each relationship r, where r_i' = r⟨s', o'⟩ is an instance of relationship r, s' and o' are entities, and y' is a description of r_i'. To this end, for each relationship r, we randomly sample 12,000 relationship instances from the KG. For each relationship instance r_i', we pick the first sentence in the Wikipedia article of the subject entity s' that contains links to both s' and o'. If such a sentence does not exist, we proceed to the next instance. We manually inspected a subset of the sentences selected with this heuristic and the quality of the selected sentences was relatively good. Our final dataset contains 10 relationships and 90,058 ⟨r_i', y'⟩ instances in total, with 8,187 instances on average per relationship. We randomly select 80% of each relationship sub-dataset for training and 20% for testing.

We perform two types of evaluation: automatic and manual. For automatic evaluation we use METEOR [97], ROUGE-L [105] and BLEU-4 [133] as metrics. METEOR was originally proposed in the context of machine translation but has also been used in a task similar to ours [159]. ROUGE is a standard metric in summarization and BLEU is widely used in machine translation and generation. As is common in text generation [94], we also employ manual evaluation. We ask human annotators to annotate each output sentence on three dimensions: validity under the KG closed-world assumption (0 or 1), informativeness (1–5) and grammaticality (1–5). One human annotator (not one of the authors) annotated 11 generated sentences per relationship per system (440 sentences in total).
We compare four variations of our method. The variations differ in the way they rank templates for a given relationship instance. The first variation (Random) ranks the templates randomly. The second (Most-freq) ranks templates by the number of relationship instances that support the template. The third (Cosine) ranks templates based on the cosine similarity between the vectors of the relationship instance and the template (Section 3.4.3). The fourth (Supervised) ranks templates using a learning to rank model (Section 3.4.3), for which we use LambdaMART with the default number of trees (1000); for this method we use 20% of the training data as validation data, and the same test data is used for all methods. We set α = 20 and β = 0. (Section 3.4.3). We indicate a significant improvement in performance over Random with ▲ (paired two-tailed t-test).
In this section we describe our experimental results. We compare all methods discussed previously, using the automatic and manual setups, respectively.

Table 3.3: Automatic evaluation results, averaged per relationship. ▲ denotes a significant improvement over Random.
  Method       BLEU    METEOR   ROUGE
  Random       1.14    16.56    24.13
  Most-freq    0.13    13.99    21.96
  Cosine       1.76    ▲        ▲
  Supervised   ▲       ▲        ▲

Table 3.3 shows the automatic evaluation results. We observe that Supervised and Cosine outperform Random and Most-freq on all metrics. This is expected, since the former two try to capture the semantic similarity between a relationship instance and a template. Although Supervised consistently outperforms Cosine, the differences between Cosine and Supervised are not significant.

We also observe that the scores for the automatic measures are relatively low. This is for two reasons: (i) we generally generate much shorter sentences than the reference sentence, as not all information that appears in the reference sentence is represented in the KG, and (ii) since the reference sentences are extracted automatically, some of the reference sentences describe a minor aspect of the relationship or do not discuss the relationship at all.
Table 3.4: Manual evaluation results, averaged per relationship. ▲ denotes a significant improvement over Random.
  Method       Validity   Informativeness   Grammaticality
  Random       0.4545     1.98              3.67
  Most-freq    0.5000     1.60              3.62
  Cosine       0.5636     ▲
  Supervised   0.5818 ▲   ▲

Table 3.4 shows the results for the manual evaluation. The results follow a similar trend as in the automatic evaluation; Supervised and Cosine outperform Random and Most-freq on all metrics. Supervised significantly outperforms Random in terms of validity and informativeness. The differences between Cosine and Supervised are not significant.
We have also examined specific examples and identify cases where the best-performing approach (Supervised) succeeds or fails. In terms of validity, it succeeds in matching attributes of the relationship instance and the template. E.g., in the context of the relationship parentOf, it correctly figures out what the genders of the entities are and the semantically valid expression of the relationship between them, often better than Cosine, as illustrated by the following example:

(Supervised) "Emperor Francis I (1708 - 1765) was the father of Emperor Leopold II" (VALID)
(Cosine) "Emperor Francis I was the son of Emperor Leopold II" (INVALID)

Supervised benefits from training a model that combines multiple features, such as the template words with attributes of the relationship instance, to describe, for instance, whether the relationship is still ongoing or not. One of the main cases where Supervised fails is in ranking a relationship instance in a temporal dimension with regard to other relationship instances, as illustrated by the following example for the childOf relationship:

"Thomas Howard was the second son of Henry Howard and Frances de Vere." (INVALID: Thomas Howard was the first son of Henry Howard)

The fact that our best-performing approach (Supervised) has a relatively low validity score (0.5818) shows that there is room for improvement in capturing the semantic similarity between a relationship instance and a template.

In terms of informativeness, Supervised succeeds in offering contextual information about the relationship instance, such as dates, locations, occupations and film genres. The fact that informativeness scores are relatively low is because they are dependent on validity: when a generated sentence was assigned a validity score of 0, it was also assigned an informativeness score of just 1.

Grammaticality scores are high for all the systems, with no significant differences. This is expected as the templates were generated using the same procedure for all the compared systems. Mainly, grammaticality is harmed when some entities in the generated sentence have the wrong surface form (e.g., 'Britain', 'British'), which is not surprising as we do simple surface realization (deciding which surface form of the entity best fits the generated sentence) and only use the entity names as surface forms.
We have addressed the problem of generating descriptions of entity relationships from KGs. We have introduced a method that first creates sentence templates for a specific relationship and then, for a new relationship instance, generates a novel description by selecting the best template and filling the template slots with the appropriate entities from the KG. We have experimented with different scoring functions for ranking templates for a relationship instance and performed an automatic and a manual evaluation. When using information about the relationship instance and the template taken from the KG, both automatic and manual evaluation outcomes improve. A supervised method that uses both KG features and other template features (template words, number of entities) consistently outperforms an unsupervised method on all automatic evaluation metrics and also in terms of validity and informativeness.

As to future work, our error analysis showed that we need more sophisticated modeling for capturing the semantic similarity between a relationship instance and a template, especially for capturing temporal dimensions that also involve other relationship instances. We also want to explore more sophisticated methods for selecting the correct surface form for an entity to improve grammaticality. Finally, we aim to evaluate our method on generating descriptions for less popular KG relationships.

In this chapter, we studied the task of generating KG fact (entity relationship) descriptions. In the next chapter, we move on to study how to contextualize KG facts using other, related KG facts.
Contextualizing Knowledge Graph Facts
In Chapters 2 and 3, we studied how to retrieve and generate descriptions of knowledge graph (KG) facts. KG fact descriptions often contain mentions of other, related KG facts that are not trivial to find given the large size of KGs. In this chapter, we address RQ3: Can we contextualize a KG query fact by retrieving other, related KG facts?

This chapter was published as [187].
Knowledge graphs (KGs) have become essential for applications such as search, query understanding, recommendation and question answering because they provide a unified view of real-world entities and the facts (i.e., relationships) that hold between them [21, 22, 120, 208]. For example, KGs are increasingly being used to provide direct answers to user queries [208], or to construct so-called entity cards that provide useful information about the entity identified in the query. Recent work [25, 75] suggests that search engine users find entity cards useful and engage with them when they contain information that is relevant to their search task, for instance in the form of a set of recommended entities and facts that are related to the query [21]. Previous work has focused on augmenting entity cards with facts that are centered around, i.e., one hop away from, the main entity of the query [75].

However, oftentimes a user is interested in KG facts that by definition involve more than one entity (e.g., "Who founded Microsoft?" → "Bill Gates"). In such cases, we can exploit the richness of the KG by providing query-specific additional facts that increase the user's understanding of the fact as a whole, and that are not necessarily centered around only one of the entities. Additional relevant facts for the running example would include Bill Gates' profession, Microsoft's founding date, its main industry and its co-founder Paul Allen (see Figure 4.1). In this case, Bill Gates' personal life is less relevant to the fact that he founded Microsoft.

Figure 4.1: A Freebase subgraph that consists of facts relevant to the query fact founderOf(Bill Gates, Microsoft).

Query-specific relevant facts can also be used in other applications to enrich the user experience. For instance, they can be used to increase the utility of KG question answering (QA) systems that currently only return a single fact as an answer to a natural language question [15, 208]. Beyond QA, systems that focus on automatically generating natural language from KG facts [99] would also benefit from query-specific relevant facts, which can make the generated text more natural and human-like. This becomes even more important for KG facts that involve tail entities, for which natural language text might not exist for training [186].

In this chapter, we address the task of KG fact contextualization, that is, given a KG fact that consists of two entities and a relation that connects them, retrieve additional facts from the KG that are relevant to that fact. This task is analogous to ad-hoc retrieval: (i) the "query" is a KG fact, (ii) the "documents" are other facts in the KG that are in the neighborhood of the "query". We propose a neural fact contextualization method (NFCM), a method that first generates a set of candidate facts that are part of {1, 2}-hop paths from the entities of the main fact. NFCM then ranks the candidate facts by how relevant they are for contextualizing the main fact. We estimate our learning to rank model using supervised data. The ranking model combines (i) features we automatically learn from data and (ii) those that represent the query-candidate facts with a set of hand-crafted features we devised or adjusted for this task. Due to the size and heterogeneous nature of KGs, i.e., the large number of entities and relationship types, we turn to distant supervision to gather training data. Using another, human-verified test collection we gauge the performance of our proposed method and compare it with several baselines.

We sum up our contributions as follows.
• We introduce the task of KG fact contextualization, where the goal is to, given a fact that consists of two entities and a relationship that connects them, rank other facts from a KG that are relevant to that fact.
• We propose NFCM, a method to solve KG fact contextualization using distant supervision and learning to rank. Our results show that: (i) distant supervision is an effective means for gathering training data for this task and (ii) a neural learning to rank model that is trained end-to-end outperforms several baselines on a human-curated evaluation set.
• We provide a detailed result analysis and insights into the nature of our task.

The remainder of this chapter is organized as follows. We first provide a definition of our task in Section 4.2 and then introduce our method in Section 4.3. We describe our experimental setup and detail our results and analyses in Sections 4.4 and 4.5, respectively. We conclude with an overview of related work and an outlook on future directions.

Figure 4.2: KG subgraph that consists of three facts: bornIn⟨Barack Obama, Hawaii⟩, spouseOf⟨Barack Obama, Michelle Obama⟩ and marriageDate⟨M1, 1992-10⟩. M1 is a CVT entity. Note that the third fact is an attribute of the second fact.

In this section we provide background definitions and formally define the task of KG fact contextualization.
Let E = E_n ∪ E_c be a set of entities, where E_n and E_c are disjoint sets of non-CVT and CVT entities, respectively. (Compound Value Type (CVT) entities are special entities frequently used in KGs such as Freebase and Wikidata to model fact attributes; see Figure 4.2 for an example.) Furthermore, let P be a set of predicates. A knowledge graph K is a set of triples ⟨s, p, o⟩, where s, o ∈ E and p ∈ P. By viewing each triple in K as a labelled directed edge, we can interpret K as a labelled directed graph. We use Freebase as our knowledge graph [23, 128].

A path in K is a non-empty sequence ⟨s_1, p_1, t_1⟩, . . . , ⟨s_m, p_m, t_m⟩ of triples from K such that t_i = s_{i+1} for each i ∈ [1, m − 1].

We define a fact as a path in K that either: (i) consists of 1 triple, with s_1 ∈ E and t_1 ∈ E_n (i.e., s_1 may be a CVT entity), or (ii) consists of 2 triples, with s_1, t_2 ∈ E_n and t_1 = s_2 ∈ E_c (i.e., t_1 = s_2 must be a CVT entity). A fact of type (i) can be an attribute of a fact of type (ii) iff they have a common CVT entity (see Figure 4.2 for an example).

Let R be a set of relationships, where a relationship r ∈ R is a label for a set of facts that share the same predicates but differ in at least one entity. For example, spouseOf is the label of the fact depicted in the top part of Figure 4.2 and consists of two triples. Our definition of a relationship corresponds to direct relationships between entities, i.e., one-hop paths or two-hop paths through a CVT entity. For the remainder of this chapter, we refer to a specific fact f as r⟨s, t⟩, where r ∈ R and s, t ∈ E.
Algorithm 3 Fact enumeration for a given query fact f_q
Require: A query fact f_q = r⟨s, t⟩
Ensure: A set of candidate facts F
  F ← {}
  for e ∈ {s, t} do
    for n_1 ∈ GetOutNeighbors(e) + GetInNeighbors(e) do
      F.addAll(GetFacts(e, n_1))
      if IsClassOrType(n_1) then
        continue
      for n_2 ∈ GetOutNeighbors(n_1) do
        F.addAll(GetFacts(n_1, n_2))
      for n_2 ∈ GetInNeighbors(n_1) do
        F.addAll(GetFacts(n_2, n_1))
  return F

Given a query fact f_q and a KG K, we aim to find a set of other, relevant facts from K. Specifically, we want to enumerate and rank a set of candidate facts F = {f_c : f_c ⊆ K, f_c ≠ f_q} based on their relevance to f_q. A candidate fact f_c is relevant to the query fact f_q if it provides useful and contextual information. Figure 4.1 shows an example part of our KG that is relevant to the query fact founderOf⟨Bill Gates, Microsoft⟩. Note that a candidate fact does not have to be directly connected to both entities of the query fact to be relevant, e.g., profession⟨Paul Allen, Programmer⟩. Similarly, a fact can be related to one or more entities of the relationship instance, e.g., parentOf⟨Bill Gates, Jennifer Katharine Gates⟩, but not provide any context, thus being considered irrelevant.
In this section we describe our proposed neural fact contextualization method (NFCM), which works in two steps. First, given a query fact f_q, we enumerate a set of candidate facts F = {f_c : f_c ⊆ K} (see Section 4.3.1). Second, we rank the facts in F by relevance to f_q to obtain a final ranked list F' using a supervised learning to rank model (see Section 4.3.2). We describe how we use distant supervision to automatically gather the required annotations to train the supervised learning to rank model in Section 4.4.3.

In this section we describe how we obtain the set of candidate facts F from K given a query fact f_q = r⟨s, t⟩. Because of the large size of real-world KGs (which can easily contain upwards of 50 million entities and 3 billion facts [134]), it is computationally infeasible to add all possible facts of K to F. Therefore, we limit F to the set of facts that are in the broader neighborhood of the two entities s and t. Intuitively, facts that are further away from the two entities of the query fact are less likely to be relevant.

The procedure we follow is outlined in Algorithm 3. This algorithm enumerates the candidate facts for f_q = r⟨s, t⟩ that are at most 2 hops away from either s or t. Three exceptions are made to this rule: (i) CVT entities are not counted as hops, (ii) we do not include f_q in F as it is trivial, and (iii) to reduce the search space, we do not expand intermediate neighbors that represent an entity class or a type (e.g., "actor") as these can have millions of neighbors. Figure 4.3 shows an example graph with a subset of the facts that we enumerate for the query fact spouseOf⟨Bill Gates, Melinda Gates⟩ using Algorithm 3.

Figure 4.3: Graph with a subset of the facts that are enumerated for the query fact spouseOf(Bill Gates, Melinda Gates). The entities of the query fact are shaded.
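Algorithm 3 translates into a straightforward neighborhood walk. The sketch below assumes a KG exposed as a set of directed, labelled triples, and treats the fact construction and CVT handling as an abstracted get_facts helper; all names are illustrative placeholders for the actual KG back-end.

# Sketch of Algorithm 3: enumerate candidate facts at most two hops away from either
# entity of the query fact (the query fact itself would be excluded in the full method).
def out_neighbors(kg, e):
    return {o for s, _, o in kg if s == e}

def in_neighbors(kg, e):
    return {s for s, _, o in kg if o == e}

def enumerate_candidates(kg, s, t, get_facts, is_class_or_type):
    candidates = set()
    for e in (s, t):
        for n1 in out_neighbors(kg, e) | in_neighbors(kg, e):
            candidates |= get_facts(kg, e, n1)
            if is_class_or_type(n1):          # do not expand class/type nodes
                continue
            for n2 in out_neighbors(kg, n1):
                candidates |= get_facts(kg, n1, n2)
            for n2 in in_neighbors(kg, n1):
                candidates |= get_facts(kg, n2, n1)
    return candidates

kg = {("Bill Gates", "founderOf", "Microsoft"),
      ("Paul Allen", "founderOf", "Microsoft"),
      ("Paul Allen", "profession", "Programmer")}
facts = enumerate_candidates(
    kg, "Bill Gates", "Melinda Gates",
    get_facts=lambda kg, a, b: {(s, p, o) for s, p, o in kg if {s, o} == {a, b}},
    is_class_or_type=lambda n: False)
print(facts)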
Next, we describe how we rank the set of enumerated candidate facts F with respect to their relevance to the query fact f_q = r⟨s, t⟩. The overall methodology is as follows. For each candidate fact f_c ∈ F, we create a pair (f_q, f_c), an analog to a query-document pair, and score it using a function u : (f_q, f_c) → [0, 1] (higher values indicate higher relevance). We then obtain a ranked list of facts F' by sorting the facts in F based on their score.

We begin by describing the training procedure we follow and continue with the network architecture we use for learning our scoring function u.

Learning procedure
We train a network that learns the scoring function u(f_q, f_c) end-to-end, in mini-batches, using stochastic gradient descent (we define the network architecture below). We optimize the model parameters using Adam [92]. During training we minimize a pairwise loss to learn the function u, while during inference we use the learned function u to score a query-candidate fact pair (f_q, f_c). This paradigm has been shown to outperform pointwise learning methods in ranking tasks, while keeping inference efficient [49]. Each batch B consists of query-candidate fact pairs (f_q, f_c) of a single query fact f_q. For constructing B for a query fact f_q, we use all pairs (f_q, f_c) that are labeled as relevant and sample k pairs (f_q, f_c) that are labeled as irrelevant. During training, we minimize the mean pairwise squared error between all pairs of (f_q, f_c) in B × B:

    L(B, θ) = (1 / |B|^2) \sum_{⟨x_1, x_2⟩ ∈ B × B} ([l(x_1) − l(x_2)] − [u(x_1) − u(x_2)])^2,    (4.1)

where x_1 = (f_q, f_c_1) and x_2 = (f_q, f_c_2) are query-candidate fact pairs in the set B × B, l(x) ∈ {0, 1} is the relevance label of a query-candidate fact pair x, |B| is the batch size, and θ are the parameters of the model, which we define below.
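Eq. 4.1 is easy to compute in a vectorized way; the NumPy sketch below operates on the scores and labels of one batch and is only a stand-in for the actual training loop, which the chapter implements with stochastic gradient descent and Adam.

# Sketch of the mean pairwise squared error of Eq. 4.1 over one batch (NumPy).
import numpy as np

def pairwise_squared_error(scores, labels):
    """scores: u(x) for every pair x in the batch; labels: l(x) in {0, 1}.
    Averages ([l(x1) - l(x2)] - [u(x1) - u(x2)])^2 over all pairs in B x B."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    label_diff = labels[:, None] - labels[None, :]    # l(x1) - l(x2)
    score_diff = scores[:, None] - scores[None, :]    # u(x1) - u(x2)
    return np.mean((label_diff - score_diff) ** 2)    # mean over |B|^2 pairs

print(pairwise_squared_error([0.9, 0.2, 0.4], [1, 0, 0]))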
Network architecture

Figure 4.4 shows the network architecture we designed for learning the scoring function u(f_q, f_c). We encode the query fact f_q in a vector v_q using an RNN. As we will explain further below, we do not model the entities in the facts independently, due to the large number of entities; instead, we model each entity as an aggregation of its types. Therefore, instead of modeling the candidate fact f_c in isolation and losing per-entity information, we first enumerate all the paths up to two hops away from both entities of the query fact f_q (s and t) to all the entities of the candidate fact f_c (s' and t'). Let A_s denote the set of paths from s to all the entities of f_c, and let A_t denote the set of paths from t to all the entities of f_c. For each A ∈ {A_s, A_t}, we first encode all the paths in A using an RNN, and then combine the resulting encoded paths using the procedure described later in this section. We denote the vectors obtained from the above procedure for A_s and A_t as v_as and v_at, respectively. Then we obtain a vector v_a = [v_as, v_at], where [·, ·] denotes the concatenation operation (middle part of Figure 4.4). Note that we use the same RNN parameters for all the above operations. To further inform the scoring function, we design a set of hand-crafted features x (right-most part of Figure 4.4). We detail the hand-crafted features later in this section.

Finally, MLP-o([v_q, v_a, x]) is a multi-layer perceptron with α hidden layers of dimension β and one output layer that outputs u(f_q, f_c). We use a ReLU activation function in the hidden layers and a sigmoid activation function in the output layer. We vary the number of layers to capture non-linear interactions between the features in v_q, v_a, and x.

The remainder of this section describes how we encode a single fact, how we combine the representations of a set of facts, and, finally, the hand-crafted features.

Figure 4.4: Network architecture that learns a scoring function u(f_q, f_c). Given a query fact f_q = r⟨s, t⟩ and a candidate fact f_c = r'⟨a, b⟩ it outputs a score u(f_q, f_c). "f_q → f_c (from e)" is a label for the paths that start from an entity e of the query fact (either s or t) and end at an entity e' of the candidate fact f_c. Note that p is a variable in this figure, i.e., it might refer to different predicates.

Encoding a single fact
Recall from Section 4.2.1 that a fact f is a path in the KG. In order to model paths we turn to neural representation learning. More specifically, since paths are sequential by nature, we employ recurrent neural networks (RNNs) to encode them into a single vector [48, 71]. This type of modeling has proven successful in predicting missing links in KGs [48]. One restriction that we have in modeling such paths is the very large number of entities (millions of entities in our dataset) and, since learning an embedding for such a large number of entities requires prohibitively large amounts of memory and data, we represent each entity using an aggregation of its types [48]. Formally, let W_z denote a |Z| × d_z matrix, where each row is an embedding of an entity type z, |Z| is the number of entity types in our dataset and d_z is the entity type embedding dimension. Let W_p denote a |P| × d_p matrix, where each row is an embedding of a predicate p, |P| is the number of predicates in our dataset, and d_p is the predicate embedding dimension. In order to model inverse predicates in paths (e.g., Microsoft → founderOf⁻¹ → Paul Allen), we also define a |P| × d_p matrix W_{p⁻¹}, which corresponds to embeddings of the inverse of each predicate [71].

The procedure we follow for modeling a fact f is as follows. For simplicity of notation, in this section we denote a path as a sequence of alternating entities and predicates [s_1, p_1, . . . , t_m], instead of a sequence of triples as defined in Section 4.2.1. For each entity e ∈ f, we first retrieve the types of e in K. From these, we only keep the 7 most frequent types in K, which we denote as Z_e [48]. We then project each z ∈ Z_e to its corresponding type embedding w_z ∈ W_z and perform an element-wise sum over these embeddings to obtain an embedding w_e for entity e. We project each predicate p ∈ f to its corresponding embedding w_p (w_p ∈ W_{p⁻¹} if p is inverse, w_p ∈ W_p otherwise). The resulting projected sequence X_f = [w_{s_1}, w_{p_1}, . . . , w_{t_m}] is passed to a uni-directional recurrent neural network (RNN). The RNN has a sequence of hidden states [h_1, h_2, . . . , h_n], where h_i = tanh(W_hh h_{i−1} + W_xh x_i), and W_hh and W_xh are the parameters of the RNN. The RNN is initialized with zero state values. We use the last state of the RNN, h_n, as the representation of the fact f.

Table 4.1: Notation.
  NumTriples: number of triples in K, i.e., |{⟨s, p, t⟩ : ⟨s, p, t⟩ ∈ K}|
  TriplesPred(p): set of triples that have predicate p, i.e., {⟨s, p', t⟩ : ⟨s, p', t⟩ ∈ K, p' = p}
  TriplesEnt(e): set of triples that have entity e, i.e., {⟨s, p, t⟩ : ⟨s, p, t⟩ ∈ K, s = e ∨ t = e}
  TriplesSubj(e): set of triples that have entity e as subject, i.e., {⟨s, p, t⟩ : ⟨s, p, t⟩ ∈ K, s = e}
  TriplesObj(e): set of triples that have entity e as object, i.e., {⟨s, p, t⟩ : ⟨s, p, t⟩ ∈ K, t = e}
  UniqEnt(T): the unique set of entities in a set of triples T, i.e., ∪ {{s, t} : ⟨s, p, t⟩ ∈ T}
  Types(e): the set of types of entity e, i.e., {z : ⟨e, type, z⟩ ∈ K}
  Entities(f): the set of entities of fact f, i.e., ∪ {{s, t} : ⟨s, p, t⟩ ∈ f}
  Preds(f): the set of predicates of fact f, i.e., {p : ⟨s, p, t⟩ ∈ f}

Combining a set of facts
We obtain the representation of a set of encoded facts using element-wise summation of the encoded facts (vectors). We leave more elaborate methods for combining facts, such as attention mechanisms [12, 48], for future work.
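A minimal NumPy sketch of the path encoder and the set aggregation: entities are embedded as the element-wise sum of their type embeddings, the alternating entity/predicate sequence is fed through a simple tanh RNN, and a set of encoded facts is combined by element-wise summation. The dimensions, random initialization and the type/predicate inventory are illustrative assumptions; the actual weights are learned end-to-end.

# Sketch of the fact/path encoder (illustrative; not the trained model).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                            # shared embedding/hidden size (assumption)
type_emb = {"person": rng.normal(size=d), "film": rng.normal(size=d)}
pred_emb = {"actor.film": rng.normal(size=d), "performance.film": rng.normal(size=d)}
W_hh, W_xh = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

def embed_entity(types):
    """Entity representation: element-wise sum of its (most frequent) type embeddings."""
    return np.sum([type_emb[z] for z in types], axis=0)

def encode_path(items):
    """items alternate entity type-sets and predicate names; returns the last RNN state."""
    h = np.zeros(d)                               # zero initial state
    for it in items:
        x = embed_entity(it) if isinstance(it, (set, frozenset)) else pred_emb[it]
        h = np.tanh(W_hh @ h + W_xh @ x)
    return h

def combine(encoded_facts):
    return np.sum(encoded_facts, axis=0)          # element-wise sum over the set

fact = [{"person"}, "actor.film", {"film"}]
print(combine([encode_path(fact), encode_path(fact)]).shape)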
Hand-crafted features
Here, we detail the hand-crafted features x we designed or adjusted for this task. Table 4.1 lists the notation we use. We generate features based on feature templates that are divided into three groups: (i) those that give us a sense of the importance of a fact, (ii) those that give us a sense of the relevance of (f_q, f_c), and (iii) a set of miscellaneous features. Note that we use log-computations to avoid underflows.

(i) Fact importance. This group of feature templates gives us a sense of how important a fact f is when taking statistics of the knowledge graph K into account at a global level. Note that we calculate these features for both facts f_q and f_c. The first of these feature templates measures the normalized predicate frequency of each predicate p that participates in fact f (we also include the minimum, maximum and average value for each fact as metafeatures [24]). This is defined as the ratio of the size of the set of triples that have predicate p in the KG to the total number of triples:

    PredFreq(p) = |TriplesPred(p)| / NumTriples.    (4.2)

The second feature template is the normalized entity frequency for each entity e that participates in fact f (we also include the minimum, maximum and average value for each fact as metafeatures). This is defined as the ratio of the number of triples in which e occurs in the KG to the total number of triples in the KG:

    EntFreq(e) = |TriplesEnt(e)| / NumTriples.    (4.3)

The final feature template in this feature group is path informativeness, proposed by Pirrò [139], which we apply to both f_q and f_c (recall from Section 4.2.1 that a fact f is a path in the KG). This feature is an analog of TF.IDF and aims to estimate the importance of predicates for an entity. The informativeness of a path π is defined as follows [139]:

    I(π) = (1 / (2|π|)) \sum_{⟨s, p, t⟩ ∈ π} PFITF_out(p, s, K) + PFITF_in(p, t, K),    (4.4)

where PFITF_x(p, e, K) = PF_x(p, e) · ITF(p), x ∈ {in, out}, and ITF(p) is the inverse triple frequency of predicate p:

    ITF(p) = log( NumTriples / |TriplesPred(p)| ),

PF_out(p, e) is the outgoing predicate frequency of e when p is the predicate:

    PF_out(p, e) = |TriplesSubj(e) ∩ TriplesPred(p)| / |TriplesSubj(e)|,

and PF_in(p, e) is the incoming predicate frequency of e when p is the predicate:

    PF_in(p, e) = |TriplesObj(e) ∩ TriplesPred(p)| / |TriplesObj(e)|.
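The fact-importance features of Eqs. 4.2-4.4 reduce to simple counting over the triple set. The sketch below computes them for a small in-memory KG, a stand-in for the Freebase-scale statistics used in the chapter; the example triples are purely illustrative.

# Sketch of the fact-importance features (Eqs. 4.2-4.4) over a set of triples.
import math

def pred_freq(kg, p):                                                    # Eq. 4.2
    return sum(1 for _, q, _ in kg if q == p) / len(kg)

def ent_freq(kg, e):                                                     # Eq. 4.3
    return sum(1 for s, _, t in kg if e in (s, t)) / len(kg)

def itf(kg, p):
    return math.log(len(kg) / sum(1 for _, q, _ in kg if q == p))

def pf_out(kg, p, e):
    subj = [tr for tr in kg if tr[0] == e]
    return sum(1 for _, q, _ in subj if q == p) / len(subj) if subj else 0.0

def pf_in(kg, p, e):
    obj = [tr for tr in kg if tr[2] == e]
    return sum(1 for _, q, _ in obj if q == p) / len(obj) if obj else 0.0

def informativeness(kg, path):                                           # Eq. 4.4
    total = sum(pf_out(kg, p, s) * itf(kg, p) + pf_in(kg, p, t) * itf(kg, p)
                for s, p, t in path)
    return total / (2 * len(path))

kg = [("Bill Gates", "founderOf", "Microsoft"),
      ("Paul Allen", "founderOf", "Microsoft"),
      ("Bill Gates", "spouse", "Melinda Gates")]
path = [("Bill Gates", "founderOf", "Microsoft")]
print(pred_freq(kg, "founderOf"), ent_freq(kg, "Bill Gates"), informativeness(kg, path))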
(ii) Relevance

This group of feature templates gives us a signal on the relevance of a candidate fact f_c w.r.t. the query fact f_q. The first of these feature templates measures entity similarity for each pair (e_1, e_2) ∈ Entities(f_q) × Entities(f_c) (we also include the minimum, maximum and average entity similarity as metafeatures). We measure entity similarity using type-based Jaccard similarity:

    EntTypeSim(e_1, e_2) = JaccardSim(Types(e_1), Types(e_2)).    (4.5)

The next feature template in the relevance category is entity distance, which allows us to reason about the distance of two entities (e_1, e_2) ∈ Entities(f_q) × Entities(f_c) (we also include the minimum, maximum and average entity distance as metafeatures). This feature is defined as the length of the shortest path between e_1 and e_2 in K. The intuition is that we can get a signal for the relevance of f_c by measuring how "close" the entities in f_c are to the entities of f_q in the KG.

The next set of features measures predicate similarity between every pair of predicates (p_1, p_2) ∈ Preds(f_q) × Preds(f_c) (we also include the minimum, maximum and average predicate similarity as metafeatures). The intuition is that if f_c has predicates that are highly similar to the predicates in f_q, then f_c might be relevant to f_q. We measure predicate similarity in two ways. First, by measuring the co-occurrence of entities that participate in the predicates p_1 and p_2:

    PredCooccSim(p_1, p_2) = JaccardSim(UniqEnt(TriplesPred(p_1)), UniqEnt(TriplesPred(p_2))).    (4.6)
For instance, PredCooccSim(p_1, p_2) would be high for p_1 = starredIn and p_2 = directedBy. Second, by measuring the Jaccard similarity of the set of predicates in f_q with the set of predicates in f_c [139]:

    SetPredicatesJaccardSim(f_q, f_c) = JaccardSim(Preds(f_q), Preds(f_c)).    (4.7)

Finally, we add a binary feature that captures whether f_q and f_c have the same CVT entity, i.e., whether f_c is an attribute of f_q.

(iii) Miscellaneous

This set of features includes whether f_q has a CVT entity (same for f_c). We also include whether an entity is a date (for all entities of f_q and f_c). Finally, we include the concatenation of the predicates of f_q as a feature, using a one-hot encoding. A sketch of the Jaccard-based similarity features is given below.
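The following is a self-contained sketch of the Jaccard-based relevance features (Eqs. 4.5-4.7); the toy KG and the `types` mapping are illustrative assumptions:

    K = [
        ("TomHanks", "starredIn", "ForrestGump"),
        ("RobertZemeckis", "directedBy", "ForrestGump"),
        ("TomHanks", "starredIn", "BigMovie"),
    ]
    types = {"TomHanks": {"person", "actor"}, "RobertZemeckis": {"person", "director"}}

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def ent_type_sim(e1, e2):                            # Eq. 4.5
        return jaccard(types.get(e1, ()), types.get(e2, ()))

    def uniq_ent(triples):                               # UniqEnt(T)
        return {e for s, _, t in triples for e in (s, t)}

    def pred_coocc_sim(p1, p2):                          # Eq. 4.6
        t1 = [tr for tr in K if tr[1] == p1]
        t2 = [tr for tr in K if tr[1] == p2]
        return jaccard(uniq_ent(t1), uniq_ent(t2))

    def set_predicates_jaccard_sim(fq_preds, fc_preds):  # Eq. 4.7
        return jaccard(fq_preds, fc_preds)

    print(ent_type_sim("TomHanks", "RobertZemeckis"))    # type overlap on "person"
    print(pred_coocc_sim("starredIn", "directedBy"))     # entity co-occurrence overlap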
RQ3 , whichwe break down to the following research sub-questions:
RQ3.1
How does NFCM perform compared to a set of heuristic baselines on a crowdsourced dataset?
Table 4.2: Examples of relationships used in this work.
Domain     Relationship

People     spouseOf(person, person)
           parentOf(person, person)
           educatedAt(person, organization)
Business   founderOf(person, organization)
           boardMemberOf(person, organization)
           leaderOf(person, organization)
Film       starredIn(person, film)
           directorOf(person, film)
           producerOf(person, film)

RQ3.2
How does NFCM perform compared to a scoring function that scores candidate facts w.r.t. a query fact using the relevance labels gathered from distant supervision, on a crowdsourced dataset?
RQ3.3
Does NFCM benefit from both the handcrafted features and the automatically learned features?
RQ3.4
What is the per-relationship performance of NFCM? How does the number of instances per relationship affect the ranking performance?
We use the latest edition of Freebase as our knowledge graph [23]. We include Freebase relations from the following set of domains: People, Film, Music, Award, Government, Business, Organization, Education. Following previous work [122], we exclude triples that have an equivalent reversed triple.
Our dataset consists of query facts, candidate facts, and a relevance label for each query-candidate fact pair. In order to construct our evaluation dataset we need to start with a set of relationships. Given that most of our domains are people-centric, we obtain this set by extracting all relationships from Freebase that have an entity of type Person as one of the entities. In the end, we are left with 65 unique relationships in total (see Table 4.2 for example relationships). We then proceed to gather our set of query facts. For each relationship, we sample at most 2,000 query facts, provided that they have at least one relevant fact after applying the procedure described in Section 4.4.3. In total, the dataset contains 62,044 query facts (954.52 on average per relationship). After gathering query facts for each relationship, we enumerate candidate facts for each query fact using the procedure described in Section 4.3.1. Finally, we randomly split the dataset per relationship (70% of the query facts for training, 10% for validation, 20% for testing). Table 4.3 shows statistics of the resulting dataset.
Table 4.3: Statistics of the dataset gathered using distant supervision (see Section 4.4.3).
Part         Query facts   average   median   max.    min.
Training     44,632        1,420     741     9,937   2
Validation   4,983         1,424     749     9,796   3
Test         12,429        1,427     771     9,924   3

Note that we train and tune the fact ranking models with the training and validation sets in Table 4.3 respectively, using the automatically gathered relevance labels (see Section 4.4.3). The test set was only used for preliminary experiments (not reported) and for constructing our manually curated evaluation dataset (see Section 4.4.4). We describe how we automatically gather noisy relevance labels for our dataset in the next section.
Gathering relevance labels for our task is challenging due to the size and heterogeneous nature of KGs, i.e., they have a large number of facts and relationship types. Therefore, we turn to distant supervision [122] to gather relevance labels at scale. We choose to get a supervision signal from Wikipedia for the following reasons: (i) it has a high overlap of entities with the KG we use, and (ii) facts that are in KGs are usually expressed in Wikipedia articles alongside other, related facts. We filter Wikipedia to select articles whose main entity is in Freebase and whose entity type corresponds to one of the domains listed in Section 4.4.1. This results in a set of 1,743,191 Wikipedia articles.

The procedure we follow for gathering relevance labels given a query fact f_q and its set of candidate facts F is as follows. For a query fact f_q = r⟨s, t⟩, we focus on the Wikipedia article of entity s. First, as Wikipedia style guidelines dictate that only the first mention of another entity should be linked, we augment the articles with additional entity links using an entity linking method proposed in [186]. Next, we retain only segments of the Wikipedia article that contain references to t. Here, a segment refers to the sentence that has a reference to t, together with one sentence before and one sentence after it. For each such extracted segment, we assume that it expresses the fact f_q, which is a common assumption in gathering noisy training data for relation extraction [122]. From the segments, we then collect a set of other entities, O, that occur in the same sentence that mentions t; for computational efficiency, we enforce an upper bound on |O|. Then, we extract facts for all possible pairs of entities ⟨e_1, e_2⟩ ∈ {O ∪ {s, t}} × {O ∪ {s, t}}. If there is a single fact f_c in K that connects e_1 and e_2, we deem f_c relevant for f_q. However, if there are multiple facts connecting e_1 and e_2 in K, the mention of the fact in the specific segment is ambiguous and thus we do not deem any of these facts relevant [170]. The rest of the facts in F are deemed irrelevant for f_q.

The distribution of relevant/non-relevant labels in the distantly supervised data is heavily skewed: out of 87,998,956 facts in total, only 225,032 are deemed to be relevant (0.26%). This is expected, since the candidate fact enumeration step can generate thousands of facts for a certain query fact (see Section 4.3.1).

As a sanity check, we evaluate the performance of our approach to collecting distant supervision data by sampling 5 query facts for each relationship in our dataset. For these query facts, we perform manual annotations on the extracted candidate facts that were deemed relevant by the distant supervision procedure. We obtain an overall precision of 76% when comparing the relevance labels of the distant supervision against our manual annotations. This demonstrates the potential of our distant supervision strategy for creating training data.
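A compact sketch of the labeling loop described above. Segment extraction, entity linking and KG lookups are elaborate components; here they are reduced to simple input structures, and the bound on |O| (whose value is not given in the text) is set to an illustrative 5:

    def distant_supervision_labels(query_fact, candidate_facts, segments, kg_facts, max_other=5):
        """Label candidate facts for a query fact r<s, t> (a minimal sketch).

        segments: one set of co-occurring entities per sentence that mentions t
                  (stand-in for the Wikipedia segment extraction step).
        kg_facts: frozenset({e1, e2}) -> list of facts connecting e1 and e2 in K.
        max_other: illustrative bound on |O|; the actual value is not specified here.
        """
        s, t = query_fact
        relevant = set()
        for sentence_entities in segments:
            others = [e for e in sentence_entities if e not in (s, t)][:max_other]  # the set O
            pool = set(others) | {s, t}
            for e1 in pool:
                for e2 in pool:
                    if e1 == e2:
                        continue
                    connecting = kg_facts.get(frozenset({e1, e2}), [])
                    if len(connecting) == 1:      # unambiguous mention -> relevant
                        relevant.add(connecting[0])
        # Everything else in the candidate set is deemed irrelevant.
        return {f: (f in relevant) for f in candidate_facts}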
In order to evaluate the performance of NFCM on the KG fact contextualization task, we perform crowdsourcing to collect a human-curated evaluation dataset. The procedure we use to construct this evaluation dataset is as follows. First, for each of the 65 relationships we consider, we sample five query facts of the relationship from the test set (see Section 4.4.2). Since fact enumeration for a query fact can yield hundreds or thousands of facts (Section 4.3.1), it is infeasible to consider all the candidate facts for manual annotation. Therefore, we only include a candidate fact in the set of facts to be annotated if: (i) the candidate fact was deemed relevant by the automatic data gathering procedure (Section 4.4.3), or (ii) the candidate fact matches a fact pattern that is built using relevant facts that appear in at least 10% of the query facts of a certain relationship. An example fact pattern is parentOf⟨?, ?⟩, which would match the fact parentOf⟨Bill Gates, Jennifer Gates⟩.

We use the CrowdFlower platform, and ask the annotators to judge a candidate fact w.r.t. its relevance to a query fact. We provide the annotators with the following scenario (details omitted for brevity):
We are given a specific real-world fact, e.g., "Bill Gates is the founder of Microsoft", which we call the query fact. We are interested in writing a description of the query fact (a sentence or a small paragraph). The purpose of this assessment task is to identify other facts that could be included in a description of the query fact. Note that even though all facts presented for assessment will be accurate, not all will be relevant or equally important to the description of the main fact.
We ask the annotators to assess the relevance of a candidate fact on a 3-graded scale:

• very relevant: I would include the candidate fact in the description of the query fact; the candidate fact provides additional context to the query fact.
• somewhat relevant: I would include the candidate fact in the description of the query fact, but only if there is space.
• irrelevant: I would not include the candidate fact in the description of the query fact.

Alongside each query-candidate fact pair, we provide a set of extra facts that could possibly be used to decide on the relevance of a candidate fact.
Table 4.4: Relevance label distribution of the crowdsourced evaluation dataset.
Relevance           Non-attribute facts (%)   Attribute facts (%)
Irrelevant          60.86                     34.34
Somewhat relevant   34.49                     57.81
Very relevant       4.63                      7.84
These extra facts connect the entities in the query fact with the entities in the candidate fact. For example, if we present the annotators with the query fact spouseOf⟨Bill Gates, Melinda Gates⟩ and the candidate fact parentOf⟨Melinda Gates, Jennifer Gates⟩, we also show the fact parentOf⟨Bill Gates, Jennifer Gates⟩.

Each query-candidate fact pair is annotated by three annotators. We use majority voting to obtain the gold labels, breaking ties arbitrarily. The annotators receive a payment of 0.03 dollars per query-candidate fact pair. By following the crowdsourcing procedure described above, we obtain 28,281 fact judgments for 2,275 query facts (65 relations, 5 query facts each). Table 4.4 details the distribution of the relevance labels. One interesting observation is that facts that are attributes of other facts (see Section 4.2.1) tend to have relatively more relevant judgments than the ones that are not. This is expected, since some of them are attributes of the query fact (e.g., date of marriage for a spouseOf query fact). Finally, Fleiss' kappa is κ = 0.4307, which is considered moderate agreement. Note that all the results reported in Section 4.5 are on the manually curated dataset described here.

Evaluation metrics
We use the following standard retrieval evaluation metrics: MAP, NDCG@5, NDCG@10 and MRR. In the case of MAP and MRR, which expect binary labels, we consider "very relevant" and "somewhat relevant" as "relevant". We report on statistical significance with a paired two-tailed t-test.
To the best of our knowledge, there is no previously published method that addresses the task introduced in this chapter. Therefore, we devise a set of intuitive baselines that are used to show that our task is not trivial. We derive them by combining features we introduced in Section 4.3.2. We define these heuristic functions below:
• Fact informativeness (FI). Informativeness of the candidate fact f_c [139, Eq. 4.4]. This baseline is independent of f_q.
• Average predicate similarity (APS). Average predicate similarity of all pairs of predicates (p_1, p_2) ∈ Preds(f_q) × Preds(f_c) (Eq. 4.6). The intuition here is that f_c might be relevant to f_q if it contains predicates that are similar to the predicates of f_q.
• Average entity similarity (AES). Average entity similarity of all pairs of entities (e_1, e_2) ∈ Entities(f_q) × Entities(f_c) (Eq. 4.5). The assumption here is that f_c might be relevant to f_q if it contains entities that are similar to the entities of f_q.

The models described in Section 4.3.2 are implemented in TensorFlow v1.4.1 [1]. Table 4.5 lists the hyperparameters of NFCM. We tune the variable hyperparameters of this table on the validation set and optimize for NDCG@5.

Table 4.5: Hyperparameters of NFCM, tuned on the validation set.
Description                             Value(s)
k during training                       [1, 10, 100]
Learning rate                           [0.01, 0.001, 0.0001]
d_z: entity type embedding dimension    [64, 128, 256]
d_p: predicate embedding dimension      [64, 128, 256]
RNN cell size                           [64, 128, 256]
RNN cell dropout                        [0.0, 0.2]
α
β

Results and Discussion

In this section we discuss and analyze the results of our evaluation, answering the research questions listed in Section 4.4.

In our first experiment, we compare NFCM to a set of heuristic baselines we derived, to answer
RQ3.1. Table 4.6 shows the results. We observe that NFCM significantly outperforms the heuristic baselines by a large margin. We have also experimented with linear combinations of the above heuristics, but the performance does not improve over the individual ones and we therefore omit those results. We conclude that the task we define in this chapter is not trivial to solve and that simple heuristic functions are not sufficient.

In our second experiment we compare NFCM with distant supervision, to answer RQ3.2: how does NFCM perform compared to DistSup, a scoring function that scores candidate facts w.r.t. a query fact using the relevance labels gathered from distant supervision? The aim of this experiment is to investigate whether it is beneficial to learn ranking functions based on the signal gathered from distant supervision, and to see if we can improve performance over the latter. Table 4.7 shows the results. We observe that NFCM significantly outperforms DistSup on MAP, NDCG@5, and NDCG@10, and conclude that learning ranking functions (and in particular NFCM) based on the signal gathered from distant supervision is beneficial for this task. We also observe that NFCM performs significantly worse than DistSup on MRR. One possible reason for this is that NFCM returns facts that are indeed relevant but were not selected for annotation and thus assumed not relevant, since the data annotation procedure is biased towards DistSup (see Section 4.4.4). We aim to validate this hypothesis by conducting an additional user study in future work.
Table 4.6: Comparison between NFCM and the heuristic baselines. Significance is tested between NFCM and AES, the best performing baseline. We depict a significant improvement of NFCM over AES as ▲.

Method   MAP      NDCG@5   NDCG@10   MRR
FI       0.1222   0.0978   0.1149    0.1928
APS      0.2147   0.2175   0.2354    0.3760
AES      0.2950   0.3284   0.3391    0.5214
NFCM     ▲        ▲        ▲         ▲
Table 4.7: Comparison between NFCM and the distant supervision baseline. We depict a significant improvement of NFCM over DistSup as ▲ and a significant decrease as ▼.

Method    MAP      NDCG@5   NDCG@10   MRR
DistSup   0.2831   0.4489   0.3983
NFCM      ▲        ▲        ▲         ▼
Nevertheless, having an automatic method for KG fact contextualization trained with distant supervision becomes increasingly important for tail entities, for which we might only have information in the KG itself and not in external text corpora or other sources.

In order to answer RQ3.3, that is, whether NFCM benefits from both the hand-crafted features and the learned features, we perform an ablation study. Specifically, we test the following variations of NFCM that only modify the final layer of the architecture (see Section 4.3.2):

(i) LF: keeps the learned features (v_q and v_a) and ignores the hand-crafted features x.
(ii) HF: keeps the hand-crafted features (x) and ignores the learned features (v_q and v_a).

We tune the parameters of LF and HF on the validation set. Table 4.8 shows the results. First, we observe that NFCM outperforms HF by a large margin. Also, NFCM outperforms LF on all metrics (significantly so for MAP and NDCG@10), which means that by combining HF and LF we are able to obtain more relevant results at lower positions of the ranking. We aim to explore more sophisticated ways of combining LF and HF in future work. In order to verify whether LF and HF have complementary signals, we plot the per-query differences in NDCG@5 for LF and HF in Figure 4.5. We observe that the performance of LF and HF varies across query facts, confirming the hypothesis that LF and HF yield complementary signals.
Table 4.8: Comparison between the full NFCM model and its variations. Significance is tested between NFCM and its best variation (LF). We depict a significant improvement of NFCM over LF as ▲.

Method   MAP      NDCG@5   NDCG@10   MRR
HF       0.4620   0.4753   0.4989    0.7180
LF       0.4676   0.4993   0.5134    0.7647
NFCM     ▲                 ▲

Figure 4.5: Per-query-fact differences in NDCG@5 between the variation of NFCM that only uses the learned features (LF) and the variation that only uses the hand-crafted features (HF). A positive value indicates that LF performs better than HF on a query fact, and vice versa.

In order to answer RQ3.4, we conduct a performance analysis per relationship. Figure 4.6 shows the per-relationship NDCG@5 performance of NFCM; query fact scores are averaged per relationship. The relationship for which NFCM performs best is profession, with an NDCG@5 score of 0.8275. The relationship for which NFCM performs worst is awardNominated, with an NDCG@5 score of 0.1. Further analysis showed that awardNominated has a very large number of candidate facts on average, which might explain the poor performance on that relationship.

Furthermore, we investigate how the number of queries we have in the training set for each relationship affects the ranking performance. Figure 4.7 shows the results. From this figure we conclude that there is no clear relationship, and thus that NFCM is robust to the size of the training data for each relationship. Next, we analyse the performance of NFCM with respect to the number of candidates per query fact; Figure 4.8 shows the results. We observe that performance decreases when we have more candidate facts for a query, although not by a large margin, and that there does not seem to be a clear relationship between performance and the number of candidates to rank.
Related Work

The specific task we introduce in this chapter has not been addressed before, but there is related work in three main areas: entity relationship explanation, distant supervision, and fact ranking.

Figure 4.6: NDCG@5 for NFCM per relationship.

Explanations for relationships between pairs of entities can be provided in two ways: structurally, i.e., by providing paths or sub-graphs in a KG containing the entities, or textually, by ranking or generating text snippets that explain the connection.

Fang et al. [61] focus on explaining connections between entities by mining relationship explanation patterns from the KG. Their approach consists of two main components: explanation enumeration and explanation ranking. The first phase generates all patterns in the form of paths connecting the two entities in the KG, which are then combined to form explanations. In the final stage, the candidate explanations are ranked using notions of interestingness. Seufert et al. [164] propose a similar approach for entity sets. Their method focuses on explaining the connections between entity sets based on the concept of relatedness cores, i.e., dense subgraphs that have strong relations with both query sets. Pirrò [139] also provides explanations of the relation between entities, in terms of the top-k most informative paths between a query pair of entities; such paths are ranked and selected based on path informativeness and diversity, and pattern informativeness.

As to textual explanations for entity relationships, Voskarides et al. [185] focus on human-readable descriptions. They model the task as a learning to rank problem for sentences and employ a rich set of features. Huang et al. [81] build on the aforementioned work and propose a pairwise ranking model that leverages clickthrough data and uses a convolutional neural network architecture. While these approaches rank existing candidate explanations, Voskarides et al. [186] focus on generating explanations from scratch. They automatically identify the most common sentence templates for a particular relationship and, for each new relationship instance, these templates are ranked and instantiated using contextual information from the KG.

The work described above focuses on explaining entity relationships in KGs; no previous work has focused on ranking additional KG facts for an input entity relationship, as we do in this chapter.

Figure 4.7: Box plot showing NDCG@5 per number of training query facts of each relationship (binned). Each box shows the median score (orange line) and the upper and lower quartiles (maximum and minimum values shown outside each box).
When obtaining labeled data is expensive, training data can be generated automatically. Mintz et al. [122] introduce distant supervision for relation extraction: for a pair of entities that is connected by a KG relation, they treat all sentences that contain those entities in a text corpus as positive examples for that relation. Follow-up work on relation extraction addresses the issue of noise related to distant supervision: Alfonseca et al. [5], Riedel et al. [151], and Surdeanu et al. [172] refine the model by relaxing the assumptions in the original method or by modeling noisy labels. Beyond relation extraction, distant supervision has also been applied to other KG-related tasks. Ren et al. [150] introduce a joint approach to entity recognition and classification based on distant supervision. Ling and Weld [109] use distant supervision to automatically label data for fine-grained entity recognition.
In fact ranking, the goal is to rank a set of attributes with respect to an entity. Hasibi et al. [75] consider fact ranking as a component of entity summarization for entity cards. They approach fact ranking as a learning to rank problem, learning a ranking model based on importance, relevance, and other features relating a query and the facts. Aleman-Meza et al. [4] explore a similar task, but rank facts with respect to a pair of entities to discover paths that contain informative facts between the pair.

Graph matching involves matching two graphs and discovering the patterns of relationships between them to infer their similarity [34]. Although our task can be considered as comparing a small query subgraph (i.e., query triples) with a knowledge graph, the goal is different from graph matching, which mainly concerns aligning two graphs rather than enhancing one query graph.

Figure 4.8: Box plot showing NDCG@5 per number of candidate facts of each query fact (binned). Each box shows the median score (orange line) and the upper and lower quartiles (maximum and minimum values shown outside each box).

Our work differs from the work discussed above in the following major ways. First, we enrich a query fact between two entities by providing relevant additional facts in the context of the query fact, taking into account both the entities and the relation of the query fact. Second, we rank whole facts from the KG instead of just entities. Last, we provide a distant supervision framework for generating the training data, so as to make our approach scalable.
Conclusion

In this chapter, we introduced the knowledge graph fact contextualization task and proposed NFCM, a weakly-supervised method to address it. NFCM first generates a candidate set for a query fact by looking at 1- or 2-hop neighbors and then ranks the candidate facts using supervised machine learning. NFCM combines handcrafted features with features that are automatically identified using deep learning. We use distant supervision to boost the gathering of training data, by using a large entity-tagged text corpus that has a high overlap with the entities in the KG we use. Our experimental results show that (i) distant supervision is an effective means for gathering training data for this task, (ii) NFCM significantly outperforms several heuristic baselines for this task, and (iii) both the handcrafted and automatically-learned features contribute to the retrieval effectiveness of NFCM.
For future work, we aim to explore more sophisticated ways of combining hand-crafted with automatically learned features for ranking. Additionally, we want to explore other data sources for gathering training data, such as news articles and click logs. Finally, we want to explore methods for combining and presenting the ranked facts in search engine result pages in a diversified fashion.

This chapter concludes our study on the first part of the thesis, which focuses on how to make structured knowledge more accessible to the user. Next, in Chapter 5, we address a different research theme, namely improving interactive knowledge gathering.

Part II: Improving Interactive Knowledge Gathering

Query Resolution for Conversational Search with Limited Supervision
In the second part of this thesis, we move to the research theme of improving interactive knowledge gathering, and focus on conversational search. In this chapter, we aim to answer
RQ4: Can we use query resolution to identify relevant context and thereby improve retrieval in conversational search?
Conversational AI deals with developing dialogue systems that enable interactive knowledge gathering [64]. A large portion of work in this area has focused on building dialogue systems that are capable of engaging with the user through chit-chat [104] or helping the user complete small, well-specified tasks [135]. In order to improve the capability of such systems to engage in complex information seeking conversations [142], researchers have proposed information seeking tasks such as conversational question answering (QA) over simple contexts, such as a single-paragraph text [35, 146]. In contrast to conversational QA over simple contexts, in conversational search a user aims to interactively find information stored in a large document collection [45].

In this chapter, we study multi-turn passage retrieval as an instance of conversational search: given the conversation history (the previous turns) and the current turn query, we aim to retrieve passage-length texts that satisfy the user's underlying information need [46]. Here, the current turn query may be under-specified, and thus we need to take into account context from the conversation history to arrive at a better expression of the current turn query. That is, we need to perform query resolution: add missing context from the conversation history to the current turn query, if needed. An example of an under-specified query can be seen in Table 5.1, turn 4: "when was the album released?", whose resolution is "when was saosin's first album released?". In this example, context from all previous turns is needed to resolve the current turn query. (This chapter was published as [189].)

Table 5.1: Excerpt from an example conversational dialogue, with co-occurring terms in the conversation history and in the passage relevant to the current turn (turn 4).

Turn   Query
1      who formed saosin?
2      when was the band founded?
3      what was their first album?
4      when was the album released?
       resolved: when was saosin's first album released?

Relevant passage to turn 4: The original lineup for Saosin, consisting of Burchell, Shekoski, Kennedy and Green, was formed in the summer of 2003. On June 17, the band released their first commercial production, the EP Translating the Name.

Phenomena such as zero anaphora, topic change, and topic return are prominent in information seeking conversations [207]. These phenomena are not easy to capture with standard NLP tools (e.g., coreference resolution). Also, heuristics such as appending (part of) the conversation history to the current turn query are likely to lead to query drift [123]. Recent work has modeled query resolution as a sequence generation task [58, 96, 145]. Another way of implicitly solving query resolution is by query modeling [69, 181, 201], which has been studied and developed in the setup of session-based search [29, 30].

In this chapter, we propose to model query resolution for conversational search as a binary term classification task: for each term in the previous turns of the conversation, decide whether to add it to the current turn query or not. We propose QuReTeC (Query Resolution by Term Classification), a query resolution model based on bidirectional transformers [182] – more specifically BERT [50]. The model encodes the conversation history and the current turn query and uses a term classification layer to predict a binary label for each term in the conversation history. We integrate QuReTeC in a standard two-step cascade architecture that consists of an initial retrieval step and a reranking step. This is done by using the set of terms predicted as relevant by QuReTeC as query expansion terms.

Training QuReTeC requires binary labels for each term in the conversation history. One way to obtain such labels is to use human-curated gold standard query resolutions [58]. However, these labels might be cumbersome to obtain in practice. On the other hand, researchers and practitioners have been collecting general-purpose passage relevance labels, either by means of human annotations or by means of weak signals, e.g., clicks or mouse movements [88]. We propose a distant supervision method to automatically generate training data on the basis of such passage relevance labels. The key assumption is that passages that are relevant to the current turn share context with the conversation history that is missing from the current turn query. Table 5.1 illustrates this assumption: the relevant passage to turn 4 shares terms with the conversation history that are missing from the current turn query, and such terms in the conversation history can be marked as relevant for the current turn. (A relevant passage contains not only the answer to the question but also context and supporting facts that allow an algorithm or a human to reach this answer.)

Our main contributions can be summarized as follows:

1. We model the task of query resolution as a binary term classification task and propose to address it with a neural model based on bidirectional transformers, QuReTeC.
2. We propose a distant supervision approach that can use general-purpose passage relevance data to substantially reduce the amount of human-curated data required to train QuReTeC.
3. We experimentally show that when integrating the QuReTeC model in a multi-stage ranking architecture we significantly outperform baseline models. Also, we conduct extensive ablation studies and analyses to shed light on the workings of our query resolution model and its impact on retrieval performance.

Related work
Early studies on conversational search have focused on characterizing information seeking strategies and building interactive IR systems [16, 17, 43, 131]. Vtyurina et al. [191] investigated human behaviour in conversational systems through a user study and found that existing conversational assistants cannot be effectively used for conversational search with complex information needs. Radlinski and Craswell [143] present a theoretical framework for conversational search, which highlights the need for multi-turn interactions. Dalton et al. [46] organized the Conversational Assistance Track (CAsT) at TREC 2019. The goal of the track is to establish a concrete and standard collection of data with information needs, to make systems directly comparable. They released a multi-turn passage retrieval dataset annotated by experts, which we use to compare our method to the baseline methods.
Query resolution has been studied in the context of dialogue systems. Raghu et al. [145] develop a pipeline model for query resolution in dialogues as text generation. Kumar and Joshi [96] follow up on that work by using a sequence to sequence model combined with a retrieval model. However, both of these works rely on templates that are not available in our setting. More closely related to our work, Elgohary et al. [58] studied query resolution in the context of conversational QA over a single-paragraph text. They use a sequence to sequence model augmented with a copy mechanism, an attention mechanism and a coverage loss. They annotate part of the QuAC dataset [35] with gold standard query resolutions, on which they apply their model and obtain competitive performance. In contrast to all the aforementioned works that model query resolution as text generation, we model query resolution as binary term classification over the conversation history.
Query modeling has been used in session search, where the task is to retrieve documents for a given query by utilizing previous queries and user interactions with the retrieval system [29]. Guan et al. [69] extract substrings from the current and previous turn queries to construct a new query for the current turn. Yang et al. [201] propose a query change model that models both the edits between consecutive queries and the ranked list returned by the previous turn query. Van Gysel et al. [181] compare lexical matching session search approaches and find that naive methods based on term frequency weighting perform on par with specialized session search models. The methods described above are informed by studies of how users reformulate their queries and why [167], which, in principle, is different in nature from conversational search. For instance, in session search users tend to add query terms more often than they remove query terms, which is not the case in (spoken) conversational search. Another form of query modeling is query expansion. Pseudo-relevance feedback is a query expansion technique that first retrieves a set of documents that are assumed to be relevant to the query, and then selects terms from the retrieved documents to expand the query [2, 98, 130]. Note that pseudo-relevance feedback is fundamentally different from query resolution: in order to revise the query, the former relies on the top-ranked documents, while the latter only relies on the conversation history.
Distant supervision
Distant supervision can be used to obtain large amounts of noisy training data. One of its most successful applications is relation extraction, first proposed by Mintz et al. [122]. They take as input two entities and a relation between them, gather sentences where the two entities co-occur from a large text corpus, and treat those as positive examples for training a relation extraction system. Beyond relation extraction, distant supervision has also been used to automatically generate noisy training data for other tasks, such as named entity recognition [204], sentiment classification [152], knowledge graph fact contextualization [187] and dialogue response generation [149]. In our work, we follow the distant supervision paradigm to automatically generate training data for query resolution in conversational search by using query-passage relevance labels.
Multi-turn Passage Retrieval Pipeline

In this section we provide formal definitions and describe our multi-turn passage retrieval pipeline. Table 5.2 lists the notation used in this chapter.
Multi-turn passage ranking
Let [q_1, ..., q_{i−1}, q_i] be a sequence of conversational queries that share a common topic T. Let q_i be the current turn query and q_{1:i−1} be the conversation history. Given q_i and q_{1:i−1}, the task is to retrieve a ranked list of passages L from a passage collection D that satisfy the user's information need. (We follow the TREC CAsT setup and only take into account the previous queries q_{1:i−1}, but not the passages retrieved for them.)

In the multi-turn passage ranking task, the current turn query q_i is often underspecified due to phenomena such as zero anaphora, topic change, and topic return. Thus, context from the conversation history q_{1:i−1} must be taken into account to arrive at a better expression of the current turn query q_i. This challenge can be addressed by query resolution.

Table 5.2: Notation used in the chapter.

Name         Description
terms(x)     Set of terms in term sequence x
D            Passage collection
q_i          Query at the current turn i
q_{1:i−1}    Sequence of previous turn queries
q*_i         Gold standard resolution of q_i
E*_{q_i}     Gold standard resolution terms for q_i, see Eq. (5.2)
q̂_i          Predicted resolution of q_i
p*_{q_i}     A relevant passage for q_i

Figure 5.1: Illustration of our multi-turn passage retrieval pipeline (query resolution, initial retrieval, reranking) for three turns.

Query resolution
Given the conversation history q_{1:i−1} and the current turn query q_i, output a query q̂_i that includes both the existing information in q_i and the missing context of q_i that exists in the conversation history q_{1:i−1}.

Figure 5.1 illustrates our multi-turn passage retrieval pipeline. We use a two-step cascade ranking architecture [192], which we augment with a query resolution module (Section 5.4). First, the unsupervised initial retrieval step outputs the initial ranked list L_1 (Section 5.3.2). Second, the reranking step outputs the final ranked list L_2 (Section 5.3.2). Below we describe the two steps of the cascade ranking architecture.

Initial retrieval step
In this step we obtain the initial ranked list L_1 by scoring each passage p in the passage collection D with respect to the resolved query q̂_i using a lexical matching ranking function f_1. We use query likelihood (QL) with Dirichlet smoothing [212] as f_1, since it outperformed other ranking functions, such as BM25, in preliminary experiments on the TREC CAsT dataset.
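For reference, query likelihood with Dirichlet smoothing scores a passage by the smoothed log-likelihood of the query terms under the passage's language model. A minimal sketch, assuming whitespace-tokenized text and an in-memory collection model; the default µ = 2500 is illustrative (the value used in the thesis is not given here):

    import math
    from collections import Counter

    def ql_dirichlet_score(query_terms, passage_terms, collection_tf, collection_len, mu=2500):
        """score(q, p) = sum_t log( (tf(t, p) + mu * P(t|C)) / (|p| + mu) )."""
        tf = Counter(passage_terms)
        score = 0.0
        for t in query_terms:
            p_c = collection_tf.get(t, 0) / collection_len   # collection language model P(t|C)
            smoothed = (tf[t] + mu * p_c) / (len(passage_terms) + mu)
            if smoothed > 0:  # terms unseen in the whole collection contribute nothing
                score += math.log(smoothed)
        return score

    # Toy usage: score the second passage of a two-passage collection.
    collection = [["saosin", "band"], ["band", "album", "album"]]
    ctf = Counter(t for doc in collection for t in doc)
    print(ql_dirichlet_score(["album", "released"], collection[1], ctf, sum(ctf.values())))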
Reranking step

In this step, we rerank the list L_1 by scoring each passage p ∈ L_1 with a ranking function f_2 to obtain the final ranked list L_2. To construct f_2, we use rank fusion and combine the scores obtained by f_1 (used in the initial retrieval step) and a supervised neural ranker f_n. Next, we describe the neural ranker f_n.

Supervised neural ranker
We use BERT [50] as the neural ranker f_n, as it has been shown to achieve state-of-the-art performance in ad-hoc retrieval [112, 141, 203]. Also, BERT has been shown to prefer semantic matches [141], and can thereby be complementary to f_1, which is a lexical matching method. As is standard when using BERT for pairs of sequences, the input to the model is the concatenation of the (resolved) query and the passage, separated by special tokens.

Rank fusion

We design f_2 such that it combines lexical matching and semantic matching [132]. We use Reciprocal Rank Fusion (RRF) [41] to combine the score obtained by the lexical matching ranking function f_1 and the semantic matching supervised neural ranker f_n. We choose RRF because of its effectiveness in combining individual rankers in ad-hoc retrieval and because of its simplicity (it has only one hyperparameter). We define f_2 as the RRF of L_1 and L_n [41]:

    f_2(p) = Σ_{L′ ∈ {L_1, L_n}} 1 / (k + rank(p, L′)),    (5.1)

where rank(p, L′) is the rank of passage p in a ranked list L′, and k is a hyperparameter (we set k = 60 and do not tune it). We score each passage p in the initial ranked list L_1 with f_2 to obtain the final ranked list L_2.
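A minimal sketch of RRF over the two ranked lists (Eq. 5.1), with k = 60 as above; the passage identifiers are hypothetical:

    def reciprocal_rank_fusion(lists, k=60):
        """Fuse ranked lists of passage ids via RRF (Eq. 5.1).
        lists: iterable of ranked lists, best passage first (rank 1)."""
        scores = {}
        for ranked in lists:
            for rank, pid in enumerate(ranked, start=1):
                scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Example: fuse the lexical list L1 with the neural ranker's list Ln.
    L1 = ["p3", "p7", "p1"]
    Ln = ["p7", "p1", "p3"]
    L2 = reciprocal_rank_fusion([L1, Ln])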
Since developing specialized rerankers for the task at hand is not the focus of this work, we leave more sophisticated methods for choosing the neural ranker f_n, and for combining multiple rankers, as future work. In the next section, we describe our query resolution model, QuReTeC, which is the focus of this work.

Query Resolution

In this section we first describe how we model query resolution as term classification (Section 5.4.1), then present our query resolution model, QuReTeC (Section 5.4.2), and finally describe how we generate distant supervision labels for the model (Section 5.4.3).

Previous work has modeled query resolution as a sequence to sequence task [58, 96], where the source sequence is q_i and the target sequence is q*_i, a gold standard resolution of the current turn query q_i. For instance, the gold standard resolution of turn 4 in Table 5.1 is "when was saosin's first album released?". In contrast, we model query resolution as a term classification task: given the conversation history q_{1:i−1} and the current turn query q_i, output a binary label (relevant or non-relevant) for each term in q_{1:i−1}. Terms in the conversation history q_{1:i−1} that are tagged as relevant are appended to the current turn query q_i to form the predicted current turn query resolution q̂_i.

We define the set of relevant resolution terms E*_{q_i} as:

    E*_{q_i} = ( terms(q*_i) ∩ terms(q_{1:i−1}) ) \ terms(q_i),    (5.2)

where q*_i is a gold standard resolution of the current turn query q_i. Under this formulation, the set of relevant terms E*_{q_i} represents the missing context from the conversation history q_{1:i−1}. For instance, the set of gold standard resolution terms E*_{q_i} for turn 4 in Table 5.1 is {saosin, first}. Note that E*_{q_i} can be empty, either if q_i = q*_i, i.e., the current turn query does not need to be resolved, or if terms(q*_i) ∩ terms(q_{1:i−1}) is empty. In our experiments terms(q*_i) ∩ terms(q_{1:i−1}) ≈ terms(q*_i), and therefore almost all the gold standard resolution terms can be found in the conversation history.
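A minimal sketch of Eq. (5.2), assuming naive whitespace tokenization (the thesis applies lowercasing, lemmatization and stopword removal first; see the preprocessing details in Section 5.5):

    def resolution_terms(gold_resolution, history, current):
        """Gold standard resolution terms E*_{q_i} (Eq. 5.2):
        terms of q*_i that occur in the history but not in the current query."""
        terms = lambda q: set(q.lower().split())
        history_terms = set().union(*(terms(q) for q in history))
        return (terms(gold_resolution) & history_terms) - terms(current)

    # Example from Table 5.1 (turn 4).
    history = ["who formed saosin?", "when was the band founded?", "what was their first album?"]
    print(resolution_terms("when was saosin's first album released?", history,
                           "when was the album released?"))
    # -> {'first'} with naive tokenization; proper preprocessing also recovers 'saosin'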
In this section, we describe our query resolution model, QuReTeC. Figure 5.2a shows the model architecture. Each term in the input sequence is first encoded using bidirectional transformers [182] – more specifically BERT [50]. Then, a term classification layer takes each encoded term as input and outputs a score for each term. We use BERT as the encoder since it has been successfully applied to tasks similar to ours, such as named entity recognition and coreference resolution [50, 89, 108]. Next we describe the main parts of QuReTeC in detail: the input sequence, the BERT encoder and the term classification layer.

1. Input sequence.
The input sequence consists of all the terms in the queries of the previous turns q_{1:i−1} and the current turn q_i, concatenated and separated by special tokens (see Figure 5.2b).

Figure 5.2: (a) The QuReTeC model architecture. (b) Example input sequence and gold standard term labels (1: relevant, 0: non-relevant) for QuReTeC.
2. BERT encoder.
BERT first represents the input terms with WordPiece embeddings, using a 30K-token vocabulary. After applying multiple transformer blocks, BERT outputs an encoding for each term. We refer the interested reader to the original paper for a detailed description of BERT [50].
3. Term classification layer.
The term classification layer is applied on top of the representation of the first sub-token of each term [50]. It consists of a dropout layer, a linear layer and a sigmoid function, and outputs a scalar for each term. We mask out the output of the current turn query terms, since we only predict labels for the terms in the conversation history q_{1:i−1}.
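A minimal sketch of this architecture using the HuggingFace transformers library. The hyperparameters, the use of bert-base-uncased (the thesis uses bert-large-uncased), and the exact masking details are assumptions, not the thesis implementation:

    import torch
    from torch import nn
    from transformers import BertModel, BertTokenizerFast

    class QuReTeCSketch(nn.Module):
        """BERT encoder + per-term binary classification head (dropout, linear, sigmoid)."""
        def __init__(self, dropout=0.2):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            self.dropout = nn.Dropout(dropout)
            self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

        def forward(self, input_ids, attention_mask):
            hidden = self.bert(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
            logits = self.classifier(self.dropout(hidden)).squeeze(-1)
            return torch.sigmoid(logits)  # one relevance score per (sub-)token

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    # Conversation history and current query in one sequence; the tokenizer
    # inserts the [CLS]/[SEP] special tokens itself.
    enc = tokenizer("what was their first album?", "when was the album released?",
                    return_tensors="pt")
    scores = QuReTeCSketch()(enc["input_ids"], enc["attention_mask"])
    # In QuReTeC, only the score of the first sub-token of each history term is used;
    # outputs for the current turn query (and special tokens) are masked out.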
Experimental Setup

In this section we describe the setup of our experiments that aim to answer RQ4, which we break down into the following research sub-questions:
RQ4.1
How does the QuReTeC model perform compared to other state-of-the-art methods?
RQ4.2
Can we use distant supervision to reduce the amount of human-curated training data required to train QuReTeC?
RQ4.3
How does QuReTeC's performance vary depending on the turn of the conversation?

For all the research questions listed above we measure performance in both an intrinsic and an extrinsic sense. Intrinsic evaluation measures query resolution performance on term classification. Extrinsic evaluation measures retrieval performance at both the initial retrieval and the reranking steps.
Extrinsic evaluation – retrieval
The TREC CAsT dataset is a multi-turn passage retrieval dataset [46]. It is the only such dataset that is publicly available. Each topic consists of a sequence of queries. The topics are open-domain and diverse in terms of their information needs, and are curated manually to reflect information seeking conversational structure patterns. Later turn queries in a topic depend only on the previous turn queries, and not on the returned passages of the previous turns, which is a limitation of this dataset. Nonetheless, the dataset is sufficiently challenging for comparing automatic systems, as we will show in Section 5.6.1. Table 5.3 shows statistics of the dataset. The original dataset consists of 30 training and 50 evaluation topics. 20 of the 50 topics in the evaluation set were annotated for relevance by NIST assessors on a 5-point relevance scale. We use this set as the test set.

Table 5.3: TREC CAsT multi-turn passage retrieval dataset statistics: topics, queries, and labelled/relevant passages per topic and per query for each split.

Table 5.4: Query resolution dataset statistics. In the Split column, we indicate where the positive term labels originate from: either gold (gold standard resolutions) or distant (Section 5.4.3).

Intrinsic evaluation – query resolution
The original QuAC dataset [35] contains dialogues about a single Wikipedia article section regarding people (e.g., the early life of a singer). Each dialogue contains up to 12 questions and their corresponding answer spans in the section. It was constructed by asking two crowdworkers (a student and a teacher) to perform an interactive dialogue about a specific topic. Elgohary et al. [58] crowdsourced question resolutions for a subset of the original QuAC dataset [35]. All the questions in the dev and test splits of [58] have gold standard resolutions. We use the dev split for early stopping when training QuReTeC and evaluate on the test set. When training with gold supervision (gold standard query resolutions), we use the train split from [58], which is a subset of the train split of [35]; all the questions therein have gold standard resolutions. Since QuAC is not a passage retrieval collection, in order to obtain distant supervision labels (Section 5.4.3), we use a window of 50 characters around the answer span to extract passage-length texts, and we treat the extracted passage as the relevant passage. When training with distant labels, we use the part of the train split of [35] that does not have gold standard resolutions.

The TREC CAsT dataset [46] also contains gold standard query resolutions for its test set. However, it is too small to train a supervised query resolution model, and we only use it as a complementary test set. (The Washington Post collection was also part of the original TREC CAsT collection, but it was excluded from the official TREC evaluation process and we therefore do not use it.)

The two query resolution datasets described above have three main differences. First, the conversations in QuAC are centered around a single Wikipedia article section about people, whereas the conversations in CAsT are centered around an arbitrary topic. Second, the answers to the QuAC questions are spans in the Wikipedia section, whereas the CAsT queries have relevant passages that originate from different Web resources besides Wikipedia. Third, later turns in QuAC do depend on the answers in previous turns, while in CAsT they do not (Section 5.3.1). Interestingly, in Section 5.6.1 we demonstrate that despite these differences, training QuReTeC on QuAC generalizes well to the CAsT dataset.

Table 5.4 provides statistics for the two datasets. First, we observe that the QuAC dataset is much larger than CAsT. Also, QuAC has a larger number of terms on average than CAsT (∼97 vs ∼40) and a larger negative-positive term label ratio.

Extrinsic evaluation – retrieval
We report NDCG@3 (the official TREC CAsT evaluation metric), Recall, MAP, and MRR at rank 1000. We also provide performance metrics averaged per turn, to show how retrieval performance varies across turns. We report on statistical significance with a paired two-tailed t-test; we depict a significant increase as ▲.

Intrinsic evaluation – query resolution
We report Micro-Precision (P), Micro-Recall (R) and Micro-F1 (F1), i.e., metrics calculated per query and then averaged across all turns and topics. We ignore queries that are the first turn of the conversation when calculating the mean, since we do not predict term labels for those. (The first turn in each topic does not need query resolution, because there is no conversation history at that point; the query resolution CAsT test set thus has 20 (the number of topics) fewer queries than in Table 5.3.)
We perform intrinsic and extrinsic evaluation by comparing against a number of query resolution baselines. Next, we provide a detailed description of each baseline:
• Original. This method uses the original form of the query. We explore different variations for constructing q̂_i: (1) current turn only (cur), (2) current turn expanded with the previous turn (cur+prev), (3) current turn expanded with the first turn (cur+first), and (4) all turns.
• RM3 [2]. A state-of-the-art unsupervised pseudo-relevance feedback model. RM3 first performs retrieval and treats the top-n ranked passages as relevant. Then, it estimates a query language model based on the top-n results, and finally adds the top-k terms to the original query. As with Original, we report on different variations for constructing the query: cur, cur+prev, cur+first and all turns. In order to apply RM3 for query resolution we append the top-k terms to the original query q_i to obtain q̂_i. (Given the very small size of the TREC CAsT training set, we do not compare to more sophisticated yet data-hungry pseudo-relevance feedback models such as [130].)
• NeuralCoref. A coreference resolution method designed for chatbots (https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30). It uses a rule-based system for mention detection and a feed-forward neural network that predicts coreference scores. We perform coreference resolution on the conversation history q_{1:i−1} and the current turn query q_i. The output q̂_i consists of q_i and the predicted terms in q_{1:i−1} that terms in q_i refer to.
• BiLSTM-copy [58]. A neural sequence to sequence model for query resolution. It uses a BiLSTM encoder and decoder, augmented with attention and copy mechanisms and a coverage loss [162]. It initializes the input embeddings with pretrained GloVe embeddings (https://nlp.stanford.edu/projects/glove/). Given q_{1:i−1} and q_i, it outputs q̂_i. It was optimized on the QuAC gold standard resolutions.

Intrinsic evaluation – query resolution
In order to perform intrinsic evaluation of the aforementioned baselines, we take the query resolution they output (q̂_i) and apply Equation (5.2), replacing q*_i with q̂_i, to obtain the set of predicted resolution terms.

Extrinsic evaluation – initial retrieval
Here, apart from the aforementioned baselines, we also use the following baselines:

• Nugget [69]. Extracts substrings from the current and previous turn queries to build a new query for the current turn. (We use the Nugget version that does not depend on anchor texts, since they are not available in our setting.)
• QCM [201]. Models the edits between consecutive queries and the result list returned by the previous turn query to construct a new query for the current turn.
• Oracle. Performs initial retrieval using the gold standard resolution query, released by the TREC CAsT organizers.
Extrinsic evaluation – reranking
Since developing specialized rerankers for multi-turn passage retrieval is not the focus of this chapter, we evaluate the reranking step using ablation studies. For reference, we also report on the performance of the top-ranked TREC CAsT 2019 systems [46]:

• TREC-top-auto. Uses an automatic system for query resolution and BERT-large for reranking.
• TREC-top-manual. Uses the gold standard query resolution and BERT-large for reranking.
Multi-turn passage retrieval

We index the TREC CAsT collections using Anserini (https://github.com/castorini/anserini), with stopword removal and stemming. In the initial retrieval step (Section 5.3.2) we retrieve the top 1000 passages using QL with Dirichlet smoothing. We use the default value of the fusion parameter, k = 60 [41], in Eq. (5.1). In the reranking step (Section 5.3.2) we use a PyTorch implementation of BERT for retrieval [112], with the bert-base-uncased pretrained BERT model. We fine-tune the BERT reranker on the MS MARCO passage ranking dataset [14]: we train on 100K randomly sampled training triples from its training set and evaluate on 100 randomly sampled queries from its development set. We use the Adam optimizer, with a lower learning rate for the BERT layers than for the output linear layer, on which we also apply dropout. We apply early stopping on the development set with a patience of 2 epochs, based on MRR.

Query resolution

We use the bert-large-uncased model. We implement QuReTeC on top of HuggingFace's PyTorch implementation of BERT (https://github.com/huggingface/transformers). We use the Adam optimizer and tune the learning rate. We use a batch size of 4 and apply gradient clipping. We apply dropout to the term classification layer and the BERT layers. We optimize for F1 on the QuAC dev (gold) set.

Baselines

For RM3, we tune the number of feedback passages n and the number of expansion terms k, and set the original query weight to its default value. For Nugget, we set k_snippet = 10 and tune θ. For QCM, we tune α, β, ε and δ. For both Nugget and QCM we use the implementation of Van Gysel et al. [181]. For a fair comparison, we retrieve over the whole collection rather than just reranking the top-1000 results. The aforementioned methods are tuned on the small annotated training set of TREC CAsT. For query resolution, we tune the greedyness parameter of NeuralCoref. We use the BiLSTM-copy model released by [58], as it was optimized specifically for QuAC with gold standard resolutions.

Preprocessing

We apply lowercasing, lemmatization and stopword removal to q*_i, q_{1:i−1} and q_i using spaCy (http://spacy.io/) before calculating the term overlap in Equation (5.2).

Results & Discussion

In this section we present and discuss our experimental results.
In this subsection we answer RQ4.1: we study how QuReTeC performs compared to other state-of-the-art methods when evaluated on term classification (Section 5.6.1), when incorporated in the initial retrieval step (Section 5.6.1), and when incorporated in the reranking step (Section 5.6.1).

Table 5.5: Intrinsic evaluation for query resolution on the QuAC test set. Cur, prev, first and all refer to using the current, previous, first or all turns, respectively.
Method                 P      R      F1
Original (cur+prev)    22.3   46.4   30.1
Original (cur+first)   41.1   49.5   44.9
Original (all)         12.3
Table 5.6: Intrinsic evaluation for query resolution on the TREC CAsT test set. Cur, prev, first and all refer to using the current, previous, first or all turns, respectively.

Method                 P      R      F1
Original (cur+prev)    32.5   43.9   37.4
Original (cur+first)   43.0   74.0   54.4
Original (all)         18.6
Intrinsic evaluation
In this experiment we evaluate query resolution as a term classification task. Table 5.5 shows the query resolution results on the QuAC dataset. We observe that QuReTeC outperforms all the variations of Original and NeuralCoref by a large margin in terms of F1, precision and recall – except for Original (all), which has perfect recall but at the cost of very poor precision. Also, QuReTeC substantially outperforms BiLSTM-copy on all metrics. Note that BiLSTM-copy was optimized on the same training set as QuReTeC (see Section 5.5.5). This shows that QuReTeC is more effective in finding missing contextual information from previous turns. (The performance of Original (cur) is zero by definition when using the current turn only (see Eq. 5.2); thus, we do not include it in Tables 5.5 and 5.6. Also, RM3 is not applicable in Table 5.5, since QuAC is not a retrieval dataset.)
Method                 Recall  MAP    MRR    NDCG@3
Original (cur)         0.438   0.129  0.310  0.155
Original (cur+prev)    0.572   0.181  0.475  0.235
Original (cur+first)   0.655   0.214  0.561  0.282
Original (all)         0.694   0.190  0.552  0.256
RM3 (cur)              0.440   0.140  0.320  0.158
RM3 (cur+prev)         0.575   0.200  0.482  0.254
RM3 (cur+first)        0.656   0.225  0.551  0.300
RM3 (all)              0.666   0.195  0.544  0.266
Nugget                 0.426   0.101  0.334  0.145
QCM                    0.392   0.091  0.317  0.127
NeuralCoref            0.565   0.176  0.423  0.212
BiLSTM-copy            0.552   0.171  0.403  0.205
QuReTeC
Oracle                 0.785   0.309  0.660  0.361
Query resolution for initial retrieval
In this experiment, we evaluate query resolution when incorporated in the initial retrieval step (Section 5.3.2). We compare QuReTeC to the baseline methods in terms of initial retrieval performance. Table 5.7 shows the results. First, we observe that QuReTeC outperforms all the baselines by a large margin on all metrics. Also, interestingly, QuReTeC achieves performance close to that of the Oracle (gold standard resolutions). Note that there is still plenty of room for improvement even when using Oracle, which indicates that exploring other ranking functions for initial retrieval is a promising direction for future work. QuReTeC outperforms all Original and RM3 variations, which perform similarly. The session search methods (Nugget and QCM) perform poorly even compared to the Original variations, which indicates that session search is different in nature from conversational search. BiLSTM-copy performs poorly compared to QuReTeC but also compared to the Original variations, which means that it does not generalize well to CAsT.
Table 5.8: Reranking performance on the TREC CAsT test set. All the methods in the first group use QuReTeC for query resolution. Significance is tested against BERT-base.

Method                       MAP    MRR    NDCG@3
Initial                      0.272  0.637  0.341
BERT-base                    0.272  0.693  0.408
RRF (Initial + BERT-base)

Oracle                       0.754  0.956  0.926
TREC-top-auto                0.267  0.715  0.436
TREC-top-manual              0.405  0.879  0.589
Query resolution for reranking
In this experiment, we study the effect of QuReTeC when incorporated in the reranking step (Section 5.3.2). We keep the initial ranker fixed for all QuReTeC models. Table 5.8 shows the results. First, we see that BERT-base improves over the initial retrieval model that uses QuReTeC for query resolution at the top positions (second line). Second, when we fuse the ranked list retrieved by BERT-base and the ranked list retrieved by the initial retrieval ranker using RRF, we significantly outperform BERT-base on all metrics (third line). This shows that the two rankers can be effectively combined with RRF, which is a very simple fusion method that only has one parameter, which we do not tune. We also see that our best model outperforms TREC-top-auto on all metrics. Furthermore, by comparing RRF (line 3) to Oracle (line 4) we see that there is still plenty of room for improvement for reranking, which is a clear direction for future work. This also shows that the TREC CAsT dataset is sufficiently challenging for comparing automatic systems. Note that TREC-top-manual uses the gold standard query resolutions and is thereby not directly comparable with the rest of the methods.
In this section we answer RQ4.2: Can we use distant supervision to reduce the amount of human-curated query resolution data required to train QuReTeC? Figure 5.3 shows the query resolution performance when training QuReTeC under different settings (see the figure caption for a more detailed description). For QuReTeC (distant full & gold partial) we first pretrain QuReTeC on distant and then resume training with different fractions of gold. First, we see that QuReTeC performs competitively with BiLSTM-copy even when it does not use any gold resolutions (distant full). More importantly, when only trained on distant, QuReTeC performs remarkably well in the low data regime. In fact, it outperforms BiLSTM-copy (trained on gold) even when using a surprisingly low number of gold standard query resolutions (200, which is ∼1% of gold).
Last, we see that as we add more labelled data, the effect of distant supervision becomes smaller. This is expected and is also the case for the model trained on QuAC train (gold); in fact, performance stabilizes after 15K query resolutions (∼75% of gold full; not shown in Figure 5.3). Also, when trained with distant full, QuReTeC performs better than an artificial method that uses the label of the distant supervision signal as the prediction (56.5 vs 41.6 in terms of F1). This is in line with previous work that successfully uses noisy supervision signals for retrieval tasks [49, 187].
Figure 5.3: Query resolution performance (intrinsic) on the QuAC test set under different supervision settings. Gold refers to the QuAC train (gold) dataset and distant refers to the QuAC train (distant) dataset. Full refers to the whole and partial to a part of the corresponding dataset (gold or distant). The x-axis is plotted in log-scale.

In order to test whether our distant supervision method can be applied to different encoders, we performed an additional experiment where we replaced BERT with a simple BiLSTM as the encoder in QuReTeC. Similarly to the previous experiment, we observed a substantial increase in F1 (+12 F1) when pretraining on distant and then training with 2K gold standard resolutions, compared to using only those gold resolutions.

In conclusion, our distant supervision method can be used to substantially decrease the amount of human-curated training data required to train QuReTeC. This is especially important in low resource scenarios (e.g. new domains or languages), where large-scale human-curated training data might not be readily available.
In this section we perform analysis on QuReTeC when trained with gold standard supervision.
Query resolution performance per turn
Here we answer
RQ4.3 by analyzing the robustness of QuReTeC at later conversation turns. We expect query resolution to become more challenging as the conversation history becomes larger (later in the conversation).
Figure 5.4: Intrinsic query resolution evaluation (term classification performance) for QuReTeC, averaged per turn: (a) QuAC test; (b) CAsT test.
Figure 5.5: Initial retrieval performance per turn for different query resolution methods on the CAsT test set: (a) Recall; (b) NDCG@5.
Intrinsic
Figure 5.4 shows the QuReTeC performance averaged per turn on the QuAC and CAsT datasets. Even though performance decreases towards later turns as expected, we observe that it decreases very gradually, and thus we can conclude that QuReTeC is relatively robust across turns.
Extrinsic – initial retrieval
Figure 5.5 shows the performance of different query resolution methods when incorporated in the initial retrieval step. We observe that QuReTeC is robust to later turns in the conversation, whereas the performance of all the baseline models decreases faster (especially in terms of recall). For reranking, we observe similar patterns as with initial retrieval; we do not include those results for brevity.
Qualitative analysis
Here we perform qualitative analysis by sampling specific instances from the data.
Intrinsic
Table 5.9 shows one success and one failure case for QuReTeC from the QuAC dev set. In the success case (top) we observe that QuReTeC succeeds in resolving "she" → { "Bipasha", "Basu" } and "reviews" → "Ajnabee". Note that "Ajnabee" is a movie in which Basu acted but is not mentioned explicitly in the current turn. In the failure case (bottom) we observe that QuReTeC succeeds in resolving "him" → { "Alexander", "Graham", "Bell" } but misses the connection between "this" and "dehusking machine".

Table 5.9: Qualitative analysis for QuReTeC on query resolution (intrinsic). We denote true positive terms with underline and false negative terms in italics. The examples are sampled from the QuAC dev set.
Success case – no mistakes
Q1: What was Bipasha Basu's debut?
A1: In 2001, Basu finally made her debut opposite Akshay Kumar in Vijay Galani's Ajnabee.
Q2: Did this help her become well known?
A2: It was a moderate box-office success and attracted unfavorable reviews from critics.
Q3 (current): Why did she receive unfavorable reviews?
Failure case – misses two relevant terms: dehusking, machine
Q1: How old was Alexander Graham Bell when he made his first invention?
A1: The age of 12.
Q2: What did he invent?
A2: Bell built a homemade device that combined rotating paddles with sets of nail brushes.
Q3: What was it for?
A3: A simple dehusking machine.
Q4 (current): By inventing this, what happened to allow him to continue inventing things?

Extrinsic – initial retrieval
Table 5.10 shows an example from the CAsT test set where QuReTeC succeeds and RM3 (cur+first), the best performing baseline for initial retrieval, fails. First, note that a topic change happens at Q7 (the topic changes from general real-time databases to Firebase DB). We observe that QuReTeC predicts the correct terms, and a relevant passage is retrieved at the top position. In contrast, RM3 (cur+first) fails to detect this topic change and therefore an irrelevant passage is retrieved at the top position that is about real-time databases on mobile apps but not about Firebase DB.
In this chapter, we studied the task of query resolution for conversational search. We proposed to model query resolution as a binary term classification task: whether to add terms from the conversation history to the current turn query. We proposed QuReTeC, a neural query resolution model based on bidirectional transformers. We proposed a distant supervision method to gather training data for QuReTeC. We found that QuReTeC significantly outperforms multiple baselines of different nature and is robust across conversation turns. Also, we found that our distant supervision method can substantially reduce the amount of gold standard query resolutions required for training QuReTeC, using only query-passage relevance labels. This result is especially important in low resource scenarios, where gold standard query resolutions might not be readily available.

As for future work, we aim to develop specialized rankers for both the initial retrieval and the reranking steps that incorporate QuReTeC in a more sophisticated way. Also, we want to study how to effectively combine QuReTeC with text generation query resolution methods as well as pseudo-relevance feedback methods. Finally, we aim to explore weak supervision signals for training QuReTeC [49].

In this chapter, we focused on how to improve interactive knowledge gathering and studied multi-turn passage retrieval as an instance of conversational search. In Chapter 6, we focus on a different research theme, namely supporting knowledge exploration for narrative creation.
Table 5.10: Qualitative analysis for initial retrieval (extrinsic) when using QuReTeC or RM3 (cur+first) for query resolution. The example is sampled from the TREC CAsT dataset.

Q1: What is a real-time database?
Q2: How does it differ from traditional ones?
Q3: What are the advantages of real-time processing?
Q4: What are examples of important ones?
Q5: What are important applications?
Q6: What are important cloud options?
Q7: Tell me about the Firebase DB?
Q8 (current): How is it used in mobile apps?

Predicted terms – QuReTeC: { "database", "firebase", "db" }
Top-ranked passage – QuReTeC: Firebase is a mobile and web application platform . . . Firebase's initial product was a realtime database, . . . Over time, it has expanded its product line to become a full suite for app development . . .

Predicted terms – RM3 (cur+first): { "real", "time", "database" }
Top-ranked passage – RM3 (cur+first):
There are two options in Jedox to access the central OLAP database and software functionality on mobile devices: Users can access reports through the touch-optimized Jedox Web Server . . . on their smart phones and tablets.

Part III
Supporting Knowledge Exploration for Narrative Creation

News Article Retrieval in Context for Event-centric Narrative Creation
In the third and final part of this thesis we study the research theme of supporting knowledge exploration for narrative creation. In this chapter, we address RQ5: Can we support knowledge exploration for event-centric narrative creation by performing news article retrieval in context?
Professional writers such as journalists generate narratives centered around specific events or topics. As shown in recent studies, such writers envision automatic systems that suggest material relevant to the narrative they are creating [51, 83]. This material may provide background information or connections that can help writers generate new angles on the narrative and thus help engage the reader [93].

Previous work has focused on developing automatic systems to support writers in exploring content relevant to the narrative they are writing about. Such systems use content originating from various sources such as social media [44, 52, 213], political speeches and conference transcripts [113], or news articles [114].

Writers in the news domain often develop narratives around a single main event, and refer to other, related events that can serve different functions in relation to the narrative [180]. These include explaining the cause or the context of the main event or providing supporting information, among others [37]. Recent work has focused on automatically profiling news article content (i.e., paragraphs or sentences) in relation to their discourse function [37, 206].

In this chapter, instead of profiling existing narratives, we consider a scenario where a writer has generated an incomplete narrative about a specific event up to a certain point, and aims to explore other news articles that discuss relevant events to include in their narrative. A news article that discusses a different event from the past is relevant to the writer's incomplete narrative if it relates to the narrative's main event and to the narrative's context. Relevance to the narrative's main event is topical in nature but, importantly, relevance to the narrative's context is not only topical: to be relevant to the narrative's context, an article should also enable the continuation of the narrative.
This chapter was published as [190].

Table 6.1: Example incomplete narrative q (consisting of a main event e and context c), and a news article d∗ that is relevant to q because it is relevant to both the main event e and to the narrative context c in the sense explained in the main text.

Incomplete narrative q
– Main event (e)
– Narrative context (c)
Relevant news article (d∗)

Here, the relevant news article is relevant to the topic of the incomplete narrative (migration crisis) and also relevant to the narrative context in the sense that it is used by the writer to expand the narrative by making a comparison: the previous government of Italy was more welcoming to immigrants than the current one. To avoid confusion, in the remainder of this chapter relevance without further restriction or scope is taken to mean both topical relevance and relevance to the narrative context.

We model the problem of finding a relevant news article given an incomplete narrative as a retrieval task where the query is an incomplete narrative and the unit of retrieval is a news article. We automatically generate retrieval datasets for this task by harvesting links from existing narratives manually created by journalists. Using the generated datasets, we analyze the characteristics of this task and study the performance of different rankers on this task. We find that state-of-the-art lexical and semantic rankers are not sufficient for this task and that combining those with a ranker that ranks articles by their reverse chronological order outperforms those rankers alone.

Our main contributions are: (i) we propose the task of news article retrieval in context for event-centric narrative creation; (ii) we propose an automatic retrieval dataset construction procedure for this task; and (iii) we empirically evaluate the performance of different rankers on this task and perform an in-depth analysis of the results to better understand the characteristics of this task.

A news article d published at time t consists of its headline H, which introduces the topic of the article [180], and a sequence of paragraphs p 1, p 2, . . . . Each paragraph p i consists of a sequence of sentences a i,1, a i,2, . . . .

The lead paragraph L of a news article d is its first paragraph p 1, which summarizes the main topic of the article [180].

An event e is characterized by interactions between entities, such as countries, organizations, or individuals, that deviate from typical interaction patterns [32]. We assume that each news article d is associated with a single main event e.

A link sentence a i,j in article d is a sentence that contains a hyperlink to a news article d∗.

A context is a sequence of sentences already generated by the writer that introduces a new idea or subtopic in a narrative.

A query q = (e, c, t) is an incomplete narrative at time t that consists of an event e and a context c.

The task of news article retrieval in context for event-centric narrative creation is defined as follows. Given a query q = (e, c, t) and a collection of news articles D published before time t, we need to rank articles in D w.r.t. their relevance to q = (e, c, t). Here, "relevance to e" is to be interpreted as topical, whereas "relevance to c" is not only topical, but it should also enable the continuation of the narrative by expanding the narrative discourse [31]. "Relevance to q" is taken to mean the same as "relevance to e and to c".
An article relevant to q can thus be used by the writer to create the next sentence in the yet incomplete narrative. Table 6.1 shows an example query q and a relevant news article d∗ published at time t∗ < t.

In order to construct a retrieval dataset for our news article retrieval task, we rely on existing news articles to simulate incomplete narratives as well as relevant documents. We capitalize on the fact that (complete) news articles often contain links to other news articles manually inserted by journalists in the form of hyperlinks.

The automatic retrieval dataset construction procedure that we propose takes as input a news article d and outputs a set of (q, d∗) pairs, where q = (e, c, t) is a query and d∗ is the (unique) relevant news article for q. We assume that the event e associated with d is described by the headline H and the lead paragraph L of d [37].

In order to construct the context c of q, we iteratively look for link sentences a i,j in d that contain a hyperlink to another news article d∗. We enforce i > 1 so that the paragraph where the link sentence appears is after the lead paragraph. We also enforce j > 1, motivated by the fact that links after the first sentence of a paragraph are tightly related to the main idea of the paragraph; therefore the sentences preceding the link sentence can be considered as context [73]. If such a link sentence a i,j exists, we consider the sentences a i,1, . . . , a i,j−1 as the narrative context c and the article d∗ as the relevant article for q.
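A minimal sketch of this construction procedure is given below, under the simplifying assumptions that an article is already segmented into paragraphs and sentences and that hyperlinks have been resolved to article identifiers; all names are illustrative.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

@dataclass
class Sentence:
    text: str
    linked_article_id: Optional[str] = None  # id of the hyperlinked article, if any

@dataclass
class Article:
    article_id: str
    published_at: str                 # publication time t
    headline: str                     # H
    paragraphs: List[List[Sentence]]  # paragraphs[0] is the lead paragraph L

def build_pairs(article: Article) -> Iterator[Tuple[dict, str]]:
    """Yield (query, relevant article id) pairs following the procedure above."""
    # The event e is described by the headline H and the lead paragraph L.
    event = article.headline + " " + " ".join(s.text for s in article.paragraphs[0])
    for i, paragraph in enumerate(article.paragraphs):
        if i == 0:
            continue  # i > 1: the link sentence must appear after the lead paragraph
        for j, sentence in enumerate(paragraph):
            if j == 0 or sentence.linked_article_id is None:
                continue  # j > 1, and the sentence must contain a hyperlink
            context = " ".join(s.text for s in paragraph[:j])  # sentences a_{i,1..j-1}
            query = {"event": event, "context": context, "time": article.published_at}
            yield query, sentence.linked_article_id
```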
Table 6.2: Statistics of the retrieval datasets derived from the WaPo and Guardian newspaper collections. Because of the way we construct the retrieval datasets (see Section 6.3.1), each query has a single relevant news article.

Dataset   Split   q       d       d∗      Link sentence (a i,j): i mean/median   j mean/median
WaPo      Train   32,963  23,537  24,279  7.9/7                                  2.5/2
          Dev.    1,831   1,286   1,585   8.4/8                                  2.4/2
          Test    1,832   1,216   1,555   9.1/9                                  2.4/2
Guardian  Train   31,329  21,730  22,935  7.3/6                                  2.4/2
          Dev.    1,740   1,128   1,526   8.0/7                                  2.4/2
          Test    1,742   1,064   1,532   7.3/7                                  2.5/2
Example
To illustrate the procedure described above, consider the example in Table 6.1. The sentences shown in Table 6.1 constitute the main event e (headline and lead paragraph) and the narrative context c of d, respectively; the context consists of the sentences a i,1, . . . , a i,j−1 of a paragraph p i, i > 1, in d. The link sentence a i,j (not shown in the table) is: "But over the past year, Italy has closed its ports to migrants rescued by humanitarian boats.", where the part in italics is (the anchor text of) a hyperlink to the relevant news article d∗ shown in Table 6.1.

We consider two collections of news articles written in English and published by major newspapers. The first is a set of news articles published by The Washington Post (WaPo), released by the TREC News Track [169]. It contains 671,947 news articles and blog posts published from January 2012 to December 2019. The second is a set of news articles published by The Guardian between November 2013 and June 2017, which we crawl ourselves. We also crawl the out-links of each article in this set; the final set contains 572,639 news articles published between January 2000 and March 2018.

The articles in both newspapers cover multiple genres and domains. In order to ensure that the news articles describe real-world events, we filter out blog posts and opinion news articles, and only keep articles in the following domains: news, world, business, environment, technology, society, science, culture, education, global, healthcare, media, money, teacher, local, national. After filtering for genre and domain, we are left with 386,196 articles in WaPo and 185,034 in The Guardian.
We then apply the dataset construction procedure described in Section 6.3.1 to construct a retrieval dataset for both collections. We split the retrieval datasets chronologically and keep the first 90% for training, the next 5% for development, and the last 5% for testing. Table 6.2 shows basic statistics for both retrieval datasets. Figure 6.1 shows a histogram of the number of tokens in the query event e and the query context c. We observe that the query context is shorter than the query event in both datasets. Also, the query event is longer in WaPo than in Guardian because the way those newspapers perform paragraph splitting is different.

Figure 6.1: Histogram of the number of tokens in the query event e and the query context c: (a) WaPo dev.; (b) Guardian dev.

Figure 6.2 shows a histogram of the difference in number of days between the publication date of the query and the publication date of the relevant news article on the development sets of the two datasets. The retrieval datasets have a strong recency bias, which is in line with studies on content generation in the news domain [129]. Typical examples of recent, relevant articles are those discussing a previous development of a query event or of an event mentioned in the narrative context. And a typical example of a less recent, relevant article can be found when discussing an event that is similar to one mentioned in the query (e.g., an earthquake) but involving different entities (e.g., a person, location, or organization).

Figure 6.2: Histogram of the day difference between the query and its relevant news article: (a) WaPo dev.; (b) Guardian dev.

The dataset construction procedure we described in Section 6.3.1 assumes that an article d∗ is relevant to q since the writer has chosen to link to it in a particular context, which is a fair assumption to make. Nevertheless, we further assess the quality of the automatically constructed retrieval datasets with respect to our task definition (Section 6.2.2) by performing two annotation tasks. In the first task, we show e and d∗ to a human annotator and ask whether they understand their connection (binary). In the second task, which is done after the completion of the first task, we additionally show the context c and ask whether it enhances their understanding of the connection of e and d∗ (binary). The two tasks can help us validate whether d∗ is topically relevant to e, and relevant to c in a way that enables the continuation of the narrative (Section 6.2.2).

One annotator annotated 100 examples from the development set of each dataset (i.e., 200 examples in total). In order to assess the quality of the annotations, a second assessor annotated a subset of 50 examples from each dataset (100 examples in total). The Cohen's κ [40] score is 0.61 for Task 1 and 0.50 for Task 2, both of which are considered moderate agreement. The results can be seen in Table 6.3.

Table 6.3: Results of the annotation exercise: assessing relevance of document d∗ w.r.t. e only (Task 1), and then c (Task 2). We show the fraction of times the annotator labeled a sample as positive for the task.

Dataset    Task 1  Task 2  Either
WaPo       0.90    0.77    0.91
Guardian   0.85    0.83    0.92
We see that, for both datasets, the context c enhances the understanding of the connection to d∗ for more than 3/4 of the cases (Task 2). Also, for the vast majority of the cases, either the event e or the context c is sufficient to understand the connection (third column). We conclude that the automatic dataset construction procedure we proposed in Section 6.3.1 can produce reliable datasets for this task.

We follow a standard two-step retrieval pipeline that consists of (1) an unsupervised initial retrieval step and (2) a re-ranking step [192]. Note that we do not focus on proposing new methods but rather on studying existing ones on this novel task.
In this step, we score each news article d in D w.r.t. q = (e, c, t) to obtain the initial ranked list L. Here, we are interested in achieving high recall at lower depths in the ranking, since this step is followed by a more sophisticated reranking step. We use BM25 [153], an unsupervised lexical matching function, which is effective for ad-hoc retrieval and other tasks, such as question answering [202]. In order to construct the lexical query, we simply concatenate e and c.

Here we rerank the initial ranked list L obtained in the previous step by combining the results of multiple rankers using Reciprocal Rank Fusion (RRF), an unsupervised ranking fusion function [41]:

RRF(d) = Σ_{L ∈ ℒ} 1 / (k + rank(d, L)),   (6.1)

where ℒ is a set of ranked lists, rank(d, L) is the rank of article d in the ranked list L, and k is a parameter, set to its default value (60).
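A small sketch of Reciprocal Rank Fusion as in Equation 6.1, for illustration only; the function and variable names are our own.

```python
def rrf_fuse(ranked_lists, k: int = 60):
    """Fuse ranked lists of article ids with Reciprocal Rank Fusion (Equation 6.1)."""
    scores = {}
    for ranked in ranked_lists:                       # each list ranks the same candidate set
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Tiny example with three rankings of the same three candidates:
fused = rrf_fuse([["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d3", "d2", "d1"]])
```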
We use the following rankers:

BM25
The initial retrieval step ranker (Section 6.4.1), often used in combination with more sophisticated ranking models [111].
BERT
BERT [50] has recently achieved state-of-the-art performance for retrieval and recommendation tasks in the news domain [198, 203]. BERT has been shown to prefer semantic matches and it is often used in combination with lexical matching ranking functions [141]. Given the query q and a candidate news article d, we follow [113] and construct the input to BERT as follows: [
Recency
This ranker simply sorts the candidate articles in L by their reverse chronological order.

Note that we have also experimented with combining the scores of the above rankers as features in supervised learning-to-rank models, but they only gave minor improvements over RRF. Thus we do not discuss them in this chapter.

We use standard IR metrics: Mean Reciprocal Rank (MRR) and recall at different cut-offs (R@20, R@1000). Note that because of the way we construct our dataset (Section 6.3.1), we only have one relevant news article per query and thus MRR is equivalent to MAP. We use a cut-off of 20 for recall since we expect writers to be willing to navigate the ranked list to lower positions [91]. We report on statistical significance with a paired two-tailed t-test.
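Since each query has exactly one relevant article, MRR and recall at a cut-off reduce to very simple computations; a rough sketch with our own helper names is shown below.

```python
def mrr(ranked_ids, relevant_id):
    """Reciprocal rank of the single relevant article (0 if it was not retrieved)."""
    return 1.0 / (ranked_ids.index(relevant_id) + 1) if relevant_id in ranked_ids else 0.0

def recall_at_k(ranked_ids, relevant_id, k):
    """With a single relevant article per query, R@k is 1 if it appears in the top k."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

# Per-query values are then averaged over all queries to obtain MRR, R@20 and R@1000.
```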
We use the BM25 implementation of Anserini [202] with default parameters and retrieve the top-1000 articles (Section 6.4.1).

We use the OpenNIR implementation of BERT for retrieval [111]. We fine-tune the bert-base pre-trained model on the training set of each of our datasets separately. We assign a maximum of 300 tokens for the query q and 200 for the article d. We use a batch size of 16 with gradient accumulation of 2, we apply a max grad norm of 1, and we tune the following hyperparameters for MRR on the development set of each dataset separately: number of negatives { , , } and learning rate { e − , e − , e − }. During training we sample one negative example from the initial ranked list obtained in Section 6.4.1, and train the model with a pairwise ranking loss.
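A rough sketch of the pairwise training objective described above (not the OpenNIR code): a cross-encoder scores the query paired with a relevant and with a sampled non-relevant article, and a pairwise hinge loss pushes the relevant score above the non-relevant one. The model name bert-base-uncased, the class names, and the margin value are assumptions; tokenization and batching are omitted.

```python
import torch
from torch import nn
from transformers import BertModel

class CrossEncoderScorer(nn.Module):
    """Scores a (query, article) pair with BERT's [CLS] representation (sketch)."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.score = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        cls = self.bert(input_ids, attention_mask=attention_mask)[0][:, 0]
        return self.score(cls).squeeze(-1)

def pairwise_loss(pos_scores, neg_scores, margin: float = 1.0):
    # Hinge-style pairwise ranking loss: the relevant article should outscore the negative.
    return torch.clamp(margin - (pos_scores - neg_scores), min=0.0).mean()
```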
Preprocessing and word vectors
We use spaCy for sentence splitting, POS tagging and Named Entity Recognition. We use the en_core_web_lg model to obtain word vectors.
In this section we present our experimental results.

We examine the performance of the initial retrieval step when different variations of the query q are used. Table 6.4 shows the results. We observe that, for both datasets, when using both the event e and the context c we get better results than when using either of the two alone, especially in terms of R@1000. This shows that both the event e and the context c are important for our task.

In Table 6.4 (bottom row) we also show ranking performance when using the link sentence as the query (see Section 6.3.1). Even though we do not use the link sentence as part of the query in our task definition (Section 6.2.2), this can give us a reference point for the "upper bound" performance in this step, since the link sentence has a high lexical overlap with the relevant article d∗ [136]. We observe that, indeed, when using the link sentence as the query, ranking performance is much higher than when using q, achieving close to perfect R@1000. Nevertheless, R@1000 when using e & c is relatively close to that when using LS, which is an encouraging result given that in this step we are more interested in recall.

Table 6.4: Initial retrieval performance of BM25 on the test sets for different variations of the query q = (e, c, t), or the link sentence (LS).
             WaPo               Guardian
Query        MRR    R@1000      MRR    R@1000
e
c
e & c
LS           0.459  0.944       0.427  0.929

Table 6.5: Retrieval performance when reranking the ranked list obtained by BM25 (first row).

             WaPo               Guardian
Method       MRR    R@20        MRR    R@20
BM25         0.172  0.433       0.149  0.382
Recency      0.086  0.284       0.065  0.065
BERT         0.182  0.451       0.173  0.447
RRF-recency  0.206  0.509       0.195  0.477
RRF

Here, we report results on the individual rankers described in Section 6.4.2 and their combinations with RRF. Table 6.5 shows the results. First, we see that the performance of the Recency ranker is poor. Also, we see that BERT outperforms BM25 on both datasets, while only using the headline and the lead of the candidate news article. RRF-recency, which combines BERT and BM25, achieves an increase over BERT. Finally, when also adding the Recency ranker in RRF, we observe a significant (p < . ) increase on all metrics. We conclude that RRF, albeit simple, is effective in combining the three rankers and that all three rankers are useful for this task.

In this section we analyze our results along different dimensions to gain further insights into this task. For our analysis we use the development sets of the WaPo and Guardian datasets.
Figure 6.3: MRR vs Jaccard similarity between query q and d∗: (a) WaPo; (b) Guardian.
Figure 6.4: MRR vs Jaccard similarity between the narrative's context c and d∗: (a) WaPo; (b) Guardian.

The vocabulary gap is a well-known challenge in information retrieval [103]. Here, we analyze the performance of the rankers under comparison for this task based on the vocabulary gap between the query q and the relevant article d∗.

In Figure 6.3 we observe that the higher the lexical overlap between q and d∗ (small vocabulary gap), the higher the performance for all rankers, for both datasets. Also, we see that when the lexical overlap is low (large vocabulary gap), all rankers fail to bring the relevant article to the top positions of the ranking. This shows that more sophisticated methods are needed to handle the large vocabulary gap in this task. In Figure 6.4 we show the lexical overlap between the narrative's context c only and the relevant d∗.
Figure 6.5: MRR for retrieval methods grouped per day difference between the query and the relevant article: (a) WaPo; (b) Guardian.

Even though it follows the same trend as in Figure 6.3, we see that BERT is consistently better than BM25 as the term overlap between the narrative's context c and d∗ increases, for both datasets. This shows that BERT is able to better take into account the narrative's context c than BM25.

We next show examples of high/low lexical overlap between q and d∗ in Table 6.6. In the first example (high lexical overlap), we see that because of the high term overlap, all rankers are able to rank d∗ at the top 1–2 positions. In the second example (low lexical overlap), the relevant article d∗ discusses the execution of Alfredo Prieto: this is a case in which Morrogh, a prosecutor in Virginia, was involved (Morrogh is mentioned in the narrative's context c). However, the fact that Morrogh is involved in the case is not mentioned explicitly in d∗ and thus all rankers fail to rank the relevant article at the top positions. Incorporating the fact that Morrogh is related to Prieto in the ranking model could potentially be achieved by exploiting knowledge graphs that store event information [67, 154]. We leave the exploration of this direction for future work.

As discussed in Section 6.3.1, the retrieval datasets we derived for this task have a strong recency bias. Here, we analyze the performance of the rankers under comparison based on the temporal aspect, i.e., how recent the relevant article is.

In Figure 6.5 we show the performance of the retrieval methods for different day differences between the query q and the relevant article d∗. As expected, we observe that for RRF, which uses the recency signal, the performance increases substantially on average when the relevant article is recent, and decreases when it is older.

We next look at specific examples to better understand the results. Table 6.7 shows examples where the relevant article is recent and RRF ranks it at the top of the ranking, while RRF-recency ranks it lower. In both examples, RRF-recency's top-ranked article seems to also be relevant to q; however, the writer chose to refer to a more recent event [129]. Note that the fact that only one article is relevant to each query is an artifact of our dataset and not of the task itself. Table 6.8 shows examples where the relevant article is old and RRF-recency ranks it at the top of the ranking, while RRF ranks it lower.
Table 6.6: Examples from the WaPo dev. set with high/low lexical overlap between q and d∗ (top/bottom).

Table 6.7: Examples from the WaPo dev. set with a recent relevant article, where RRF ranks the relevant article at the top while RRF-recency ranks it lower.

Table 6.8: Examples from the WaPo dev. set with an old relevant article, where RRF-recency ranks the relevant article at the top while RRF ranks it lower.
Figure 6.6: MRR vs average IDF of the entities in the query q: (a) WaPo; (b) Guardian.

In the first example, the relevant article discusses a development on the injury of Scherzer, a player of the Washington Nationals, and RRF-recency correctly brings that article to the top position. However, RRF ranks at the top position a more recent event that discusses an injury of a different player of the same team. In the second example, RRF brings to the top position an article that discusses an event about India that is more recent than the one the relevant article discusses; however, that article is off-topic. The above phenomena suggest that more sophisticated methods that model recency should be explored for this task. For instance, it would be interesting to try to predict which queries are of a temporal nature based on the characteristics of the underlying collection [90]. However, methods that build on features derived from user interactions are not applicable to our setting [55].

Entities play a central role in event-centric narratives, especially in the news domain [154]. We examine whether entity popularity affects retrieval performance in our task by measuring the IDF (Inverse Document Frequency) of entities in the query [115]. An entity with a high IDF in the collection is less popular than an entity with a low IDF.

In Figure 6.6 we show the performance depending on the average IDF of the entities in the query in the underlying collection. We observe that the rankers that use the query and article text (BM25, BERT, RRF-recency) perform worse for queries with more popular entities (low IDF) than for queries with less popular entities. This is because popular entities appear in multiple events, and thus there are many potentially relevant articles for a query. We also see that RRF, which takes recency into account, is more robust to entity popularity. This might also be related to the fact that a recent event that involves a popular entity is more likely to be relevant in general than a less recent event that involves the same entity (also see the examples in Section 6.7.2, Table 6.7).
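For illustration, one way to compute the average entity IDF used in this analysis is sketched below; the exact IDF formulation of [115] may differ, so treat this as an assumption, and the helper names are our own.

```python
import math
from collections import Counter

def avg_entity_idf(query_entities, doc_freq: Counter, n_docs: int) -> float:
    """Average IDF of the query entities; higher values indicate less popular entities."""
    if not query_entities:
        return 0.0
    idfs = [math.log(n_docs / (1 + doc_freq[e])) for e in query_entities]
    return sum(idfs) / len(idfs)

# doc_freq maps an entity string to the number of articles it appears in, e.g.:
# avg_entity_idf({"Firebase", "Italy"}, doc_freq, n_docs=386196)  # WaPo collection size
```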
Recall that we do not use the link sentence as part of the query (see Section 6.3.1). Thus, our rankers are not aware of its content. However, we found that in some cases the link sentence contains information that is crucial for the connection between the complete narrative and the relevant news article. Thus, in such cases, the query event e and the narrative's context c are not sufficient. Table 6.9 shows examples of such cases. Note that in the first example, the relevant article was not even retrieved in the top-1000 of the initial retrieval step (see Section 6.4.1). In the second example, the relevant article is ranked very low by all rankers.

One direction for future work would be to detect parts of the link sentence that contain such crucial information and add them to the narrative's context c. This could be performed as a manual annotation task or modeled as a prediction task [87].

Recent work on developing automatic applications to support writers has focused on designing tools that track and filter information from social media to support journalists [52, 213]. Cucchiarelli et al. [44] track the Twitter stream and Wikipedia edits to suggest potentially interesting topics that relate to a new event that a writer can include in their narrative when reporting on the event. In contrast, instead of relying on external sources, we aim to retrieve news articles that describe events from the past that can help the writer expand the incomplete narrative about a specific event.

Perhaps the closest to our task are the works by Maiden and Zachos [114] and MacLaughlin et al. [113]. Maiden and Zachos [114] focus on suggesting articles that would help journalists discover new, creative angles on a current incomplete narrative. The difference with our work is that they aim to suggest creative angles on articles and retrieve articles depending on the angle the writer selects. In addition, they evaluate their system in a living lab scenario, whereas we create static retrieval datasets from historical data and use them to train ranking functions. Evaluating our system in a living lab scenario would be a promising direction for future work.

MacLaughlin et al. [113] retrieve paragraphs that contain quotes from political speeches and conference transcripts, so that writers can use them in their incomplete narratives. Even though their retrieval task definition is similar to ours, our task differs in that our unit of retrieval is a news article from a large news article collection instead of a paragraph from a single document (e.g., a political speech). Moreover, our unit of retrieval (article) is timestamped, which makes the temporal aspect prominent in our task.
The task of context-aware citation recommendation is to find articles that are relevant to a specific piece of text a writer has generated [76]. It has mainly been studied in the scientific domain [57, 82, 86, 158], but also in the news domain [102].

Table 6.9: Examples from the WaPo dev. set where the link sentence contains crucial information for the connection of the complete narrative and the relevant article.
S t unn i ng s e t b ac k s i n T u r k e y ’ s e l ec ti on s d e n t E r- dog a n ’ s a u r a o f i nv i n c i b ilit y . I S T AN B U L — T u r k i s h P r e s i d e n t R ece p T a yy i p E r dog a n f ace d t h e p r o s p ec t M ond a yo f a s ti ng i ng e l ec - t o r a l d e f ea ti n I s t a nbu l , t h ec it y w ho s e po liti c s h e do m i n a t e d f o r a qu a r t e r o f ace n t u r y , w it hvo t e r e s u lt ss ho w i ng w h a t a pp ea r e d t ob ea noppo s iti onv i c t o r y i n t h e r ace f o r t h ec it y ’ s m a yo r . . News Article Retrieval in Context for Event-centric Narrative Creation main difference of the aforementioned works and our task is that we aim to retrievearticles to expand existing incomplete narratives instead of finding citations for completenarratives. Events are the starting points of narrative news items. Recent work has focused onextracting and characterizing events from large streams of documents [32] and extractingthe most dominant events from news articles [36]. In our work, we assume that a newsarticle is associated with a single main event, which is described by the article’s headlineand lead paragraph [37].More related to our task is work focused on retrieving events given a query event [102,163]. However, this work does not consider additional context in the query as we doand thus it is not directly comparable to ours.
Conclusion and Future Work

In this chapter, we proposed and studied the task of news article retrieval in context for event-centric narrative creation. We proposed an automatic dataset construction procedure and showed that it can generate reliable evaluation sets for this task. Using the generated datasets, we compared lexical and semantic rankers and found that they are insufficient. We found that combining those rankers with one that ranks articles by their reverse chronological order significantly improves retrieval performance over those rankers alone.

Our analysis showed that the vocabulary gap for this task is large, and that therefore more advanced methods for semantic matching are needed. This could be achieved by exploiting external knowledge about events stored in knowledge graphs [67]. To this end, we aim to build on insights gained from our studies in Chapters 2, 3 and 4 to improve semantic matching. For instance, we could first detect KG facts in the query and the articles and then use the method we proposed in Chapter 2 to retrieve descriptions of the detected KG facts. The retrieved descriptions can then be used to provide additional knowledge to the BERT ranker and thus improve semantic matching [137].

Moreover, our analysis showed that the temporal aspect is prominent in this retrieval task, which was not the case for the tasks we studied in the previous chapters of this thesis. Therefore, future work should aim to find more robust ways to incorporate the temporal aspect in the ranking function [90].

Furthermore, we found that this task is more challenging when the query event involves entities that appear more frequently in the collection, which we plan to study further in the future. Another direction for future work is to categorize queries in relation to their discourse function in the narrative [174, 175], for example in relation to their function with respect to the main event of the narrative [37], and to develop specialized rankers for each category.

We found that in some cases the link sentence contains crucial information for the connection between the complete narrative and the relevant news article. However, since the link sentence is not part of the query according to our dataset construction procedure, the constructed query may miss this key piece of information. For future work, we aim to address this limitation by detecting such information in the link sentence and adding it to the query, or by using natural language generation techniques to fill in these blanks automatically.

Finally, it is important to note that even though our dataset construction procedure can generate reliable retrieval datasets, the fact that we only have a single relevant article for each query may be limiting, as more than one article may be relevant. Thus, some of our findings might be an artifact of that procedure rather than of the task itself. We plan to overcome this limitation in future work by asking journalists to qualitatively assess the output of different rankers, thereby enriching the automatically constructed datasets with more relevant articles per query [113, 114].
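As a concrete illustration of the ranker combination summarized above, the following Python sketch fuses a lexical run, a semantic run, and a reverse-chronological (recency) run with reciprocal rank fusion. The function names, the fusion constant k=60 and the toy document ids are illustrative assumptions, not the implementation used in this chapter.

```python
from collections import defaultdict
from datetime import datetime


def rrf_fuse(rankings, k=60):
    """Combine ranked lists with reciprocal rank fusion: score = sum 1/(k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score means a higher final position.
    return sorted(scores, key=scores.get, reverse=True)


def recency_ranking(candidates, publication_date):
    """Rank candidate articles in reverse chronological order."""
    return sorted(candidates, key=lambda d: publication_date[d], reverse=True)


# Usage: fuse a lexical run, a semantic run, and the recency ordering.
bm25_run = ["d3", "d1", "d2"]
bert_run = ["d1", "d3", "d2"]
pub_date = {"d1": datetime(2018, 3, 1), "d2": datetime(2017, 5, 2), "d3": datetime(2019, 1, 9)}
recency_run = recency_ranking(["d1", "d2", "d3"], pub_date)
print(rrf_fuse([bm25_run, bert_run, recency_run]))
```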
Conclusions
In this thesis, we studied three research themes aimed at supporting search engines with knowledge and context: (1) making structured knowledge more accessible to the user, (2) improving interactive knowledge gathering, and (3) supporting knowledge exploration for narrative creation. We studied several algorithmic tasks within these themes and proposed solutions to address them.

In this concluding chapter, we first revisit the research questions that we introduced in Chapter 1 and describe our main findings in Section 7.1. In Section 7.2, we discuss limitations and future directions.
Main Findings

Within the first research theme, making structured knowledge more accessible to the user, we asked and answered three research questions motivated by the need to present knowledge graph (KG) facts to users in a natural way. In Chapter 2, we asked the following question:
RQ1
Given a KG fact and a text corpus, can we retrieve textual descriptions of the fact from the text corpus?

To answer this question, we formalized the task of retrieving textual descriptions of KG facts from a corpus of sentences. We developed a method for this task that consists of two steps: we first extract and enrich candidate sentences from the corpus and then rank them by how well they describe the KG fact. In the first step, we detect sentences in the corpus that contain surface forms of either of the two entities in the KG fact and apply coreference resolution and entity linking to enrich them. In the second step, we rank the extracted sentences using learning to rank, combining a rich set of features of different types. To evaluate our method, we constructed a manually annotated dataset that contains descriptions of KG facts that involve people. We found that our method improves performance over state-of-the-art sentence retrieval methods and that all groups of features contribute to retrieval performance, with relation-based features being the most important. Moreover, we found that training relationship-dependent rankers is beneficial for improving retrieval performance. Importantly, we also found that almost one third of the facts in our dataset did not correspond to any relevant sentence in the corpus. This is usually the case for facts whose entities are less popular. Not being able to provide a meaningful description of certain facts limits the applicability of our method in real-world scenarios.
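As an illustration of the two-step pipeline summarized above, the following is a minimal Python sketch: candidate sentences that mention a surface form of either fact entity are extracted and then scored with a weighted combination of features. The simple substring matching, the linear scorer and the toy feature are simplifying assumptions; the actual method additionally applies coreference resolution and entity linking and learns the ranker from data.

```python
def extract_candidates(corpus_sentences, surface_forms):
    """Keep sentences that contain a surface form of either fact entity."""
    candidates = []
    for sentence in corpus_sentences:
        if any(form.lower() in sentence.lower() for form in surface_forms):
            candidates.append(sentence)
    return candidates


def score_candidate(sentence, fact, feature_fns, weights):
    """Combine per-feature scores; weights would be learned, not hand-set."""
    return sum(w * fn(sentence, fact) for fn, w in zip(feature_fns, weights))


def rank_descriptions(corpus_sentences, fact, surface_forms, feature_fns, weights):
    candidates = extract_candidates(corpus_sentences, surface_forms)
    return sorted(candidates,
                  key=lambda s: score_candidate(s, fact, feature_fns, weights),
                  reverse=True)


# Usage with a single toy feature: whether the sentence mentions both entities.
fact = {"subject": "Entity A", "predicate": "spouse", "object": "Entity B"}
both_mentioned = lambda s, f: float(f["subject"] in s and f["object"] in s)
corpus = ["Entity A married Entity B in 2001.", "Entity A is a painter."]
print(rank_descriptions(corpus, fact, ["Entity A", "Entity B"], [both_mentioned], [1.0]))
```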
This finding led us to the following research question in Chapter 3:

RQ2
Given a KG fact, can we automatically generate a textual description of the fact in the absence of an existing description?

To answer this question, we formalized the task of generating textual descriptions of KG facts. We proposed a method that first generates sentence templates for a specific relationship and then, given a specific KG fact, selects the most relevant template and fills it with information from the KG to create a novel sentence. In order to create sentence templates, we designed a graph-based algorithm that combines information contained in existing sentences and the KG. In order to select the most relevant template for a KG fact, we designed a supervised feature-based scoring function. To evaluate our method, we automatically extracted a dataset for KG fact description generation and performed both automatic and manual evaluation. We found that our method can generate grammatically correct and generally informative descriptions, and that a supervised scoring function outperforms an unsupervised one for selecting templates. In addition, our error analysis showed that generating KG fact descriptions that are valid under the KG closed-world assumption is challenging and needs to receive more attention.
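The following minimal sketch illustrates the selection-and-filling step described above: given templates for a relationship, pick the highest-scoring one and instantiate its slots with entity names from the fact. The slot markers, the toy fact and the heuristic scoring function are illustrative assumptions; the thesis uses a supervised, feature-based scoring function.

```python
def fill_template(template, fact):
    """Replace the subject and object slots with the fact's entity names."""
    return (template.replace("[S]", fact["subject"])
                    .replace("[O]", fact["object"]))


def generate_description(fact, templates, score_fn):
    """Select the most relevant template for the fact and instantiate it."""
    best = max(templates, key=lambda t: score_fn(t, fact))
    return fill_template(best, fact)


# Usage with a hypothetical fact and templates for the "spouse" relationship.
fact = {"subject": "Entity A", "predicate": "spouse", "object": "Entity B"}
templates = ["[S] is married to [O].", "[S] and [O] married in [date]."]
# Heuristic stand-in for the supervised scorer: prefer templates with fewer open slots.
print(generate_description(fact, templates, lambda t, f: -t.count("[")))
```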
Next, in Chapter 4 we turned to a closely related problem and asked the following question:

RQ3
Can we contextualize a KG query fact by retrieving other, related KG facts?

To answer this question, we formalized the problem of contextualizing KG facts as a retrieval task. We designed NFCM, a neural fact contextualization method that first generates a set of candidate facts that are part of the immediate neighborhood of the query fact in the KG, and subsequently ranks the candidate facts by how relevant they are to the query fact. We designed a neural network ranking model that combines information from multiple paths connecting the query and the candidate facts in the KG, using recurrent neural networks to learn features automatically. We further augmented the representation power of this model with existing and novel hand-crafted features. Since it is expensive to manually obtain human-curated training data for this model, we turned to distant supervision to automatically generate training data for this task. We evaluated NFCM using a human-curated dataset separate from the one used for distant supervision. We found that, when trained with distant supervision, NFCM significantly outperforms several heuristic baselines on this task. Additionally, we found that NFCM benefits from both automatically learned and hand-crafted features. Finally, we found that NFCM is relatively robust to the amount of training data available for each relationship.
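The following minimal sketch illustrates the candidate generation step described above: enumerate facts in the immediate KG neighborhood of the query fact, i.e., triples that share an entity with it. The triple-list representation of the KG is an assumption made for illustration, and the neural ranking step of NFCM is not shown.

```python
def candidate_facts(kg_triples, query_fact):
    """Return triples in the immediate neighborhood of the query fact."""
    subj, _, obj = query_fact
    neighborhood = []
    for triple in kg_triples:
        if triple == query_fact:
            continue
        # A candidate shares its subject or object with the query fact.
        if subj in (triple[0], triple[2]) or obj in (triple[0], triple[2]):
            neighborhood.append(triple)
    return neighborhood


# Usage with a toy KG of (subject, predicate, object) triples.
kg = [("e1", "spouse", "e2"), ("e1", "born_in", "e3"), ("e4", "capital_of", "e5")]
print(candidate_facts(kg, ("e1", "spouse", "e2")))  # [("e1", "born_in", "e3")]
```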
We then moved to the theme of improving interactive knowledge gathering and studied multi-turn passage retrieval as an instance of conversational search. In Chapter 5, we asked the following question:
RQ4
Can we use query resolution to identify relevant context and thereby improve retrieval in conversational search?

To answer this question, we formulated the task of query resolution for conversational search as a term classification task. We proposed QuReTeC, a neural term classification model based on bidirectional transformers, more specifically BERT. QuReTeC encodes the conversation history and the current turn query and predicts which terms from the history are relevant to the current turn. We integrated QuReTeC in a standard, two-step retrieval pipeline by appending the terms predicted as relevant to the current turn query. We performed evaluation both in terms of term classification and retrieval performance using a recently constructed multi-turn passage retrieval dataset. We found that QuReTeC significantly outperforms state-of-the-art methods on this task when trained on gold standard query resolutions. Furthermore, we found that QuReTeC is robust across conversation turns. Since collecting such gold standard query resolutions for training QuReTeC might be cumbersome, we designed a distant supervision method that automatically generates training data for query resolution using query-passage relevance labels. We found that this distant supervision method can substantially reduce the number of gold standard query resolutions required for training QuReTeC, a result that is especially important in low-resource scenarios.
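The following minimal sketch shows how query resolution as term classification plugs into a retrieval pipeline, as summarized above: a term classifier (QuReTeC uses BERT; here an arbitrary callable stands in for it) labels each history term, and the terms predicted as relevant are appended to the current turn query before retrieval. The whitespace tokenization and the stand-in classifier are illustrative assumptions.

```python
def resolve_query(history_turns, current_query, is_relevant_term):
    """Append history terms predicted as relevant to the current turn query."""
    history_terms = []
    for turn in history_turns:
        for term in turn.split():
            if term not in history_terms:
                history_terms.append(term)
    current_terms = set(current_query.split())
    added = [t for t in history_terms
             if t not in current_terms and is_relevant_term(t, history_turns, current_query)]
    return current_query + (" " + " ".join(added) if added else "")


# Usage with a hypothetical classifier standing in for QuReTeC's predictions.
history = ["Tell me about the Eiffel Tower", "Who designed it"]
classifier = lambda term, hist, query: term in {"Eiffel", "Tower"}
print(resolve_query(history, "When was it built", classifier))
# -> "When was it built Eiffel Tower", which is then passed to the retrieval pipeline.
```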
Our next study was in the theme of supporting knowledge exploration for narrative creation. In Chapter 6, we asked the following question:
RQ5
Can we support knowledge exploration for event-centric narrative creation by performing news article retrieval in context?

To answer this question, we formalized the task of event-centric news article retrieval in context. We proposed an automatic retrieval dataset construction procedure that can produce reliable datasets for this task. We generated two retrieval datasets using this procedure and used them to evaluate automatic methods for this task. We found that an unsupervised combination of state-of-the-art lexical and semantic rankers with a ranker that ranks articles by reverse chronological order outperforms those rankers alone. We performed an in-depth quantitative and qualitative analysis to acquire insights into the characteristics of this task. We found that this task has a large vocabulary gap, which highlights the need for semantic matching that takes into account structured knowledge about events. In addition, we found that the temporal aspect is prominent in this task and thus more advanced temporal query and collection characteristics need to be explored. Moreover, we found that this task is more challenging for queries that contain entities that appear more frequently in the underlying news article collection. Last, we found that our dataset construction procedure is sometimes prone to generating queries that are not sufficiently defined, which is a clear direction for future work.

We now reflect on the main question we asked in Chapter 1, namely how to support search engines in leveraging knowledge while accounting for different types of context. In the first part of this thesis (Chapters 2, 3 and 4), we proposed tasks and methods that make structured knowledge more accessible to the user (when the search engine proactively provides context to enrich search results) by retrieving existing or generating novel descriptions of KG facts, and also by contextualizing KG facts with other, related facts. In the second part of this thesis (Chapter 5), we proposed a method for query resolution that improves interactive knowledge gathering in conversational search by adding missing context from the conversation history to the current turn query. In the third part of this thesis (Chapter 6), we proposed and studied the task of retrieving news articles that are relevant to the user's broad query (the query event) and a context that further specifies the query, thereby supporting knowledge exploration for narrative creation.
Future Directions

In this section, we discuss limitations of our study and directions for future work that would overcome those limitations and further expand our work.
Validity of KG fact descriptions
Ensuring the validity of automatically generated KG fact descriptions is crucial when presenting such descriptions to the user [65]. In Chapter 3 we found that generating valid KG fact descriptions is a challenging task. This is a challenge not only for template-based generation methods such as ours but also for neural sequence-to-sequence generation methods [12, 116, 200]. A possible direction towards overcoming this challenge is to learn discrete templates jointly with learning how to generate [195]. Another possible direction is to learn to edit existing descriptions instead of generating descriptions from scratch [72]. Moreover, it would be interesting to assess the ability of recently developed large-scale pretrained language models to generate valid KG fact descriptions [144].
Richness of KG fact descriptions
In Chapter 4, we proposed NFCM, a neural fact contextualization method. Relevant facts retrieved by NFCM can be used to improve KG fact description retrieval by better modeling the relevance of existing descriptions (Chapter 2). In addition, they can be used to select more informative templates in KG fact description generation (Chapter 3).
Source of KG fact descriptions
In Chapters 2, 3 and 4 we used Wikipedia as the source of existing descriptions of KG facts. Using other sources of such descriptions could widen the applicability of our proposed tasks and methods to less popular entities. Huang et al. [81] performed an initial exploration in this direction by using web pages as the source of descriptions, with an application to KG fact description retrieval. Their results showed that using the web as the source of descriptions poses additional challenges that would be interesting to explore further.
Query-dependent KG fact information
Deciding what information about a KG fact to present in a SERP may depend on the user's query [75]. Future work could develop query-dependent methods for all three tasks we considered: KG fact description retrieval and generation, and KG fact contextualization. A study with real search engine users interacting with KG fact information on SERPs could provide further insights into this direction.
Incorporating the system’s response
In Chapter 5, we followed the TREC CAsT 2019 setup and only took into account the previous turn queries (the ones that preceded the current turn query) but not the passages retrieved by the system for those queries (the system's response). In future work, we will evaluate QuReTeC in a more realistic scenario where the passages retrieved for the previous turn queries are also taken into account.
Distant supervision for query resolution
In Chapter 5, we proposed a distant supervision method for reducing the amount of query resolution training data required to train QuReTeC. Our distant supervision method relies on query-passage relevance labels. Future work could address how to combine our distant supervision method with methods that generate relevance labels with weak supervision [49], pseudo-relevance feedback [98] or user signals [88]. Also, we would like to explore noise reduction methods to improve the quality of the distant supervision signal [156].
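One way to derive term-level training labels from query-passage relevance labels, in the spirit of the distant supervision discussed above, is sketched below: a history term receives a positive label if it occurs in a passage that is relevant to the current turn but does not already occur in the current turn query. The exact labeling rule, tokenization and lowercasing are assumptions for illustration and may differ from the procedure used in Chapter 5.

```python
def distant_term_labels(history_terms, current_query, relevant_passage):
    """Label history terms as relevant (True) or not (False) for the current turn."""
    passage_terms = set(relevant_passage.lower().split())
    query_terms = set(current_query.lower().split())
    labels = {}
    for term in history_terms:
        t = term.lower()
        # Positive if the term adds context that the relevant passage confirms.
        labels[term] = t in passage_terms and t not in query_terms
    return labels


# Usage with toy data.
print(distant_term_labels(["Eiffel", "Tower", "about"],
                          "when was it built",
                          "The Eiffel Tower was built between 1887 and 1889."))
```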
Term classification and rewriting for query resolution
In Chapter 5, we formulated query resolution for conversational search as a term classification task. This gave us flexibility not only in terms of modeling but also in terms of where we can get the supervision signal from. In two studies contemporaneous to ours, query resolution was formulated as a sequence generation task [179, 209]. Combining the strengths of both formulations of the query resolution task could result in more powerful models.
Specialized rankers in low-resource settings
In Chapter 5, we focused on query resolution for conversational search and used existing rankers for both the initial retrieval and reranking steps. State-of-the-art neural ranking models rely on large-scale annotated ranking datasets that are not yet available for conversational search [46, 203]. Therefore, future work could develop specialized rankers for conversational search in low-resource settings, possibly by learning to perform query resolution and ranking jointly.
Incorporate structured knowledge about events
In Chapter 6, we found that the vocabulary gap for the retrieval task we studied is large, and that the retrieval methods we considered are not able to effectively account for it. One possible direction for future work is to incorporate structured knowledge about events (and the entities involved in them) in the retrieval methods [67, 154]. Such knowledge includes relationships between entities (which we studied in Chapters 2, 3 and 4) or sub-event relations [10, 68].
Temporal aspect
In Chapter 6, we found that a simple combination of lexical and semantic rankers with a ranker that ranks articles by reverse chronological order improves performance when the relevant article is recent but, as expected, harms performance otherwise. In future work, we aim to incorporate the temporal aspect in the ranking function in a more robust way. A possible way to achieve this is to identify temporal phenomena such as trending terms or entities in the underlying news article collection [90] or in external sources such as social media [44].
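One simple option for a softer temporal signal than a fixed reverse-chronological ranker is sketched below: an exponential decay on article age combined linearly with a relevance score. The decay form, half-life and mixing weight are assumptions made for illustration, not a method proposed in this thesis.

```python
import math
from datetime import datetime


def time_aware_score(relevance, pub_date, query_date, half_life_days=90.0, alpha=0.8):
    """Blend a relevance score with an exponential recency decay on article age."""
    age_days = max((query_date - pub_date).days, 0)
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    return alpha * relevance + (1 - alpha) * recency


# Usage: an article published 90 days before the query date gets recency 0.5.
print(time_aware_score(0.7, datetime(2019, 1, 1), datetime(2019, 4, 1)))
```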
Dataset construction
In Chapter 6, we proposed a dataset construction procedure that can produce reliable datasets for this task. The main limitation of this procedure is that only one article is marked relevant for each query, even though more than one article may be relevant. In future work, we aim to ask experts (journalists) to qualitatively assess the output of different rankers to enrich the automatically constructed datasets with more relevant articles per query [113, 114]. Another limitation of our dataset construction procedure is that in some cases the query (usually the narrative context) misses crucial information for the connection between the complete narrative and the relevant article, information that is contained in the link sentence. In future work we aim to ask experts to manually add such missing information to the narrative context. Since this task can be cumbersome, we will try to semi-automate it by casting it as a prediction task [87].
Specialized rankers per discourse function
Previous work has studied how to categorize parts of existing narratives according to their discourse function with respect to the main event of the narrative [37, 175]. We hypothesize that, in the narrative creation task we studied in Chapter 6, queries that serve a certain discourse function would have relevant news articles with specific characteristics. In future work, we aim to categorize queries into different discourse functions and perform manual analysis to validate this hypothesis. We will then design specialized rankers for each category to improve retrieval effectiveness.

Bibliography

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In
OSDI . USENIX, 2016. (Cited onpage 55.)[2] N. Abdul-jaleel, J. Allan, W. B. Croft, O. Diaz, L. Larkey, X. Li, M. D. Smucker, and C. Wade. Umassat trec 2004: Novelty and hard. In
TREC . NIST, 2004. (Cited on pages 68 and 76.)[3] A. Agarwal, H. Raghavan, K. Subbian, P. Melville, R. D. Lawrence, D. C. Gondek, and J. Fan.Learning to rank for robust question answering. In
CIKM . ACM, 2012. (Cited on page 13.)[4] B. Aleman-Meza, C. Halaschek-Weiner, I. B. Arpinar, C. Ramakrishnan, and A. P. Sheth. Rankingcomplex relationships on the semantic web.
IEEE Internet Computing , 9, 2005. (Cited on page 59.)[5] E. Alfonseca, K. Filippova, J.-Y. Delort, and G. Garrido. Pattern learning for relation extraction with ahierarchical topic model. In
ACL . ACL, 2012. (Cited on page 59.)[6] M. Aliannejadi, H. Zamani, F. Crestani, and W. B. Croft. Asking Clarifying Questions in Open-DomainInformation-Seeking Conversations. In
SIGIR . ACM, 2019. (Cited on page 1.)[7] J. Allan, J. Aslam, N. Belkin, C. Buckley, J. Callan, B. Croft, S. Dumais, N. Fuhr, D. Harman, D. J.Harper, D. Hiemstra, T. Hofmann, E. Hovy, W. Kraaij, J. Lafferty, V. Lavrenko, D. Lewis, L. Liddy,R. Manmatha, A. McCallum, J. Ponte, J. Prager, D. Radev, P. Resnik, S. Robertson, R. Rosenfeld,S. Roukos, M. Sanderson, R. Schwartz, A. Singhal, A. Smeaton, H. Turtle, E. Voorhees, R. Weischedel,J. Xu, and C. Zhai. Challenges in information retrieval and language modeling: report of a workshopheld at the center for intelligent information retrieval, University of Massachusetts Amherst, September2002.
ACM SIGIR Forum , 37(1), 2003. (Cited on page 1.)[8] J. Allan, C. Wade, and A. Bolivar. Retrieval and novelty detection at the sentence level. In
SIGIR .ACM, 2003. (Cited on page 13.)[9] T. Althoff, X. L. Dong, K. Murphy, S. Alai, V. Dang, and W. Zhang. Timemachine: Timeline generationfor knowledge-base entities. In
KDD . ACM, 2015. (Cited on page 28.)[10] J. Araki, L. Mulaffer, A. Pandian, Y. Yamakawa, K. Oflazer, and T. Mitamura. Interoperable Annotationof Events and Event Relations across Domains. In isa-14: The 14th Joint ACL - ISO Workshop onInteroperable Semantic Annotation . ACL, 2018. (Cited on page 113.)[11] R. Baeza-Yates, B. Ribeiro-Neto, et al.
Modern information retrieval . ACM, 1999. (Cited on page 13.)[12] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align andtranslate. In
ICLR , 2015. (Cited on pages 48 and 112.)[13] P. Bailey, N. Craswell, R. W. White, L. Chen, A. Satyanarayana, and S. Tahaghoghi. Evaluatingwhole-page relevance. In
SIGIR . ACM, 2010. (Cited on page 1.)[14] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra,T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang. MS MARCO: A HumanGenerated MAchine Reading COmprehension Dataset. arXiv preprint arXiv:1611.09268 , 2018. (Citedon page 78.)[15] H. Bast, B. Bj¨orn, and E. Haussmann. Semantic search on text and knowledge bases.
FnTIR , 10(2-3),2016. (Cited on page 41.)[16] N. J. Belkin. Anomalous states of knowledge as a basis for information retrieval.
CJIS , 5(1), 1980.(Cited on pages 2 and 67.)[17] N. J. Belkin, C. Cool, A. Stein, and U. Thiel. Cases, scripts, and information-seeking strategies: Onthe design of interactive information retrieval systems.
Expert Systems with Applications , 9(3), 1995.(Cited on page 67.)[18] P. N. Bennett, R. W. White, W. Chu, S. T. Dumais, P. Bailey, F. Borisyuk, and X. Cui. Modeling theimpact of short- and long-term behavior on search personalization. In
SIGIR . ACM, 2012. (Cited onpage 1.)[19] C. Bhagavatula, S. Feldman, R. Power, and W. Ammar. Content-Based Citation Recommendation. In
NAACL-HLT . ACL, 2018. (Cited on page 2.)[20] R. Blanco and H. Zaragoza. Finding support sentences for entities. In
SIGIR . ACM, 2010. (Cited onpages 13, 15, and 16.)[21] R. Blanco, B. B. Cambazoglu, P. Mika, and N. Torzec. Entity recommendations in web search. In
ISWC . Springer, 2013. (Cited on pages 1, 11, 28, and 41.)[22] R. Blanco, G. Ottaviano, and E. Meij. Fast and space-efficient entity linking for queries. In
WSDM .ACM, 2015. (Cited on pages 27 and 41.)[23] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graphdatabase for structuring human knowledge. In
SIGMOD . ACM, 2008. (Cited on pages 30, 36, 43, . Bibliography and 51.)[24] A. Borisov, P. Serdyukov, and M. de Rijke. Using metafeatures to increase the effectiveness of latentsemantic models in web search. In
WWW . ACM, 2016. (Cited on page 49.)[25] H. Bota, K. Zhou, and J. M. Jose. Playing your cards right: The effect of entity cards on searchbehaviour and workload. In
CHIIR . ACM, 2016. (Cited on pages 1 and 41.)[26] L. Breiman. Random forests.
Mach. Learn. , 45(1), 2001. (Cited on page 19.)[27] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In
WWW . Elsevier,1998. (Cited on page 2.)[28] C. J. Burges, K. M. Svore, P. N. Bennett, A. Pastusiak, and Q. Wu. Learning to rank using an ensembleof lambda-gradient models. In
Yahoo! Learning to Rank Challenge , 2011. (Cited on page 13.)[29] B. Carterette, E. Kanoulas, M. Hall, and P. Clough. Overview of the trec 2014 session track. In
TREC .NIST, 2014. (Cited on pages 66 and 67.)[30] B. Carterette, P. Clough, M. Hall, E. Kanoulas, and M. Sanderson. Evaluating retrieval over sessions:The trec session track 2011-2014. In
SIGIR . ACM, 2016. (Cited on page 66.)[31] D. Caswell and K. D¨orr. Automated Journalism 2.0: Event-driven narratives.
Journalism Practice , 12(4), 2018. (Cited on pages 90 and 91.)[32] A. Chaney, H. Wallach, M. Connelly, and D. Blei. Detecting and Characterizing Events. In
EMNLP .ACL, 2016. (Cited on pages 91 and 106.)[33] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In
CIKM . ACM, 2009. (Cited on page 20.)[34] M. Cho, K. Alahari, and J. Ponce. Learning graphs to match. In
ICCV . IEEE, 2013. (Cited on page 59.)[35] E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y. Choi, P. Liang, and L. Zettlemoyer. QuAC:Question Answering in Context. In
EMNLP . ACL, 2018. (Cited on pages 65, 67, and 75.)[36] P. K. Choubey, K. Raju, and R. Huang. Identifying the Most Dominant Event in a News Article byMining Event Coreference Relations. In
NAACL-HLT . ACL, 2018. (Cited on page 106.)[37] P. K. Choubey, A. Lee, R. Huang, and L. Wang. Discourse as a Function of Event: Profiling DiscourseStructure in News Articles around the Main Event. In
ACL . ACL, 2020. (Cited on pages 89, 91, 106,and 114.)[38] A. Chuklin, I. Markov, and M. de Rijke.
Click Models for Web Search . Morgan & Claypool Publishers,August 2015. (Cited on page 1.)[39] E. Clark, A. S. Ross, C. Tan, Y. Ji, and N. A. Smith. Creative Writing with a Machine in the Loop:Case Studies on Slogans and Stories. In
IUI . ACM, 2018. (Cited on pages 2 and 4.)[40] J. Cohen. Multiple regression as a general data-analytic system.
Psychological Bulletin , 70(6, Pt.1),1968. (Cited on page 94.)[41] G. V. Cormack, C. L. A. Clarke, and S. Buettcher. Reciprocal rank fusion outperforms condorcet andindividual rank learning methods. In
SIGIR . ACM, 2009. (Cited on pages 70, 78, and 95.)[42] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learningto construct knowledge bases from the world wide web.
Artificial Intelligence , 118(1–2), 2000. (Citedon page 11.)[43] W. B. Croft and R. H. Thompson. I3r: A new approach to the design of document retrieval systems.
JASIST , 38(6), 1987. (Cited on pages 2 and 67.)[44] A. Cucchiarelli, C. Morbidoni, G. Stilo, and P. Velardi. A topic recommender for journalists.
IRJ , 22(1), 2019. (Cited on pages 2, 4, 89, 104, and 114.)[45] J. S. Culpepper, F. Diaz, and M. D. Smucker. Research Frontiers in Information Retrieval: Reportfrom the Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018).
SIGIR Forum ,52(1), 2018. (Cited on pages 1, 4, and 65.)[46] J. Dalton, C. Xiong, and J. Callan. Cast 2019: The conversational assistance track overview. In
TREC2019 . NIST, 2019. (Cited on pages 1, 4, 65, 67, 73, 75, 77, and 113.)[47] N. Dalvi, R. Kumar, B. Pang, R. Ramakrishnan, A. Tomkins, P. Bohannon, S. Keerthi, and S. Merugu.A web of concepts. In
ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems .ACM, 2009. (Cited on page 1.)[48] R. Das, A. Neelakantan, D. Belanger, and A. McCallum. Chains of reasoning over entities, relations,and text using recurrent neural networks. In
EACL . ACL, 2017. (Cited on pages 47 and 48.)[49] M. Dehghani, H. Zamani, A. Severyn, J. Kamps, and W. B. Croft. Neural ranking models with weaksupervision. In
SIGIR . ACM, 2017. (Cited on pages 46, 73, 81, 85, and 113.)[50] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep BidirectionalTransformers for Language Understanding. In
NAACL-HLT . ACL, 2019. (Cited on pages 66, 70, 71,72, and 95.)
[51] N. Diakopoulos.
Automating the News: How Algorithms Are Rewriting the Media . Harvard UniversityPress, 2019. (Cited on pages 2, 4, and 89.)[52] N. Diakopoulos, M. De Choudhury, and M. Naaman. Finding and assessing social media informationsources in the context of journalism. In
CHI . ACM, 2012. (Cited on pages 4, 89, and 104.)[53] L. Dietz, M. Verma, F. Radlinski, and N. Craswell. Trec complex answer retrieval overview. In
TREC .NIST, 2017. (Cited on page 75.)[54] A. Doko, M. ˇStula, and D. Stipaniˇcev. A recursive TF-ISF based sentence retrieval method with localcontext.
IJMLC , 3(2), 2013. (Cited on pages 13, 17, and 18.)[55] A. Dong, Y. Chang, Z. Zheng, G. Mishne, J. Bai, R. Zhang, K. Buchner, C. Liao, and F. Diaz. Towardsrecency ranking in web search. In
WSDM . ACM, 2010. (Cited on page 103.)[56] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang.Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In
KDD . ACM, 2014.(Cited on page 11.)[57] T. Ebesu and Y. Fang. Neural Citation Network for Context-Aware Citation Recommendation. In
SIGIR . ACM, 2017. (Cited on page 104.)[58] A. Elgohary, D. Peskov, and J. Boyd-Graber. Can You Unpack That? Learning to Rewrite Questions-in-Context. In
EMNLP . ACL, 2019. (Cited on pages 66, 67, 71, 73, 75, 77, and 78.)[59] M. Fadaee, A. Bisazza, and C. Monz. Data Augmentation for Low-Resource Neural MachineTranslation. In
ACL . ACL, 2017. (Cited on page 71.)[60] J. Fan, R. Hoffman, A. Kalyanpur, S. Riedel, F. Suchanek, and P. P. Talukdar.
AKBC-WEKEX: theJoint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction .ACL, 2012. (Cited on page 11.)[61] L. Fang, A. D. Sarma, C. Yu, and P. Bohannon. REX: explaining relationships between entity pairs.
VLDB Endowment , 5(3), 2011. (Cited on pages 13, 27, and 58.)[62] R. T. Fern´andez, D. E. Losada, and L. Azzopardi. Extending the language modeling framework forsentence retrieval to include local context.
Information Retrieval , 14(4), 2011. (Cited on pages 13and 20.)[63] K. Ganesan, C. Zhai, and J. Han. Opinosis: a graph-based approach to abstractive summarization ofhighly redundant opinions. In
COLING . ACL, 2010. (Cited on page 33.)[64] J. Gao, M. Galley, and L. Li. Neural Approaches to Conversational AI. In
ACL . ACL, 2018. (Cited onpages 1, 4, and 65.)[65] A. Gatt and E. Krahmer. Survey of the State of the Art in Natural Language Generation: Core tasks,applications and evaluation.
JAIR , 61, 2018. (Cited on page 112.)[66] D. Gkatzia, O. Lemon, and V. Rieser. Natural language generation enhances human decision-makingwith uncertain information. In
ACL , 2016. (Cited on pages 3 and 27.)[67] S. Gottschalk and E. Demidova. EventKG: A Multilingual Event-Centric Temporal Knowledge Graph.In A. Gangemi, R. Navigli, M.-E. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, and M. Alam,editors,
ESWC 2018 . Springer, 2018. (Cited on pages 99, 106, and 113.)[68] S. Gottschalk and E. Demidova. HapPenIng: Happen, Predict, Infer—Event Series Completion in aKnowledge Graph. In
ISWC . Springer, 2019. (Cited on page 113.)[69] D. Guan, H. Yang, and N. Goharian. Effective structured query formulation for session search. In
TREC . NIST, 2012. (Cited on pages 66, 67, and 77.)[70] J. Guo, G. Xu, X. Cheng, and H. Li. Named entity recognition in query. In
SIGIR . ACM, 2009. (Citedon page 1.)[71] K. Guu, J. Miller, and P. Liang. Traversing knowledge graphs in vector space. In
EMNLP . ACL, 2015.(Cited on pages 47 and 48.)[72] K. Guu, T. B. Hashimoto, Y. Oren, and P. Liang. Generating Sentences by Editing Prototypes.
TACL ,6, 2018. (Cited on page 112.)[73] D. Hacker and N. Sommers.
A Writer’s Reference with Writing in the Disciplines . Macmillan, 2011.(Cited on page 92.)[74] M. Hagen, M. Potthast, M. V¨olske, J. Gomoll, and B. Stein. How Writers Search: Analyzing theSearch and Writing Logs of Non-fictional Essays. In
CHIIR . ACM Press, 2016. (Cited on page 2.)[75] F. Hasibi, K. Balog, and S. E. Bratsberg. Dynamic factual summaries for entity cards. In
SIGIR . ACM,2017. (Cited on pages 1, 41, 59, and 112.)[76] Q. He, J. Pei, D. Kifer, P. Mitra, and L. Giles. Context-aware citation recommendation. In
WWW .ACM, 2010. (Cited on pages 2 and 104.)[77] L. Hirschman and R. Gaizauskas. Natural language question answering: the view from here.
NLE , 7(04), 2001. (Cited on page 13.) . Bibliography [78] N. H¨ochst¨otter and D. Lewandowski. What Users See - Structures in Search Engine Results Pages.
Inf.Sci. , 2009. (Cited on page 1.)[79] K. Hofmann, S. Whiteson, and M. de Rijke. Balancing exploration and exploitation in listwise andpairwise online learning to rank for information retrieval.
Information Retrieval , 16(1), 2013. ISSN1573-7659. (Cited on page 1.)[80] K. Hofmann, L. Li, and F. Radlinski. Online Evaluation for Information Retrieval.
FnTIR , 10(1), 2016.(Cited on page 1.)[81] J. Huang, W. Zhang, S. Zhao, S. Ding, and H. Wang. Learning to explain entity relationships bypairwise ranking with convolutional neural networks. In
IJCAI . AAAI, 2017. (Cited on pages 58and 112.)[82] W. Huang, Z. Wu, C. Liang, P. Mitra, and C. L. Giles. A neural probabilistic model for context basedcitation recommendation. In
AAAI . AAAI, 2015. (Cited on page 104.)[83] B. Huurnink, L. Hollink, W. v. d. Heuvel, and M. d. Rijke. Search behavior of media professionals atan audiovisual archive: A transaction log analysis.
JASIST , 61(6), 2010. ISSN 1532-2890. (Cited onpages 4 and 89.)[84] R. Jagerman, H. Oosterhuis, and M. de Rijke. To Model or to Intervene: A Comparison of Coun-terfactual and Online Learning to Rank from User Interactions. In
SIGIR . ACM, 2019. (Cited onpage 1.)[85] K. J¨arvelin and J. Kek¨al¨ainen. Cumulated gain-based evaluation of IR techniques.
TOIS , 20(4), 2002.(Cited on page 20.)[86] C. Jeong, S. Jang, H. Shin, E. Park, and S. Choi. A Context-Aware Citation Recommendation Modelwith BERT and Graph Convolutional Networks. arXiv:1903.06464 [cs] , 2019. arXiv: 1903.06464.(Cited on page 104.)[87] R. Jha, A.-A. Jbara, V. Qazvinian, and D. R. Radev. NLP-driven citation analysis for scientometrics.
NLE , 23(1), 2017. (Cited on pages 104 and 114.)[88] T. Joachims. Optimizing search engines using clickthrough data. In
KDD . ACM, 2002. (Cited onpages 1, 66, 73, and 113.)[89] M. Joshi, O. Levy, D. S. Weld, and L. Zettlemoyer. BERT for Coreference Resolution: Baselines andAnalysis. In
EMNLP-IJCNLP . ACL, 2019. (Cited on page 71.)[90] N. Kanhabua, R. Blanco, and K. Nørv˚ag. Temporal Information Retrieval.
FnTIR , 9(2), 2015. (Citedon pages 103, 106, and 114.)[91] Y. Kim, J. Seo, and W. B. Croft. Automatic boolean query suggestion for professional search. In
SIGIR . ACM, 2011. (Cited on pages 2 and 96.)[92] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In
ICLR , 2014. (Cited onpage 46.)[93] K. Kirkpatrick. Putting the Data Science into Journalism.
Commun. ACM , 58(5), 2015. (Cited onpages 2, 4, and 89.)[94] I. Konstas and M. Lapata. A global model for concept-to-text generation.
JAIR , 48, 2013. (Cited onpage 36.)[95] A. M. Krasakis, M. Aliannejadi, N. Voskarides, and E. Kanoulas. Analysing the effect of clarifyingquestions on document ranking in conversational search. In
ICTIR . ACM, 2020. (Cited on page 8.)[96] V. Kumar and S. Joshi. Incomplete Follow-up Question Resolution Using Retrieval Based Sequenceto Sequence Learning. In
SIGIR . ACM, 2017. (Cited on pages 66, 67, and 71.)[97] A. Lavie and A. Agarwal. METEOR: An automatic metric for MT evaluation with high levels ofcorrelation with human judgments. In
WMT . ACL, 2007. (Cited on page 36.)[98] V. Lavrenko and W. B. Croft. Relevance based language models. In
SIGIR . ACM, 2001. (Cited onpages 68 and 113.)[99] R. Lebret, D. Grangier, and M. Auli. Neural text generation from structured data with application tothe biography domain. In
EMNLP . ACL, 2016. (Cited on pages 29 and 41.)[100] G. G. Lee, J. Seo, S. Lee, H. Jung, B.-H. Cho, C. Lee, B.-K. Kwak, J. Cha, D. Kim, J. An, et al. SiteQ:Engineering high performance QA system using lexico-semantic pattern matching and shallow NLP.In
TREC . NIST, 2001. (Cited on page 16.)[101] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky. Stanford’s multi-passsieve coreference resolution system at the CoNLL-2011 shared task. In
CoNLL . ACL, 2011. (Cited onpage 14.)[102] C. Li, M. Bendersky, V. Garg, and S. Ravi. Related Event Discovery. In
WSDM . ACM, 2017. (Citedon pages 104 and 106.)[103] H. Li and J. Xu.
Semantic Matching in Search . Now Publishers, 2014. (Cited on page 98.)
NAACL-HLT . ACL, 2016. (Cited on page 65.)[105] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In
Text summarization branchesout . ACL, 2004. (Cited on page 36.)[106] T. Lin, P. Pantel, M. Gamon, A. Kannan, and A. Fuxman. Active objects: Actions for entity-centricsearch. In
WWW . ACM, 2012. (Cited on pages 1, 27, and 28.)[107] Y. Lin, Z. Liu, and M. Sun. Knowledge representation learning with entities, attributes and relations.In
IJCAI . AAAI, 2016. (Cited on page 35.)[108] Y. Lin, Y. C. Tan, and R. Frank. Open sesame: Getting inside bert’s linguistic knowledge. arXivpreprint arXiv:1906.01698 , 2019. (Cited on page 71.)[109] X. Ling and D. S. Weld. Fine-grained entity recognition. In
AAAI . AAAI Press, 2012. (Cited onpage 59.)[110] D. E. Losada. A study of statistical query expansion strategies for sentence retrieval. In
SIGIR . ACM,2008. (Cited on page 13.)[111] S. MacAvaney. OpenNIR: A Complete Neural Ad-Hoc Ranking Pipeline. In
Proceedings of the 13thInternational Conference on Web Search and Data Mining . ACM, 2020. (Cited on pages 95 and 96.)[112] S. MacAvaney, A. Yates, A. Cohan, and N. Goharian. CEDR: Contextualized Embeddings forDocument Ranking. In
SIGIR . ACM, 2019. (Cited on pages 70 and 78.)[113] A. MacLaughlin, T. Chen, B. K. Ayan, and D. Roth. Context-Based Quotation Recommendation. In
ICWSM . ACM, 2021. (Cited on pages 4, 89, 95, 104, 107, and 114.)[114] N. Maiden and K. Zachos. INJECT: Algorithms to Discover Creative Angles on News. 2020. MeetingName: Computation + Journalism 2020. (Cited on pages 89, 104, 107, and 114.)[115] C. D. Manning, P. Raghavan, and H. Sch¨utze.
Introduction to information retrieval , volume 1.Cambridge university press, 2008. (Cited on pages 1 and 103.)[116] J. Maynez, S. Narayan, B. Bohnet, and R. McDonald. On Faithfulness and Factuality in AbstractiveSummarization. In
ACL . ACL, 2020. (Cited on page 112.)[117] O. A. McBryan. GENVL and WWWW: Tools for Taming the Web. In
WWW , 1994. (Cited on page 2.)[118] E. Meij, W. Weerkamp, and M. de Rijke. Adding semantics to microblog posts. In
WSDM . ACM,2012. (Cited on pages 15 and 31.)[119] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of wordsand phrases and their compositionality. In
Advances in Neural Information Processing Systems , 2013.(Cited on page 17.)[120] I. Miliaraki, R. Blanco, and M. Lalmas. From ”selena gomez” to ”marlon brando”: Understandingexplorative entity search. In
WWW . ACM, 2015. (Cited on pages 1 and 41.)[121] D. Milne and I. H. Witten. Learning to link with Wikipedia. In
CIKM . ACM, 2008. (Cited on page 15.)[122] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction withoutlabeled data. In
ACL . ACL, 2009. (Cited on pages 16, 51, 52, 59, and 68.)[123] M. Mitra, A. Singhal, and C. Buckley. Improving automatic query expansion. In
SIGIR . ACM, 1998.(Cited on page 66.)[124] A. Mohan, Z. Chen, and K. Q. Weinberger. Web-search ranking with initialized gradient boostedregression trees. In
Yahoo! Learning to Rank Challenge , 2011. (Cited on page 2.)[125] V. Murdock and W. B. Croft. A translation model for sentence retrieval. In
EMNLP . ACL, 2005.(Cited on page 17.)[126] V. G. Murdock.
Aspects of Sentence Retrieval . PhD thesis, University of Massachusetts Amherst,2006. (Cited on page 13.)[127] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng. MS MARCO: Ahuman generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 , 2016.(Cited on page 75.)[128] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning forknowledge graphs: From multi-relational link prediction to automated knowledge graph construction.
Proc. of the IEEE , 104(1), 2016. (Cited on pages 1, 30, and 43.)[129] V. Niculae, C. Suen, J. Zhang, C. Danescu-Niculescu-Mizil, and J. Leskovec. QUOTUS: The Structureof Political Media Coverage as Revealed by Quoting Patterns. In
WWW . ACM, 2015. (Cited onpages 93 and 99.)[130] R. Nogueira and K. Cho. Task-Oriented Query Reformulation with Reinforcement Learning. In
EMNLP . ACL, 2017. (Cited on pages 68 and 76.)[131] R. N. Oddy. Information retrieval through man-machine dialogue.
Journal of Documentation , 33(1),1977. (Cited on page 67.) . Bibliography [132] K. D. Onal, Y. Zhang, I. S. Altingovde, M. M. Rahman, P. Karagoz, A. Braylan, B. Dang, H.-L. Chang,H. Kim, Q. McNamara, A. Angert, E. Banner, V. Khetan, T. McDonnell, A. T. Nguyen, D. Xu, B. C.Wallace, M. de Rijke, and M. Lease. Neural information retrieval: At the end of the early years.
IRJ ,21(2–3), 2018. (Cited on page 70.)[133] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machinetranslation. In
ACL . ACL, 2002. (Cited on page 36.)[134] T. Pellissier Tanon, D. Vrandeˇci´c, S. Schaffert, T. Steiner, and L. Pintscher. From freebase to wikidata:The great migration. In
WWW . ACM, 2016. (Cited on page 44.)[135] B. Peng, X. Li, J. Gao, J. Liu, and K.-F. Wong. Deep Dyna-Q: Integrating Planning for Task-CompletionDialogue Policy Learning. In
ACL . ACL, 2018. (Cited on page 65.)[136] H. Peng, J. Liu, and C.-Y. Lin. News Citation Recommendation with Implicit and Explicit Semantics.In
ACL . ACL, 2016. (Cited on page 96.)[137] F. Petroni, P. Lewis, A. Piktus, T. Rockt¨aschel, Y. Wu, A. H. Miller, and S. Riedel. How context affectslanguage models’ factual predictions. In
AKBC , 2020. (Cited on page 106.)[138] D. Pighin, M. Cornolti, E. Alfonseca, and K. Filippova. Modelling events through memory-based,open-ie patterns for abstractive summarization. In
ACL . ACL, 2014. (Cited on page 28.)[139] G. Pirr`o. Explaining and suggesting relatedness in knowledge graphs. In
ISWC , 2015. (Cited onpages 49, 50, 54, and 58.)[140] O. Popescu and C. Strapparava, editors.
Proceedings of the 2017 EMNLP Workshop: Natural LanguageProcessing meets Journalism . ACL, 2017. (Cited on page 4.)[141] Y. Qiao, C. Xiong, Z. Liu, and Z. Liu. Understanding the Behaviors of BERT in Ranking. arXivpreprint arXiv:1904.07531 , 2019. (Cited on pages 70 and 95.)[142] C. Qu, L. Yang, M. Qiu, Y. Zhang, C. Chen, W. B. Croft, and M. Iyyer. Attentive History Selection forConversational Question Answering. In
CIKM . ACM, 2019. (Cited on page 65.)[143] F. Radlinski and N. Craswell. A theoretical framework for conversational search. In
CHIIR . ACM,2017. (Cited on page 67.)[144] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu.Exploring the limits of transfer learning with a unified text-to-text transformer.
JMLR , 21(140), 2020.(Cited on page 112.)[145] D. Raghu, S. Indurthi, J. Ajmera, and S. Joshi. A statistical approach for Non-Sentential UtteranceResolution for Interactive QA System. In
SIGDIAL . ACL, 2015. (Cited on pages 66 and 67.)[146] S. Reddy, D. Chen, and C. D. Manning. CoQA: A Conversational Question Answering Challenge.
TACL , 7, 2019. (Cited on page 65.)[147] R. Reinanda, E. Meij, and M. de Rijke. Knowledge graphs: An information retrieval perspective.
FnTIR , 2020. (Cited on page 1.)[148] E. Reiter, R. Dale, and Z. Feng.
Building Natural Language Generation Systems . MIT Press, 2000.(Cited on page 29.)[149] P. Ren, Z. Chen, C. Monz, J. Ma, and M. de Rijke. Thinking globally, acting locally: Distantlysupervised global-to-local knowledge selection for background based conversation. In
AAAI , 2020.(Cited on page 68.)[150] X. Ren, A. El-Kishky, C. Wang, F. Tao, C. R. Voss, and J. Han. Clustype: Effective entity recognitionand typing by relation phrase-based clustering. In
KDD . ACM, 2015. (Cited on page 59.)[151] S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. In
ECML-PKDD . Springer-Verlag, 2010. (Cited on page 59.)[152] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named Entity Recognition in Tweets: An ExperimentalStudy. In
EMNLP . ACL, 2011. (Cited on page 68.)[153] S. Robertson and H. Zaragoza. The Probabilistic Relevance Framework: BM25 and Beyond.
FnTIR , 3(4), 2009. ISSN 1554-0669, 1554-0677. (Cited on page 95.)[154] M. Rospocher, M. van Erp, P. Vossen, A. Fokkens, I. Aldabe, G. Rigau, A. Soroa, T. Ploeger, andT. Bogaard. Building event-centric knowledge graphs from news.
JoWS , 37-38, 2016. (Cited onpages 99, 103, and 113.)[155] C. Rosset, C. Xiong, X. Song, D. Campos, N. Craswell, S. Tiwary, and P. Bennett. Leading Conversa-tional Search by Suggesting Useful Questions. In
WWW . ACM, 2020. (Cited on page 1.)[156] B. Roth, T. Barth, M. Wiegand, and D. Klakow. A survey of noise reduction methods for distantsupervision. In
AKBC . ACM, 2013. (Cited on page 113.)[157] T. Russell-Rose, J. Chamberlain, and L. Azzopardi. Information retrieval in the workplace: a compari-son of professional search practices.
IPM , 54(6), 2018. (Cited on page 2.)[158] T. Saier and M. F¨arber. Semantic Modelling of Citation Contexts for Context-Aware Citation Recom- endation. In J. M. Jose, E. Yilmaz, J. Magalh˜aes, P. Castells, N. Ferro, M. J. Silva, and F. Martins,editors,
ECIR . Springer, 2020. (Cited on page 104.)[159] G. Saldanha, O. Biran, K. McKeown, and A. Gliozzo. An entity-focused approach to generatingcompany descriptions. In
ACL . ACL, 2016. (Cited on pages 29 and 36.)[160] F. Sarvi, N. Voskarides, L. Mooiman, S. Schelter, and M. de Rijke. A comparison of supervised learningto match methods for product search. In eCOM 2020: The 2020 SIGIR Workshop on eCommerce .ACM, 2020. (Cited on page 8.)[161] S. Sauer. Audiovisual Narrative Creation and Creative Retrieval: How Searching for a Story Shapesthe Story.
JSTA , 9(2), 2017. (Cited on page 2.)[162] A. See, P. J. Liu, and C. D. Manning. Get To The Point: Summarization with Pointer-GeneratorNetworks. In
ACL . ACL, 2017. (Cited on page 77.)[163] V. Setty and K. Hose. Event2Vec: Neural Embeddings for News Events. In
SIGIR . ACM, 2018. (Citedon page 106.)[164] S. Seufert, K. Berberich, S. J. Bedathur, S. K. Kondreddi, P. Ernst, and G. Weikum. Espresso:Explaining relationships between entity sets. In
CIKM . ACM, 2016. (Cited on page 58.)[165] X. Shen, B. Tan, and C. Zhai. Context-sensitive information retrieval using implicit feedback. In
SIGIR . ACM, 2005. (Cited on page 1.)[166] G. Sidiropoulos, N. Voskarides, and E. Kanoulas. Knowledge graph simple question answering forunseen domains. In
AKBC , 2020. (Cited on page 8.)[167] M. Sloan, H. Yang, and J. Wang. A term-based methodology for query reformulation understanding.
Information Retrieval , 18(2), 2015. (Cited on page 68.)[168] C. L. Smith and S. Y. Rieh. Knowledge-Context in Search Systems: Toward Information-LiterateActions. In
CHIIR . ACM, 2019. (Cited on page 1.)[169] I. Soboroff, S. Huang, and D. Harman. Trec 2018 news track overview. In
TREC . NIST, 2018. (Citedon page 92.)[170] D. Sorokin and I. Gurevych. Context-aware representations for knowledge base relation extraction. In
EMNLP . ACL, 2017. (Cited on page 52.)[171] M. Surdeanu, M. Ciaramita, and H. Zaragoza. Learning to rank answers to non-factoid questions fromweb collections.
Computational Linguistics , 37(2), 2011. (Cited on page 13.)[172] M. Surdeanu, J. Tibshirani, R. Nallapati, and C. D. Manning. Multi-instance multi-label learning forrelation extraction. In
EMNLP-CoNLL . ACL, 2012. (Cited on page 59.)[173] R. Sylvester and W.-l. Greenidge. Digital Storytelling: Extending the Potential for Struggling Writers.
The Reading Teacher , 63(4), 2009. (Cited on page 2.)[174] S. Teufel, J. Carletta, and M. Moens. An annotation scheme for discourse-level argumentation inresearch articles. In
EACL . ACL, 1999. (Cited on page 106.)[175] S. Teufel, A. Siddharthan, and D. Tidhar. Automatic classification of citation function. In
EMNLP .ACL, 2006. (Cited on pages 106 and 114.)[176] A. Tombros and M. Sanderson. Advantages of query biased summaries in information retrieval. In
SIGIR . ACM, 1998. (Cited on pages 27 and 28.)[177] M. Tsagkias, M. de Rijke, and W. Weerkamp. Linking online news and social media. In
WSDM 2011 .ACM, 2011. (Cited on page 15.)[178] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams. Fast generation of result snippets in websearch. In
SIGIR . ACM, 2007. (Cited on page 28.)[179] S. Vakulenko, S. Longpre, Z. Tu, and R. Anantha. Question Rewriting for Conversational QuestionAnswering. arXiv:2004.14652 [cs] , 2020. arXiv: 2004.14652. (Cited on page 113.)[180] T. A. van Dijk.
News as discourse . University of Groningen, 1988. (Cited on pages 89 and 91.)[181] C. Van Gysel, E. Kanoulas, and M. de Rijke. Lexical query modeling in session search. In
ICTIR .ACM, 2016. (Cited on pages 66, 68, and 78.)[182] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin.Attention is All you Need. In
NIPS . 2017. (Cited on pages 66 and 71.)[183] S. Verberne, J. He, U. Kruschwitz, G. Wiggers, B. Larsen, T. Russell-Rose, and A. P. de Vries. FirstInternational Workshop on Professional Search.
SIGIR Forum , 52(2), 2019. (Cited on page 2.)[184] N. Voskarides, D. Odijk, M. Tsagkias, W. Weerkamp, and M. de Rijke. Query-dependent contextual-ization of streaming data. In
ECIR . Springer, 2014. (Cited on page 8.)[185] N. Voskarides, E. Meij, M. Tsagkias, M. de Rijke, and W. Weerkamp. Learning to explain entityrelationships in knowledge graphs. In
ACL-IJCNLP . ACL, 2015. (Cited on pages 7, 11, 28, 31,and 58.)[186] N. Voskarides, E. Meij, and M. de Rijke. Generating descriptions of entity relationships. In
ECIR . . Bibliography Springer, 2017. (Cited on pages 7, 27, 42, 52, and 58.)[187] N. Voskarides, E. Meij, R. Reinanda, A. Khaitan, M. Osborne, G. Stefanoni, K. Prabhanjan, andM. de Rijke. Weakly-supervised contextualization of knowledge graph facts. In
SIGIR . ACM, 2018.(Cited on pages 7, 41, 68, and 81.)[188] N. Voskarides, D. Li, A. Panteli, and P. Ren. ILPS at TREC 2019 Conversational Assistant Track.TREC, NIST, 2019. (Cited on page 8.)[189] N. Voskarides, D. Li, P. Ren, E. Kanoulas, and M. de Rijke. Query resolution for conversational searchwith limited supervision. In
SIGIR . ACM, 2020. (Cited on pages 8 and 65.)[190] N. Voskarides, E. Meij, S. Sauer, and M. de Rijke. News article retrieval in context for event-centricnarrative creation. In
Under submission , 2020. (Cited on pages 8 and 89.)[191] A. Vtyurina, D. Savenkov, E. Agichtein, and C. L. A. Clarke. Exploring conversational search withhumans, assistants, and wizards. In
CHI . ACM, 2017. (Cited on page 67.)[192] L. Wang, J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In
SIGIR .ACM, 2011. (Cited on pages 69 and 94.)[193] J. Weston, A. Bordes, O. Yakhnenko, and N. Usunier. Connecting language and knowledge bases withembedding models for relation extraction. In
EMNLP . ACL, 2013. (Cited on page 11.)[194] R. W. White, P. Bailey, and L. Chen. Predicting user interests from contextual information. In
SIGIR .ACM, 2009. (Cited on page 1.)[195] S. Wiseman, S. Shieber, and A. Rush. Learning Neural Templates for Text Generation. In
EMNLP .ACL, 2018. (Cited on page 112.)[196] I. Witten and D. Milne. An effective, low-cost measure of semantic relatedness obtained from wikipedialinks. In
AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy . AAAI, 2008.(Cited on pages 16 and 17.)[197] F. Wu and D. S. Weld. Open information extraction using wikipedia. In
ACL . ACL, 2010. (Cited onpage 14.)[198] F. Wu, Y. Qiao, J.-H. Chen, C. Wu, T. Qi, J. Lian, D. Liu, X. Xie, J. Gao, W. Wu, and M. Zhou. MIND:A Large-scale Dataset for News Recommendation. In
ACL . ACL, 2020. (Cited on page 95.)[199] Q. Wu, C. J. Burges, K. M. Svore, and J. Gao. Ranking, boosting, and model adaptation. Technicalreport, Microsoft Research, 2008. (Cited on page 36.)[200] X. Xu, O. Duˇsek, J. Li, V. Rieser, and I. Konstas. Fact-based Content Weighting for EvaluatingAbstractive Summarisation. In
ACL . ACL, 2020. (Cited on page 112.)[201] H. Yang, D. Guan, and S. Zhang. The query change model: Modeling session search as a markovdecision process.
TOIS , 33(4), 2015. (Cited on pages 66, 68, and 77.)[202] P. Yang, H. Fang, and J. Lin. Anserini: Enabling the Use of Lucene for Information Retrieval Research.In
SIGIR . ACM, 2017. (Cited on pages 95 and 96.)[203] W. Yang, H. Zhang, and J. Lin. Simple applications of BERT for ad hoc document retrieval. arXivpreprint arXiv:1903.10972 , 2019. (Cited on pages 70, 71, 95, and 113.)[204] Y. Yang, W. Chen, Z. Li, Z. He, and M. Zhang. Distantly Supervised NER with Partial AnnotationLearning and Reinforcement Learning. In
COLING . ACL, 2018. (Cited on page 68.)[205] X. Yao, B. Van Durme, and P. Clark. Automatic coupling of answer extraction and informationretrieval. In
ACL . ACL, 2013. (Cited on page 21.)[206] W. V. Yarlott, C. Cornelio, T. Gao, and M. Finlayson. Identifying the Discourse Function of NewsArticle Paragraphs. In
Workshop on Events and Stories in the News 2018 . ACL, 2018. (Cited onpage 89.)[207] M. Yatskar. A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC. In
NAACL-HLT . ACL, 2019.(Cited on page 66.)[208] W.-t. Yih, M.-W. Chang, X. He, and J. Gao. Semantic parsing via staged query graph generation:Question answering with knowledge base. In
ACL . ACL, 2015. (Cited on pages 1, 30, and 41.)[209] S. Yu, J. Liu, J. Yang, C. Xiong, P. Bennett, J. Gao, and Z. Liu. Few-Shot Generative ConversationalQuery Rewriting. In
SIGIR . ACM, 2020. (Cited on page 113.)[210] H. Zamani, S. Dumais, N. Craswell, P. Bennett, and G. Lueck. Generating Clarifying Questions forInformation Retrieval. In
WWW . ACM, 2020. (Cited on page 1.)[211] H.-J. Zeng, Q.-C. He, Z. Chen, W.-Y. Ma, and J. Ma. Learning to cluster web search results. In
SIGIR .ACM, 2004. (Cited on page 2.)[212] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hocinformation retrieval. In
SIGIR . ACM, 2001. (Cited on page 69.)[213] A. Zubiaga, H. Ji, and K. Knight. Curating and contextualizing Twitter stories to assist with socialnewsgathering. In
IUI . ACM, 2013. (Cited on pages 4, 89, and 104.)

Summary
Search engines leverage knowledge to improve information access. Such knowledgecomes in different forms: unstructured knowledge (e.g., textual documents) and struc-tured knowledge (e.g., relationships between real-world objects and topics). In order toeffectively leverage knowledge, search engines should account for context, i.e., addi-tional information about the user and the query. In this thesis, we aim to support searchengines in leveraging knowledge while accounting for different types of context.In the first part of this thesis, we study how to make structured knowledge moreaccessible to the user when the search engine proactively provides such knowledgeas context to enrich search results. We focus on knowledge graphs (KGs), whichstore world knowledge in the form of facts, i.e., relationships between entities (e.g.,persons, locations, organizations). Since KG facts are stored in a formal form, theyare not suitable for presentation to end users. As a first task, we study how to retrievenatural language descriptions of KG facts from a text corpus. We propose a methodthat successfully extracts and then ranks descriptions of KG facts. The method breaksdown when a description for a certain KG fact does not exist. This leads us to oursecond task, where we study how to automatically generate KG fact descriptions. Wepropose a method that first creates sentence templates and then fills them with relevantinformation from the KG. KG fact descriptions often contain mentions to other relatedfacts that can increase the understanding of the fact as a whole. As a third task, we studyhow to contextualize KG facts, that is, automatically find facts related to a query fact.We propose a method that enumerates KG facts in the neighborhood of the query factand then ranks them with respect to their relevance to the query fact.In the second part of this thesis, we move to a different research theme and study howto improve interactive knowledge gathering. We focus on conversational search, wherethe user interacts with the search engine to gather knowledge over large unstructuredknowledge repositories. Here, the search engine should account for context that stemsfrom interactions between the user and the search engine in a conversational searchsession. We focus on multi-turn passage retrieval as an instance of conversational search.A prominent challenge is that the current turn query may be underspecified. Thus, weneed to perform query resolution, that is, add missing context from the conversationhistory to the current turn. We propose to model query resolution as a term classificationtask and propose a method to address it.In the third and final part of this thesis, we focus on a specific type of search engineusers, professional writers in the news domain. We study how to support such writerscreate event-narratives by exploring knowledge from a corpus of news articles. Wefocus on a scenario where the writer has already generated an incomplete narrative thatconsists of a main event and a context, and aims to retrieve news articles that discussrelevant events from the past. We formally define the task of news article retrieval incontext for event-centric narrative creation. We propose a retrieval dataset constructionprocedure for this task that relies on existing news articles to simulate incompletenarratives and relevant articles. 
In the third and final part of this thesis, we focus on a specific type of search engine users: professional writers in the news domain. We study how to support such writers in creating event-centric narratives by exploring knowledge from a corpus of news articles. We focus on a scenario where the writer has already generated an incomplete narrative that consists of a main event and a context, and aims to retrieve news articles that discuss relevant events from the past. We formally define the task of news article retrieval in context for event-centric narrative creation. We propose a retrieval dataset construction procedure for this task that relies on existing news articles to simulate incomplete narratives and relevant articles. We study the performance of multiple rankers, lexical and semantic, and perform an in-depth quantitative and qualitative analysis to acquire insights into the characteristics of this task.
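As an illustration of how existing news articles can be used to simulate retrieval data for this task, the sketch below treats an article's headline and lead paragraph as the incomplete narrative and the earlier articles it references as the relevant past articles. The Article fields, the reference links, and this particular simulation rule are assumptions made for the example; they do not reproduce the exact construction procedure proposed in this thesis.

    # Illustrative sketch: simulating (incomplete narrative, relevant past articles)
    # pairs from an existing news collection.
    from dataclasses import dataclass, field
    from datetime import date
    from typing import Dict, List, Tuple

    @dataclass
    class Article:
        article_id: str
        published: date
        headline: str
        lead: str                                   # first paragraph: main event plus context
        referenced_ids: List[str] = field(default_factory=list)

    def build_examples(collection: Dict[str, Article]) -> List[Tuple[str, List[str]]]:
        """Return (incomplete narrative, ids of relevant past articles) pairs."""
        examples = []
        for article in collection.values():
            narrative = f"{article.headline}. {article.lead}"
            relevant = [ref for ref in article.referenced_ids
                        if ref in collection
                        and collection[ref].published < article.published]
            if relevant:
                examples.append((narrative, relevant))
        return examples

Given such pairs, lexical rankers (e.g., term overlap or BM25 between the narrative and candidate articles) and semantic rankers (e.g., embedding similarity) can then be compared on how well they retrieve the relevant past articles.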
Samenvatting

Search engines make use of knowledge to improve access to information. Such knowledge comes in different forms: unstructured knowledge (e.g., text documents) and structured knowledge (e.g., relationships between real-world objects and topics). To leverage knowledge effectively, search engines must take context into account, in this case, additional information about the user and the query. In this thesis, we aim to support search engines in leveraging knowledge while at the same time taking different types of context into account.

In the first part of this thesis, we study how structured knowledge can be made more accessible to the user when the search engine proactively provides such knowledge as context to enrich search results. We focus on knowledge graphs (KGs), which store world knowledge in the form of facts, i.e., relationships between entities (e.g., persons, locations, organizations). Because KG facts are stored in a formal form, they are not suitable for presentation to end users. As a first task, we study how to retrieve natural language descriptions of KG facts from a text corpus. We propose a method that successfully extracts and ranks descriptions of KG facts. The method does not work, however, when no description exists for a given KG fact. This leads us to our second task, where we study how to automatically generate KG fact descriptions. We propose a method that first creates sentence templates and then fills them with relevant information from the KG. KG fact descriptions often contain mentions of other related facts that can increase the understanding of the fact as a whole. As a third task, we study how to contextualize KG facts, that is, automatically find facts related to a query fact. We propose a method that enumerates KG facts in the neighborhood of the query fact and then ranks them with respect to their relevance to the query fact.
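To illustrate the enumerate-then-rank idea behind fact contextualization, the following minimal sketch finds and orders facts in the neighborhood of a query fact over a toy knowledge graph. The triples, the notion of neighborhood, and the scoring heuristic are illustrative assumptions and not the method proposed in this thesis.

    # Illustrative sketch: enumerate facts around a query fact, then rank them.
    from typing import List, Tuple

    Fact = Tuple[str, str, str]  # (subject, predicate, object)

    KG: List[Fact] = [
        ("Amsterdam", "capital_of", "Netherlands"),
        ("Amsterdam", "located_in", "North Holland"),
        ("Netherlands", "member_of", "European Union"),
        ("Rotterdam", "located_in", "South Holland"),
    ]

    def enumerate_neighborhood(query_fact: Fact, kg: List[Fact]) -> List[Fact]:
        """Candidates: all facts that share at least one entity with the query fact."""
        query_entities = {query_fact[0], query_fact[2]}
        return [f for f in kg
                if f != query_fact and ({f[0], f[2]} & query_entities)]

    def relevance_score(query_fact: Fact, candidate: Fact) -> float:
        """Toy relevance: shared entities count more than a shared predicate."""
        shared_entities = len({query_fact[0], query_fact[2]} & {candidate[0], candidate[2]})
        shared_predicate = 1.0 if query_fact[1] == candidate[1] else 0.0
        return 2.0 * shared_entities + shared_predicate

    def contextualize(query_fact: Fact, kg: List[Fact]) -> List[Fact]:
        """Enumerate candidates in the neighborhood of the query fact and rank them."""
        candidates = enumerate_neighborhood(query_fact, kg)
        return sorted(candidates, key=lambda f: relevance_score(query_fact, f), reverse=True)

    print(contextualize(("Amsterdam", "capital_of", "Netherlands"), KG))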
In the second part of this thesis, we move to a different research theme and study how to improve interactive knowledge gathering. We focus on conversational search, where the user interacts with the search engine to gather knowledge over large unstructured knowledge repositories. Here, the search engine must take into account the context that arises from interactions between the user and the search engine in a conversational search session. We focus on multi-turn passage retrieval as an instance of conversational search. A prominent challenge is that the query of a single turn may be underspecified. We therefore need to perform query resolution, that is, add missing context from the conversation history to the current turn. We propose to model query resolution as a term classification task and propose a method to address it.

In the third and final part of this thesis, we focus on a specific type of search engine user, namely professional writers in the news domain. We study how to support such writers in creating narratives about events by exploring knowledge from a corpus of news articles. We focus on a scenario in which the writer has already written an incomplete narrative that consists of a main event and a context, and aims to retrieve news articles that discuss relevant events from the past. We formally define the task of news article retrieval in context for event-centric narrative creation. We propose a dataset construction procedure for this task that relies on existing news articles to simulate incomplete narratives and relevant articles. We study the performance of multiple rankers, lexical and semantic, and perform an in-depth quantitative and qualitative analysis to gain insight into the characteristics of this task.