Entities of Interest: Discovery in Digital Traces

David Graus
ACADEMIC DISSERTATION

to obtain the degree of doctor at the Universiteit van Amsterdam, by authority of the Rector Magnificus, prof. dr. ir. K.I.J. Maex, before a committee appointed by the Board for Doctorates (College voor Promoties), to be defended in public in the Agnietenkapel on Friday, June 16, 2017, at 10:00, by
David Paul Graus
born in Beuningen

Doctoral committee (Promotiecommissie)
Promotor: prof. dr. M. de Rijke, Universiteit van Amsterdam
Co-promotor: dr. E. Tsagkias, 904Labs
Other members: prof. dr. A.P.J. van den Bosch, Radboud Universiteit Nijmegen
prof. dr. ing. Z.J.M.H. Geradts, Universiteit van Amsterdam
dr. ir. J. Kamps, Universiteit van Amsterdam
dr. E. Kanoulas, Universiteit van Amsterdam
prof. dr. D.W. Oard, University of Maryland

Faculteit der Natuurwetenschappen, Wiskunde en Informatica

The research was supported by the Netherlands Organization for Scientific Research (NWO) under project number 727.011.005.

SIKS Dissertation Series No. 2017-23. The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems.

The printing of this thesis was financially supported by the Co van Ledden Hulsebosch Center, Amsterdam Center for Forensic Science and Medicine.

Copyright © 2017 David Graus, Amsterdam, The Netherlands
Cover by Rutger de Vries / perongeluk.com
Printed by Off Page, Amsterdam
ISBN: 978-94-6182-800-2

Acknowledgements
I started the process that culminated in the little booklet you are now reading in 2012. At that time, I had never touched a unix, I didn't know what an ssh was, and I hardly knew how to write a line of code. I hate to be overly dramatic (I really don't. . .), but I have grown, not only as a researcher, but also as a person, and this growth is not exclusively to be attributed to the passing of time. Over the course of my PhD I have become more confident, about myself and my skills, and I have learned I can pick up anything I put my mind (and time) to. The latter is by far the greatest life lesson my PhD has brought me.

For all of this, I have to thank some people. First, P&C, for setting up the prior by raising me in a home where I naturally got exposed to computers, technology, and (overly) critical thinking. P, for putting me and Mark in front of a Sinclair ZX81 at a young age, traumatizing us with the scariest ASCII-rendered games imaginable. It shaped me into the tough trooper I am today. C, for putting up with a grumpy household when our attempts at assembling computers from parts scavenged from all over the place invariably failed.

Next, Maarten, for taking a "chance" with me. I was quite wet behind the ears, as one would say, before I started my PhD. I am glad and thankful that this did not hold Maarten back from giving me the opportunity to see how I would pan out as a computer scientist. I was a coin-flip. Thanks for flipping!

Next next, ILPS. First, the senior population at ILPS. In particular, Edgar, who at the time understood better than me what I was doing, and who saw that I could do and might enjoy this PhD. Manos and Wouter, who guided, supervised, co-authored, and co-lived this period with me. My peers at ILPS, of which I will name a few because these are my acknowledgements and I get to decide whether I single people out or not: Daan, Anne, Tom, and Marlies. I enjoyed interfering with your day-to-day activities, sharing misery, disagreeing (hi Anne), coffee breaks, bootcamps and runs, and everything else.

Then, my co-authors, for partaking in that thing we do in science, which conveniently provided the basis of this dissertation. Paul, Ryen, and Eric for mentoring me during my internship at Microsoft Research in the summer of 2015. Best summer ever.

Finally, my brothers from other mothers, Rutger and Marijn, for putting up with me being right all the time, slightly cocky, and (at times) tending towards the far-right end of the spectrum of self-confidence. I'm sorry.

Moving on to the nonliving things: Polder, Oerknal, Maslow, and Joost (in order of appearance) for providing shelter for unwinding, escaping, brainstorming, philosophizing and reflecting on life, academia, and beyond. Oh, and also for drinking beers. Twitter, for giving me a place where I could write down stuff when Christophe would stop listening to the monologues I typically directed at my computer screen (but were intended for all to hear). Spotify and Bose, for sheltering me from the real world during extended periods of (coding) cocooning.

Okay, this is not an exhaustive list, but I have to both start and stop somewhere. Before we part ways and move on to the researchy stuff, I will leave you with one big fat cliché, which I know loses some of its expressive power by virtue of being a cliché (but which I hope to counteract a bit by being explicit about my awareness of it being a cliché): It was a great journey!
David

Contents
I Analyzing, Predicting, and Retrieving Emerging Entities
II Analyzing and Predicting Activity from Digital Traces
Bibliography
Samenvatting
Summary
Introduction

"Tell me what you eat, and I will tell you what you are."
—Jean Anthelme Brillat-Savarin
From the things you "like" on Facebook, algorithms can infer many personal and demographic traits with surprisingly high accuracy: for example, political preferences and religious beliefs, but also more obscure factoids, such as whether someone's parents were together during their childhood [276]. The predictive power of these algorithms relies on the availability of digital traces of a large number of people. In the era of Big Data, digital traces are available in abundance. We willingly share, post, like, link, play, and query, all the while leaving behind rich digital traces. Traces that, when aggregated, can provide meaningful insights into behavior, preferences, and profiles of people.

We distinguish between two types of digital traces:
active digital traces, which people leave behind deliberately, e.g., blog posts, tweets, or forum posts, and passive digital traces, which people may leave behind unknowingly, e.g., by clicking on links on websites, querying web search engines, or simply visiting websites [94]. The quality of, and our reliance on, online services has enabled the wide availability of these digital traces. Combined with rapid developments in computer hardware, software, and algorithms, the availability of our digital traces has given rise to (yet) a(nother) resurgence of Artificial Intelligence (AI). With it, a new economy is on the rise that revolves around applied machine learning for understanding users, e.g., inferring their traits, preferences, and behavior, for applications such as targeted advertising, personalizing content, and, more generally, improving online services.

People have voiced concerns about this rise of the "algorithm," branding it as the demise of our privacy and freedom [242], the death of politics [189], and generally harmful [247]. But indisputably, mining, collecting, analyzing, and leveraging our digital traces has brought many advances. At its most visible level, it has permeated our day-to-day lives, and has enabled access to unforeseen amounts of data, information, and knowledge. While this information access may seem self-evident to us, it is not. Enabling effective access to these huge amounts of information relies to a large degree on understanding people's behavior, preferences, and traits. The "relevance" of information is highly user-specific: it depends on the context of the user (e.g., the time of day, geographical location, or a task the user may be executing), and on more personal aspects such as taste, preference, or background [186].
We would be unable to find what we wanted if it were not for the search engines that learn from their users [122, 227], the filtering algorithms that find and prioritize content relevant to us [54, 77], and the recommendation algorithms that allow services such as Netflix and Spotify to serve a wider variety of content, by no longer being constrained to catering to the masses like traditional media.

Next to these everyday, practical applications, analyzing and mining digital traces has brought advances for (social) good, too. In the health domain, search engine logs have led to discoveries of new and previously unreported side effects of prescription drugs [261], and adverse drug reactions [275]. Social media posts have allowed us to develop methods for identifying people who suffer from depression [56]. And more generally, the research fields of sociology, anthropology, (social) psychology, and (digital) humanities have benefited from the possibilities brought by large-scale analysis of (digital) traces [166].

And this is just the beginning. More and more areas are starting to reap the benefits of advances in artificial intelligence and the ability to extract information from large collections of digital data.
Discovery in Digital Traces
The Enron scandal [264], the Hillary Clinton email controversy [263], the Panama Papers [265], or any data release by WikiLeaks [266] are all examples of cases in which large amounts of (digital) traces needed to be investigated, explored, and turned upside down to gain insights and discover "evidence." This E-Discovery task is at its core "finding evidence of activity in the real world" in "Electronically Stored Information" (ESI) [197]. Discovery in digital data finds applications in many domains and scenarios, e.g., in (investigative) journalism [151], digital forensics [128, 167], and litigation, where data may be requested by a plaintiff with the aim of gaining insights and finding evidence for a legal case [197].

Whether in litigation, journalism, or digital forensics, in this discovery task users set out to browse and explore large collections of data (usually text), with the aim of discovering new information, or of finding answers to any questions they may have.

Gaining insights and making sense of our digital traces may seem trivial. Often in written form, digital traces hold the promise of being explicit, structured, unambiguous, and readily interpretable. However, even written language is not as structured as it may seem. First, language is noisy. Ambiguity is common across languages, the meaning of words may change over time, and without understanding the continuously changing context in which language is created, it is often impossible to understand its intended meaning [87]. Second, the amount of digital data we produce on a daily basis is enormous and ever-growing. We are exposed to and produce an ever-increasing volume of data, both online on, e.g., social media, collaboration platforms, and email, and offline on, e.g., laptops, mobile phones, computers, and storage media [161]. To address these challenges, the need arises for automated methods for sense-making of digital traces for discovery, with the goal of both understanding their content and the contexts in which they are created.

In this thesis, we draw inspiration from this discovery scenario, and propose methods that aim to support sense-making of digital traces. We focus on textual traces, e.g., emails, social media streams, and user interaction logs. In the two parts of this thesis, we address two different aspects that are central to sense-making from digital traces.

In the first part of the thesis, we study methods for sense-making of the content of (textual) digital traces. More specifically, our objects of study are the real-world entities that are referenced in digital traces. Knowing, e.g., which people are mentioned in email communication between employees of a company, which companies are discussed by stockbrokers on Twitter, or which people and companies are mentioned in the Panama Papers, is central to the exploratory and investigative search process that is inherent to sense-making [268]. In this part, our entities of interest are real-world entities: "things with distinct and independent existence" [1] that exist in the real world. We thus study the occurrence of concrete entities, e.g., companies, organizations, locations, and products, in digital traces such as email, social media, and blogs.

In the second part of the thesis, we focus on the context of digital traces. Here, our entities of interest are the producers of digital traces, i.e., the people that leave behind the traces. Uncovering real-world activity is a central task in the E-Discovery sense-making process [197]; thus, we focus on methods for uncovering this (evidence of) real-world activity from digital traces. Our aim is to understand real-world behavior from digital traces. We present two case studies where we analyze and leverage patterns in behavior to predict activity of the people who leave behind digital traces.
Research Outline and Questions

As outlined above, we distinguish two research themes on automated methods for sense-making of digital traces.

In the first part, our entities of interest are the real-world entities that are referenced in digital traces. Identifying the real-world entities that appear in ESI such as emails, social media, or forums and blogs supports the exploratory and complex search process that underlies the discovery process. In the discovery scenario, the entities of interest may not be known in advance, i.e., they may not be well-known or established entities. For this reason, we focus on so-called emerging real-world entities, i.e., entities that are not (yet) described in publicly available knowledge bases such as Wikipedia. We address three tasks: (i) analyzing, (ii) predicting, and (iii) retrieving newly emerging real-world entities.

In the second part of this thesis, our entities of interest are the producers of the digital traces. We aim to (i) understand their behavior and real-world activity by analyzing digital traces, and (ii) predict their (future) real-world activity, leveraging the analyses and uncovered patterns. In this part, we present two case studies where we (i) analyze and predict email communication by studying the communication network and email content, and (ii) analyze interaction data with a mobile intelligent assistant to predict when a user will perform an activity.
Part I. Analyzing, Predicting, and Retrieving Emerging Entities
The discovery task is a lot like looking for needles in haystacks, without knowing what the needles look like. Users of search systems in the discovery context (e.g., forensic analysts, lawyers, or journalists) may be interested in finding evidence of real-world activity, without knowing beforehand what type of activity, nor whose activity, they are looking for. Traditional IR methods fall short in supporting this type of exploratory search [237]. They largely rely on lexical matching, i.e., matching words in a search query to words in a document, to allow users to retrieve documents from a collection. However, this type of search is not suitable for the typical discovery scenario, where the aim is to understand the 5 W's: Who was involved? What happened? Where, when, and why did it happen? [197]. For answering these questions, it is typically difficult to formulate an exhaustive set of search queries, which means the keyword-based lexical matching paradigm does not suffice.

Semantic Search, i.e., "search with meaning" [20], is an alternate search paradigm that aims to move beyond lexical matching, and improve document retrieval by incorporating additional (external) knowledge in the search process. This additional knowledge can be in the form of the discussion structure of email threads, the topical structure of documents, or information on the entities and their relations that are mentioned in documents [250].

In the first part of this thesis we focus on real-world entities that appear, i.e., are referenced, in digital traces. Understanding which real-world entities appear and are discussed in digital traces is of central importance in exploratory search [140], and in answering the 5 W's.

A central task in semantic search is entity linking (EL) [20]. EL revolves around linking the mentions of real-world entities in text to their representations in an external knowledge base (KB). Linking mentions of real-world entities in text to their KB representations effectively boils down to disambiguation, and enables the enrichment of text with additional information and structure. Knowing the unambiguous real-world entities that occur in text enables better support for searching through large text collections [26] and for the complex information seeking behavior that is inherent to the discovery process [268], e.g., by enabling filtering for specific entities (e.g., a person) or entity types (e.g., companies), and by providing insights into how different entities co-occur.

EL relies on external knowledge from a reference KB that contains (descriptions of) real-world entities [216]. An example and often-used instantiation of such a reference KB is Wikipedia, the world's largest encyclopedia. Wikipedia is online, collectively built, and known to democratize information [256]. At the time of writing, the English Wikipedia contains over 5 million articles that represent both abstract and real-world entities, written by over 29 million users [267]. With this broad coverage, spanning the majority of entities that emerge in the media and play a role in our public discourse, it has been dubbed a "global memory place," where collective memories are built [205]. In addition, the rich metadata of Wikipedia, e.g., its hyperlink structure, category structure, and infoboxes, can be effectively employed as additional signals for disambiguation, which makes Wikipedia an attractive standard reference KB for EL systems.

Due to the ambiguous nature of language, knowing the words that refer to entities (i.e., entity mentions) alone is unlikely to be sufficient for answering the 5 W's. To answer who, where, or what the mention refers to, the most important step of EL is not to identify an entity mention, but to correctly link it to its referent KB entry. The main challenge in EL is ambiguity: a single entity mention (e.g., "Michael") can refer to multiple real-world entities in the KB (e.g., Michael Jordan, Michael Jackson, or even any other real-world entity that may not be described in the KB). At the same time, a single KB entity may be referred to by multiple entity mentions (e.g., Michael Jackson may be referred to with "the king of pop," "MJ," "Mr. Jackson," or simply "Michael").

EL methods using Wikipedia have been effectively applied to many challenging domains where context is sparse and language may be noisy, e.g., Twitter [176] and television subtitles [199]. However, the E-Discovery scenario provides an additional challenge over noisy and large volumes of data: the entities of interest may not be included in the reference KB.

In the first part of this thesis our entities of interest are these entities that are absent from the reference KB; entities about which we may have little prior knowledge. We call these entities emerging entities, as they are initially absent from the KB, but of interest to an end user, i.e., worthy of being incorporated into the KB.

We address three tasks that relate to emerging entities. The first is largely observational in nature: here, we study how entities emerge in online text streams, by analyzing the temporal patterns of their mentions before they are added to a KB. Next, we bootstrap the discovery of emerging entities in social media streams, by generating pseudo-ground truth to learn to predict which entity mentions represent emerging KB entities. And finally, we collect external and user-generated additional entity descriptions to construct and enrich entity representations, to allow searchers to retrieve them more effectively.
Chapter 3. Analyzing Emerging Entities
We start our exploration of real-world entities in digital traces with a large-scale analysis of how entities emerge. The digital traces we study are online text streams (e.g., news articles and social media posts), and the entities of interest are emerging entities. Understanding the temporal patterns under which entities emerge in online text streams, before they are added to a KB, may be helpful in predicting newly emerging entities of interest, or in distinguishing between different types of entities of interest. In Chapter 3 we study the temporal patterns of emerging entities in online text streams, and answer the following research question:
RQ1
Are there common temporal patterns in how entities of interest emerge in online text streams?

To answer this question, we analyze a large collection of entity time series, i.e., time series of entity mentions in the timespan between the first mention of an entity in online text streams and its subsequent incorporation into the KB. Our collection contains over 70,000 entities that emerge in a timespan of over 18 months. We find that entities emerge in broadly similar patterns. We distinguish between entities that emerge gradually after being introduced in public discourse, and entities that emerge more abruptly. Furthermore, we study the differences and similarities between how entities emerge in different types of online text streams, and the differences and similarities in the emergence of different types of emerging entities.
Chapter 4. Predicting Emerging Entities
Motivated by the findings on the dynamic nature of knowledge bases and the continuous stream of newly emerging entities in Chapter 3, in Chapter 4 we address the task of predicting newly emerging entities. The digital traces we study are social media streams (i.e., Twitter), and our entities of interest are entities that are similar to a set of seed entities, i.e., entities that are described in a reference KB. In the discovery process a searcher may have some prior knowledge of the entities of interest, e.g., she may have a reference list of entities with the goal of exploring a document collection in search of similar entities. In such a scenario, where the entities of interest are absent from the KB, traditional EL methods fall short. In Chapter 4 we propose a method that leverages the prior knowledge that is encoded in a reference KB, and existing methods for recognizing and linking KB entities, to recognize newly emerging entities that are not (yet) part of the KB. In this chapter, we answer the following research question:
RQ2
Can we leverage prior knowledge of entities of interest to bootstrap the discovery of new entities of interest?

We propose an unsupervised method, where we use an EL method and an incomplete reference KB to generate pseudo-ground truth to train a named-entity recognizer to detect mentions of emerging entities. Mentions of emerging entities are entity mentions that were not linked by the EL method, but that do occur in contexts similar to those of the entities that are in the reference KB. We compare the effect of different sampling strategies on the pseudo-ground truth, and on the resulting predictions of emerging entities. We show that sampling based on textual quality and on the confidence score of the EL method are effective methods for increasing the effectiveness of discovering absent entities. Furthermore, we show that with a small amount of prior knowledge, our method is able to cope with missing labels and incomplete data, justifying the approach of generating pseudo-ground truth. Finally, the method we propose is domain and language independent, as it does not rely on language- or domain-specific features, making it particularly suitable for the discovery scenario.
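To make the general idea concrete, the following is a minimal sketch of pseudo-ground truth generation. The `entity_linker` interface, the document objects, and the confidence threshold are hypothetical stand-ins, not the exact implementation of Chapter 4.

```python
def generate_pseudo_ground_truth(documents, entity_linker, min_confidence=0.5):
    """Turn confident entity linker output into token-level NER training labels.

    `entity_linker.link(doc)` is a hypothetical interface returning mentions
    with token offsets and a confidence score for the linked KB entity.
    """
    training_data = []
    for doc in documents:
        labels = ["O"] * len(doc.tokens)  # default: not part of an entity mention
        for mention in entity_linker.link(doc):
            if mention.confidence >= min_confidence:  # confidence-based sampling
                for i in range(mention.start, mention.end):
                    labels[i] = "ENTITY"
        training_data.append((doc.tokens, labels))
    return training_data

# A NER model trained on these labels generalizes beyond the KB: mentions that
# occur in similar contexts but were never linked are candidate emerging entities.
```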
Chapter 5. Retrieving Emerging Entities
Finally, in Chapter 5, the last chapter of Part I, we address the task of retrieving entities of interest. As seen in Chapter 3, entities may suddenly appear in bursts, as events unfold in the real world. In order to capture the changing contexts in which entities appear, and improve the retrieval (i.e., search) of entities, we propose a method that dynamically enriches and expands the (textual) representations of entities. We enrich representations of an entity by collecting descriptions from a variety of different dynamic sources that represent the collective intelligence. The different sources constitute different (user-generated) digital traces, e.g., social media, social tags, and search engine logs. By expanding entity representations with these collective digital sources, we bridge the gap between the informal ways in which users search for and refer to entities, and the more formal ways in which they are represented in the KB. In this chapter, we answer the following research question:
RQ3
Can we leverage collective intelligence to construct entity representations for increased retrieval effectiveness of entities of interest?

We combine the collected additional descriptions from digital traces into a single entity representation, by learning to weight and combine the heterogeneous content from the different sources. Our method learns directly from users' past interactions (i.e., search queries and clicks), and enables the retrieval system to continuously learn to optimize the entity representations towards how people search for and refer to the entities of interest. We find that incorporating dynamic description sources into entity representations enables searchers to better retrieve entities. In addition, we find that informing the ranker of the "expansion state" of the entity overcomes challenges related to heterogeneity in entity descriptions (i.e., popular entities may receive many descriptions, and less popular entities few), and further improves retrieval effectiveness.
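As a deliberately simplified illustration of the idea (not the learning-to-rank model of Chapter 5), the sketch below scores an entity as a weighted combination of term matches per description field; the field names, the toy entity, and the fixed weights are invented for the example, whereas in the chapter the weights are learned from queries and clicks.

```python
def score_entity(query_terms, entity_fields, field_weights):
    """Weighted sum of per-field term-match scores for one entity."""
    score = 0.0
    for field, text in entity_fields.items():
        matches = sum(text.lower().split().count(t) for t in query_terms)
        score += field_weights.get(field, 0.0) * matches
    return score

entity = {
    "kb_description": "american rapper and songwriter",
    "tweets": "king kendrick drops a new album",  # dynamic, user-generated
    "tags": "rapper hip-hop kendrick",
}
weights = {"kb_description": 1.0, "tweets": 0.5, "tags": 0.8}  # illustrative
print(score_entity(["king", "kendrick"], entity, weights))
```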
Part II. Analyzing and Predicting Activity from Digital Traces
In the second part of this thesis we take a different view on entities of interest. Here, our entities of interest are the people who leave behind the digital traces, i.e., the producers of digital traces. Our aim is to better understand the producers' context under which they leave behind digital traces. This part revolves around finding evidence of activity in the real world, which is an essential task in the sense-making process of E-Discovery [197]. We present two case studies of analyzing and predicting human activity given digital traces.

Analyzing and predicting user behavior has a rich history in the information retrieval and the user modeling, adaptation, and personalization communities [124]. Understanding human behavior by studying historic interactions and digital traces finds many applications, e.g., improving personalization and recommendation systems [157, 158], improving search engines [4, 8, 122], or improving information filtering [188].

In both case studies we look at the impact of the contexts in which digital traces are created, and of their content, on predicting activity of entities of interest. In our first case study, we analyze and predict email communication in an enterprise. In our second case study, we study interaction logs of users with an intelligent assistant on a mobile device, and aim to predict user task execution.
Chapter 6. Analyzing and Predicting Email Communication
Analyzing communication patterns can be helpful in answering questions like who was involved? and who knew what?, which are central questions in the discovery process. Increased understanding of the aspects that guide communication, and being able to predict communication flows, can be helpful in identifying atypical or unexpected communications, a valuable signal in the discovery process [200]. In this chapter, the digital traces under study are enterprise email, and the entities of interest are emailers.

We study the impact of contextual and content aspects of email in predicting communication. More specifically, as context, we study the impact of leveraging the enterprise's "communication graph" for predicting email recipients. The communication graph provides contextual clues about the creation of digital traces, e.g., the "position" of an emailer in a communication graph may implicitly capture their position in the company, and the strength of ties, or the proximity between two emailers, may capture their professional or social relations. Next, we study the impact of the content of the digital traces, i.e., we leverage the similarity between emails for predicting likely recipients. In this chapter, we answer the following research question:
RQ4
Can we predict email communication through modeling email content and communication graph properties?

To answer this question, we present a hybrid model for email recipient prediction that leverages both the information from the communication graph of the email network, and the Language Models (LMs) of the emailers, estimated from the content of the emails sent by each user. We find that both the context and the content of digital traces provide a strong baseline for predicting recipients, but are complementary, i.e., combining both signals achieves the highest accuracy for predicting the recipients of email. We obtain optimal performance when we incorporate into our model the number of emails a recipient has received so far, and the number of emails a given sender has sent to a recipient at that point in time.
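The sketch below conveys the flavor of such a hybrid model: a popularity prior, a connectedness term, and a content LM score, combined in log space. The smoothing constants, the additive combination, and the lookup structures (`received`, `sent`, `recipient_lms`) are our own simplifications, not the exact estimator of Chapter 6.

```python
import math

def score_recipient(sender, recipient, email_terms, received, sent, recipient_lms):
    """Score a candidate recipient by combining context and content signals."""
    # Context: how many emails the candidate has received so far (prior) ...
    prior = math.log(1 + received.get(recipient, 0))
    # ... and how often this sender has emailed this candidate (connectedness).
    connectedness = math.log(1 + sent.get((sender, recipient), 0))
    # Content: likelihood of the email's terms under the candidate's LM.
    content = sum(math.log(recipient_lms[recipient].get(t, 1e-6))
                  for t in email_terms)
    return prior + connectedness + content

# Candidates are then ranked by score; the true recipients should rank on top.
```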
Chapter 7. Analyzing and Predicting Task Reminders
Our second case study revolves around a proliferating type of digital trace: user interaction logs from mobile devices, and more specifically, user interactions with a personal intelligent assistant. With the rise of mobile devices in E-Discovery, an ever more important task is to understand the real-life behavior of people through traces from their mobile devices. Intelligent assistants are becoming ubiquitous; Google's Now, Apple's Siri, Amazon's Echo, Facebook's M, and Microsoft's Cortana have all been introduced in rapid succession over the last few years. In this chapter, we study user interaction logs with Microsoft's Cortana. Through their personal, human-like, and conversational nature, intelligent assistants are more closely embedded in a user's daily life and activity than some of the technologies they borrow from (e.g., search engines, calendar, planning, and time management apps). By closing the gap between a user's offline and online world, intelligent assistants have a potentially large impact on a user's life, such as on her productivity, time management, or activity planning. The digital traces left behind through interacting with personal assistants may thus contain important signals related to the user's offline real-world activities. We focus on studying Cortana's reminder service, as reminders represent tangible traces of people's real-life (planned) activities and tasks. Understanding the temporal patterns related to reminder setting and task execution can help in inferring and understanding people's behavior in the real world.

We aim to identify common categories of tasks that people remind themselves of, and study temporal patterns linked to the types of tasks people execute. In this chapter, we answer the following research question:
RQ5
Can we identify patterns in the times at which people create reminders, and, via notification times, when the associated tasks are to be executed?

More specifically, we apply a data-driven analysis to identify a body of common task types that give rise to reminders across a large number of users. We arrange these task types into a taxonomy, and analyze their temporal patterns. Furthermore, we address a prediction task, and much like the work presented in Chapter 6, we study the impact of both the content and the contexts of the digital traces. We show that the time at which a user creates a reminder (context) is a strong indication of when the task is scheduled to be executed. However, including the description of the task reminder (content) further improves prediction accuracy.
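A toy sketch of such a prediction setup is shown below, assuming scikit-learn is available. The example reminders, the notification-time buckets, and the trick of appending the creation hour as a pseudo-term are illustrative assumptions, not the models or data of Chapter 7.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reminders: task description plus a pseudo-term encoding the creation hour,
# so that content (terms) and context (creation time) share one representation.
train_x = ["pay the rent created_at_09",
           "call mom created_at_18",
           "take out the trash created_at_20",
           "buy groceries created_at_10"]
train_y = ["morning", "evening", "evening", "morning"]  # notification-time buckets

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_x, train_y)
print(model.predict(["pay the bills created_at_09"]))  # likely "morning"
```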
Main Contributions

In this section, we list the theoretical, technical, and empirical contributions made in this thesis.
Analyses and Algorithms
A large-scale study of emerging entities
In Chapter 3 we study a large dataset which contains over 70,000 entities that emerge in a timespan of over 18 months in over 579M documents, and show that entities emerge in broadly two types of patterns: with an initial burst of increased attention leading up to incorporation into Wikipedia, or in a more gradual pattern, where attention builds up over time. Furthermore, we identify characteristic differences between how entities emerge in news and in social media streams. Finally, we show that specific entity types are more strongly associated with specific emergence patterns.
Unsupervised method for generating pseudo-ground truth using EL
In Chapter 4 we propose an unsupervised method that uses an EL method for generating training material for a named-entity recognizer to detect entities that are likely to become incorporated into a KB. Our method can be applied with any trainable Named-Entity Recognition (NER) model and any EL method that is able to output a confidence score for a linked entity. Furthermore, our method is not dependent on human annotations that are necessarily limited and domain- and language-specific, is not restricted to a particular class of entities, and is suitable for domain and/or language adaptation as it does not rely on language-specific features or sources.
Dynamic collective entity representations
In Chapter 5 we employ collective intelligence to construct "dynamic collective entity representations," i.e., we dynamically create entity representations that encapsulate the different ways in which people refer to or talk about the entity. In doing so, we show we can bridge the gap between the words used in a KB entity description and the words used by people to refer to entities.
Generative hybrid model for email recipient prediction
In Chapter 6 we propose a hybrid generative model aimed at calculating the probability of an email recipient, given the sender and the content of the email. We show how both the communication graph properties and the email content properties contribute to a highly accurate prediction of email recipients.
Large-scale study of user interaction logs with a personal intelligent assistant
In Chapter 7 we present the first large-scale study of the creation of common task reminders. We show that users largely remind themselves of short-term chores and tasks, such as going somewhere, communicating, or performing daily chores. We develop a taxonomy of types of time-based reminders, facilitated by the data we have about the actual reminders created by a large population of users. We show how reminders display different temporal patterns depending on the task type that they represent, the creation time of the reminder, and the terms in the task description. We study temporal patterns in reminder setting and notification, demonstrating noteworthy patterns, which we leverage to build predictive models. The models predict the desired notification time of reminders, given the reminder text and creation time.
Empirical Contributions
A large-scale study of emerging entities
In Chapter 3 we apply a hierarchical agglomerative clustering method to over 70,000 entity document-mention time series, and uncover two distinct patterns of how entities emerge in public discourse.
Two sampling methods for pseudo-ground truth
Chapter 4 presents two sampling methods for automatically generated training data. We show that (i) sampling based on textual quality improves the performance of NER and consequently the performance of predicting emerging entities, and (ii) sampling based on the confidence score of the EL method which provides the pseudo-training data labels results in fewer labels but better performance.
Effect of KB size in emerging entity prediction
In Chapter 4 we study the effect of the size of the KB (i.e., the amount of prior knowledge) on predicting newly emerging entities, and find consistent and stable precision regardless of KB size, which justifies our emerging entity prediction method that assumes incomplete data by design.
Dynamic collective entity representations
In Chapter 5 we show that entity representations enriched with external descriptions from various sources better capture how people search for entities than their original KB descriptions. Furthermore, we show that informing the ranker of the expansion state of the entity, i.e., the number and type of additional descriptions the entity holds, increases retrieval effectiveness.
Email recipient prediction
In Chapter 6 we show that combining communication graph features with email content features achieves optimal predictive power. We show that the number of emails received by a recipient is an effective way of estimating the prior probability of observing that recipient, and that the number of emails sent between two users is an effective way of estimating the "connectedness" between these two users, which is a helpful signal in ranking recipients.
Thesis Overview

The first chapter, which is the one you are currently reading, introduces and motivates the research topic of this thesis: automated methods for sense-making of digital traces. Furthermore, this chapter provides an overview of the main contributions, content, and origins of the chapters that follow. Chapter 2 discusses the background and related work that serves as a basis for the following chapters. We describe Information Retrieval (IR), Semantic Search, Entity Linking (EL), and user modeling and analysis. The core of this thesis consists of two parts.
In Part I of this thesis, we study emerging entities and incomplete knowledge bases. It consists of three chapters. Chapter 3 presents an empirical study of the temporal patterns of entities as they emerge in online text streams, before they are incorporated into Wikipedia. Chapter 4 presents a method for leveraging an incomplete reference KB to generate pseudo-training data for training a system for discovering emerging entities, i.e., entities similar to the reference entities but absent from the KB. Chapter 5 presents a novel method for leveraging collective intelligence and learning from the past behavior of users to construct dynamic entity representations, aimed at improving entity retrieval effectiveness.

Part II of this thesis revolves around analyzing and predicting human behavior from digital traces. Chapter 6 presents a case study of enterprise email communication, and provides insights into the different aspects that guide communication between people. Chapter 7 presents a large-scale user log study of an intelligent personal assistant, and shows the types of tasks people tend to remind themselves about.

Finally, Chapter 8 concludes this thesis; there we summarize the content and findings of this thesis, discuss the limitations of the presented work, and briefly reflect on future work.

The two parts of this thesis (Part I and Part II) are self-contained and form independent parts. For the reader's convenience, both the background chapter (Chapter 2) and the conclusions (Chapter 8) follow this structure, i.e., the relevant background material, conclusions, and future work are organized separately for each part, so that both parts can be read independently.
Origins

This thesis is based on five publications [101-105]. In addition, it draws on ideas from four others [97-100]. This section lists for each chapter the publication it is based on, as well as the contributions of the author and co-authors.
Chapter 3 is based on
The Birth of Collective Memories: Analyzing Emerging Entities in Text Streams, currently under submission at the Journal of the Association for Information Science and Technology (JASIST). This publication is written with Odijk and de Rijke. Graus wrote the clustering and analysis code. Odijk contributed to the data parsing pipeline (i.e., adding timestamps to entity annotations). Graus performed the experiments and the analysis. All authors contributed to the text.
Chapter 4 is based on
Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams, which was published at ECIR 2014 by Graus, Tsagkias, Weerkamp, Buitinck, and de Rijke. Graus implemented the general pipeline and algorithms, Tsagkias and Weerkamp contributed code to the sampling methods. Graus performed the experiments and analyses. All authors contributed to the text.
Chapter 5 is based on
Dynamic Collective Entity Representations for Entity Ranking, which was published at WSDM 2016 by Graus, Tsagkias, Weerkamp, Meij, and de Rijke. Graus implemented the general pipeline and algorithms, Tsagkias and Weerkamp contributed code to the sampling methods. Graus performed the experiments and analyses. All authors contributed to the text.
Chapter 6 is based on
Recipient Recommendation in Enterprises Using Communication Graphs and Email Content, which was published at SIGIR 2014 by Graus, van Dijk, Tsagkias, Weerkamp, and de Rijke. Graus implemented the general pipeline and algorithms, Tsagkias and Weerkamp contributed code to the sampling methods. Graus performed the experiments and analyses. All authors contributed to the text.
Chapter 7 is based on
Analyzing and Predicting Task Reminders, which was published at UMAP 2016 by Graus, Bennett, White, and Horvitz. Graus performed the log analysis, developed the task type taxonomy, and implemented the algorithms for predicting task reminder execution time. Graus performed the experiments and analyses. All authors contributed to the text. This work was done while on an internship at Microsoft Research.

Finally, this thesis also indirectly builds on publications on entity linking for generating user profiles for personalized recommendation of events [98], a context-based entity linking method [97], entity linking of search engine queries [100] and tweets [93], and on publications around Semantic Search for E-Discovery [99, 251].
Background

"To be ignorant of what happened before you were born is to remain always a child."
—Cicero, De Oratore, XXXIV
The work presented in this thesis sits at the intersection of the fields of Information Retrieval (IR) and Natural Language Processing (NLP). This thesis deals with enabling and improving access to, and retrieval of, information, and with improving and leveraging computer systems' ability to process, understand, and extract information from natural language. In this chapter, we present the relevant background and related work that serves as a basis for understanding this thesis. We start at the very basics, with a high-level explanation of Information Retrieval and the current state of affairs and challenges, in Section 2.1. Next, we move on to explain Semantic Search, Entity Linking (EL), and Knowledge Bases in Section 2.2. In Section 2.3 we discuss work related to the first part of this thesis, on analyzing, predicting, and retrieving emerging entities. Finally, we discuss the related work for the second part of this thesis, on analyzing and predicting user behavior from digital traces, in Section 2.4.
Information Retrieval

IR is most prominent in our everyday lives in the search engines people consider their entry point to the internet. Sometimes literally: a large share of queries issued to search engines are so-called navigational queries, i.e., queries that represent the user's intent of visiting a specific website [31]. To illustrate: the 10 most frequent queries issued to the Yahoo search engine include facebook, amazon, and yahoo [110]. Modern-day search engines have no trouble addressing these kinds of information needs, quickly serving us websites from indices of many billions of "documents."

But in web search, the landscape is rapidly changing. At Google, the volume of mobile search has surpassed that of desktop search [95]. Around 20% of queries from mobile devices in the US are voice input [208], and over half of the teenagers in the USA use voice input to issue their search queries. Mobile web search changes the game. Voice queries are longer on average, and use richer language than text queries [110]. Displaying results on a mobile device means a shift from the "10 blue links" paradigm towards newer, richer, more reactive, and more interactive ways of presenting the user with results. One such example is the proliferation of knowledge cards in many current web search engines and intelligent personal assistants such as Apple's Siri and Microsoft's Cortana. Knowledge cards are small panels that summarize information to answer a query, e.g., the weather forecast, or central information around an entity of interest. See Figure 2.1 for an example knowledge card.

[Figure 2.1: An example of a knowledge card displaying the weather forecast for San Diego on the search engine result page of DuckDuckGo.]

Even outside mobile web search, in the more "traditional" setting of desktop web search, natural language queries (and in particular natural language questions) are on the rise [262]. Compared to the aforementioned navigational queries, understanding the more complex information needs that come with longer queries is a more challenging task. Web search engines do not adequately support more complex exploratory search scenarios [115].

Outside the domain of web search, plenty more complex information needs and search tasks exist, e.g., (re-)finding email messages [37], desktop search [70], researching prior art in patents [132], or finding related work for a background chapter in a PhD thesis [249].

The shift towards more complex queries in web search and the proliferation of complex search scenarios in other domains are just two examples of where the traditional IR approach of matching words (in queries) to words (in documents), known as lexical matching, no longer suffices. More complex information needs call for a better understanding of these information needs; they call for moving beyond words, towards better understanding of the intents, needs, and meaning that underlie queries and search behavior. Semantic search [153] aims to fill this gap, by incorporating meaning (i.e., semantics), natural language structure, and (external) knowledge into the search process.
Retrieval Models

Before we delve into semantic search, we provide a brief overview of the traditional lexical matching-based retrieval models that we use in this thesis. Generally speaking, a retrieval model aims to assign a score of "relevance" (itself a poorly understood notion) to documents (d) from an index, given a user-issued query (q). The most naive way of quantifying how "relevant" a document is to a query is to count the occurrences of the query words in the document. This query term frequency-based method suffers one drawback: not all words are equally important in a document [164]. Weighting terms according to their importance alleviates this.

TF-IDF
One such way of term weighting is to assign higher weights to terms that occur in fewer documents: the underlying assumption is that words that occur frequently in a document, but in few documents overall, are the most representative of the document. This assumption is modeled by combining a term's frequency with its inverse document frequency [236]. This is the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme:

\[ \mathrm{Score}(q, d) = \sum_{t \in q} tf(t, d) \cdot \log \frac{N}{df_t}. \qquad (2.1) \]

Here, q is the user-issued query, d the document, tf(t, d) the term frequency of query term t in d, N the total number of documents (i.e., the collection size), and df_t the document frequency of query term t in the collection. This is a tried-and-tested, i.e., old, but still frequently used scoring method. Another scoring method (that represents the state of the art in lexical matching-based retrieval), Okapi BM25 [222], shares the same notions of term frequency and inverse document frequency for estimating the relevance of a document to a query, with more elaborate techniques to account for query and document length.
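As a concrete illustration, the following is a minimal sketch of Equation 2.1 in Python. The toy corpus and function names are ours; a production system would precompute these statistics in an inverted index rather than scan documents.

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Eq. 2.1: sum over query terms of tf(t, d) * log(N / df_t)."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = doc_freq.get(t, 0)
        if df > 0:  # terms unseen in the collection contribute nothing
            score += tf[t] * math.log(num_docs / df)
    return score

# Toy collection of tokenized documents (illustrative only).
docs = [["kendrick", "lamar", "is", "a", "rapper"],
        ["kendrick", "is", "a", "city", "in", "idaho"],
        ["a", "rapper", "released", "an", "album"]]
df = Counter(t for d in docs for t in set(d))  # document frequency per term

for i, d in enumerate(docs):
    print(i, tfidf_score(["kendrick", "rapper"], d, df, len(docs)))
```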
Language Modeling and the Query-Likelihood Model

Another way to estimate the relevance of a document to a query is to use language modeling. A statistical language model (LM) is a probability distribution over words [224], which we can use to estimate the probability (i.e., how likely it is) that a query is generated by a particular LM (i.e., a particular distribution). If we compute for each document in a collection its own LM, we can rank the documents based on their probability of generating query q. This ranking method is known as the query likelihood model, which postulates that documents with a higher probability of generating the query should rank higher:

\[ \mathrm{Score}(q, d) = P(q \mid d) = \prod_{t \in q} P(t \mid d)^{n(t, q)}, \qquad (2.2) \]

where n(t, q) is the frequency of term t in query q, and the document LMs are derived from word occurrences in the documents, i.e., the probability of observing a word given a document (P(w | d)) is typically defined as:

\[ P(w_i \mid d_j) = \frac{n(w_i, d_j)}{\sum_{w_{i'} \in d_j} n(w_{i'}, d_j)}, \qquad (2.3) \]

where n(w_i, d_j) is the frequency of term w_i in document d_j. Using these simple word counts yields a unigram LM, also known as the bag-of-words model, where the order of words (in both the query and the document) does not affect the score. Extending this model to incorporate word order yields n-gram models.

While these models are simple and computationally cheap, they suffer one major drawback. Both TF-IDF and the query likelihood model, but also more advanced models, e.g., BM25, rely on lexical matching, i.e., words from the query need to appear in the documents to get non-zero relevance scores.
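The sketch below implements Equations 2.2 and 2.3 directly. As the drawback just described implies, the unsmoothed estimate drops to zero as soon as one query term is missing from the document; the Dirichlet-smoothed variant is one common remedy, included here as an assumption rather than as the exact setup used in this thesis.

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms):
    """Eqs. 2.2-2.3: P(q|d) under an unsmoothed unigram document LM."""
    counts, total = Counter(doc_terms), len(doc_terms)
    p = 1.0
    for t in query_terms:
        p *= counts[t] / total  # zero if t does not occur in the document
    return p

def dirichlet_query_likelihood(query_terms, doc_terms, collection_lm, mu=2000):
    """Dirichlet-smoothed variant: mixes in collection statistics so that
    query terms unseen in the document no longer zero out the score."""
    counts, total = Counter(doc_terms), len(doc_terms)
    p = 1.0
    for t in query_terms:
        p *= (counts[t] + mu * collection_lm.get(t, 1e-9)) / (total + mu)
    return p
```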
Evaluation

A fundamental aspect of IR is evaluation. In order to assess the performance and usefulness of novel retrieval models, methods, and algorithms, it is central to be able to compare the output of different methods (typically rankings), to measure differences and improvements. Evaluation metrics rely on the availability of ground truth, i.e., sets of assessments (e.g., relevance scores for documents per query), usually collected beforehand, which can serve as a gold standard to compare against.

The topic of evaluation in IR is huge, and addressing everything is beyond the scope of this thesis. Here, we restrict ourselves to listing and describing the evaluation metrics we employ in this thesis, which we break down into set-based, rank-based, and classification evaluation metrics. We point the interested reader to [165] for a more comprehensive overview.

Set-based Metrics
Precision corresponds to the fraction of retrieved items (e.g., documents) in the system's output that are relevant (i.e., that are in the ground truth set). It is computed by taking the fraction of True Positives (TP), i.e., the retrieved items that are in the ground truth, over all retrieved items of the system, i.e., the True Positives plus the False Positives (FP), where False Positives correspond to items that are in the system's output but not in the ground truth:

\[ \mathrm{Precision} = \frac{TP}{TP + FP}. \]

Recall is a related set-based metric, which corresponds to the fraction of relevant items that appear in the system's output. It is computed by taking the fraction of True Positives (TP) over the True Positives plus the False Negatives (FN), where False Negatives correspond to items that are in the ground truth but not in the system's output:

\[ \mathrm{Recall} = \frac{TP}{TP + FN}. \]

We employ Precision and Recall to compute our method's effectiveness at predicting emerging entities in Chapter 4.
P@k is a variation of Precision that is helpful in ranking scenarios where the set of ground truth items is very large (e.g., in a web search scenario). P@k corresponds to the Precision of a system's output up to a certain rank (k). We employ P@k in Chapter 5.

Rank-based Metrics
Precision and Recall are set-based metrics, which do not take the ordering of results into account. To turn the set-based Precision into a metric that does take order into account, we can compute Mean Average Precision (MAP). First, Average Precision (AP) is the average Precision over all ranks (i.e., all values of k), up to the point where recall is 1 (i.e., the rank at which all relevant items are retrieved). By subsequently averaging AP over multiple sets (e.g., over multiple queries), we obtain MAP. We employ MAP in Chapter 5 and Chapter 6.

Classification Metrics
Accuracy is a classification metric, i.e., it measures how well a system assigns class labels to a set of samples, by computing the fraction of correct class labels (w.r.t. a set of ground truth labels) over all assigned class labels. We employ accuracy in Chapter 7, where we address a multiclass classification task. We distinguish between micro- and macro-averaged accuracy. Micro-averaged accuracy pools all predictions, computing the accuracy as the total number of correct predictions over the total number of predictions, and hence reflects the class distribution. Macro-averaged accuracy instead computes the accuracy per class and averages these class-specific accuracies over the number of classes, treating all classes equally regardless of their size.
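For reference, the metrics above can be implemented in a few lines; the function names are ours, and the implementations follow the definitions given in this section.

```python
def precision(retrieved, relevant):
    """TP / (TP + FP): fraction of retrieved items that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """TP / (TP + FN): fraction of relevant items that are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(ranking, relevant, k):
    """Precision over the top-k ranked items only."""
    return precision(ranking[:k], relevant)

def average_precision(ranking, relevant):
    """Average of P@k over the ranks at which relevant items are retrieved."""
    relevant = set(relevant)
    hits, ap = 0, 0.0
    for k, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            ap += hits / k
    return ap / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevance_sets):
    """MAP: AP averaged over queries."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevance_sets)]
    return sum(aps) / len(aps)

def micro_accuracy(y_true, y_pred):
    """Pooled accuracy: total correct predictions over total predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_accuracy(y_true, y_pred):
    """Per-class accuracy, averaged uniformly over the classes."""
    classes = sorted(set(y_true))
    per_class = [sum(t == p for t, p in zip(y_true, y_pred) if t == c) /
                 sum(t == c for t in y_true) for c in classes]
    return sum(per_class) / len(per_class)
```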
Semantic Search

With the rise of more complex queries, information needs, search tasks, and challenging domains, the lexical matching-based retrieval models described above may not suffice. Semantic Search, or "search with meaning," aims to move beyond the constraints of simple keyword matching, by incorporating additional "structure" into the search process. This can be in terms of semantic matching, e.g., by applying topic modeling methods to project documents and queries into some (lower-dimensional) semantic space, where matching can happen without direct lexical overlap between query and document [153]. In this thesis, however, we focus on one particular form of additional "structure": real-world knowledge. By incorporating knowledge from external sources, e.g., knowledge bases such as Wikipedia, one can enrich text and improve matching [52, 114]. Incorporating real-world knowledge also better supports the exploratory search process that is inherent to E-Discovery [26, 268]. In particular, the real-world entities that occur in documents are central to question answering from text (the questions being, e.g., the 5 W's) [140].
Entity Linking

Given a KB that describes real-world entities and concepts (e.g., Wikipedia), EL addresses the task of identifying and disambiguating occurrences of entities in text. Or, more specifically:

    Matching a textual entity mention, possibly identified by a named entity recognizer, to a knowledge base entry, such as a Wikipedia page that is a canonical entry for that entity [216].

EL is a key component in modern-day applications such as semantic search and advanced search interfaces, and plays a major role in accessing and populating the Web of Data [25]. It can also help to improve NLP tasks [91], or to "anchor" a piece of text in background knowledge; authors or readers may find entity links to supply useful pointers [116]. Another application can be found in search engines, where it is increasingly common to link queries to entities to present entity-specific overviews [9], or to improve ad-hoc retrieval [52, 114].
EL can be traced back to several related tasks that preceded it, such as record linkage [80] in the database community, and coreference resolution [235] in the NLP community. However, in its current form EL has received a lot of attention from the IR community, as well as industry, since the Text Analysis Conference (TAC) Knowledge Base Population track introduced an EL task in 2009 [174].

The EL task can formally be described as follows: given an entity mention m (a term or phrase) occurring in reference document d, identify the entity e from a knowledge base KB that is the most likely referent of m. In this thesis, EL plays a central role in Part I (i.e., in Chapters 3, 4, and 5). EL consists of two distinct steps: (i) recognizing mentions of knowledge base entities in text (entity mention detection), and subsequently (ii) linking them to their referent knowledge base entries (entity disambiguation). We briefly explain the core challenges and methods for both steps.

Entity mention detection.
The first step, known as entity mention detection, aims to identify the word sequences that refer to entities (entity mentions). One approach is to apply named-entity recognition (NER). NER has a rich history, and has been studied since the 1990s in the Message Understanding Conferences [107]. Years later NER gained more traction as part of the CoNLL shared tasks, where winning systems achieved around 90% accuracy in recognizing named entities in news articles [243]. NER methods typically rely on learning patterns in sequences of words, by leveraging the structure of language, e.g., the functions of words and their surrounding words (entity mentions are typically proper names [243]), or features relating to their surface form (e.g., in many languages capitalized words are more likely to refer to entities).

An alternative approach for entity mention detection is so-called lexical matching, or dictionary-based mention detection [230]. This method leverages the rich metadata of a KB such as Wikipedia, by creating a lexicon or dictionary that maps entities (i.e., Wikipedia page IDs) to entity surface forms (i.e., entity mentions). These surface forms are extracted from the rich variety of different ways used to refer to Wikipedia pages. For example, a single entity on Wikipedia can be represented by its title, anchor texts from hyperlinks on other Wikipedia pages (i.e., the text that is used to hyperlink from one Wikipedia page to another), and redirect pages (mostly manually-added links that cover (common) misspellings or alternative names of Wikipedia pages), e.g., the Wikipedia page for Kendrick Lamar has redirect pages such as K-Dot (Kendrick Lamar's former stage name), and also misspellings, such as Kendrick Lamarr and Kendrick Llama.

After extracting all these candidate entity mentions and generating the mappings of surface forms to entities, statistics on their usage can be further leveraged to estimate the probability that a (sequence of) word(s) is a reference to a KB entity (i.e., an entity mention). One commonly used and intuitive method of estimating the probability that an n-gram is an entity mention is the so-called keyphraseness. It boils down to the fraction of the number of times an n-gram is used to refer to an entity (i.e., the n-gram is used as an anchor text in Wikipedia), over the number of times the n-gram occurs in Wikipedia (including the number of times it is not part of a hyperlink) [181]. To illustrate, the phrase Kendrick appears 5,037 times in Wikipedia articles (in a Wikipedia dump dated June 2014), of which it is part of a hyperlink (only) 24 times, resulting in Kendrick's keyphraseness score of 24/5,037 ≈ 0.005. In contrast, the phrase Kendrick Lamar appears 698 times, of which 501 times as a hyperlink, yielding a keyphraseness score, i.e., the prior probability of the phrase Kendrick Lamar being an entity mention, of 501/698 ≈ 0.72.
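To make keyphraseness concrete, here is a minimal sketch, under the assumption that anchor and total occurrence counts per n-gram have already been extracted from a Wikipedia dump; the Kendrick counts are the ones reported above:

# Pre-extracted corpus statistics (illustrative; the Kendrick counts are
# those reported in the text, from a June 2014 Wikipedia dump).
anchor_count = {"Kendrick": 24, "Kendrick Lamar": 501}
total_count = {"Kendrick": 5037, "Kendrick Lamar": 698}

def keyphraseness(ngram: str) -> float:
    """Prior probability that an n-gram is an entity mention: the fraction
    of its Wikipedia occurrences in which it appears as anchor text."""
    return anchor_count.get(ngram, 0) / total_count[ngram]

print(keyphraseness("Kendrick"))        # 24 / 5037 ~= 0.005
print(keyphraseness("Kendrick Lamar"))  # 501 / 698 ~= 0.72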
Lexical matching-based EL approaches have been shown to be successful in the challenging domain of social streams [176], and have been shown to be suitable for adaptation to new genres and languages [199]. As an added advantage, these methods are independent of language-dependent linguistic annotation pipelines, which are prone to cascading errors [85]. Their drawback, however, is that they are unable to identify entity mentions that are not in the lexicon, e.g., misspellings not covered by Wikipedia's redirect pages.
Entity disambiguation.

Given the detection of (candidate) entity mentions, the next step is to assign the referent entities from the KB to the mentions, or alternatively, determine that the mention refers to an entity that is not in the KB. Also known as entity disambiguation, this task is commonly approached by one of the following two approaches: (i) local and (ii) global entity disambiguation methods.

Local methods attempt to resolve a mention (i.e., assign the correct entity to a mention) by only considering properties of a (candidate) entity mention and the candidate KB entities. A common local method of disambiguating entity mentions is by employing the commonness score [181]. Commonness is a simple fraction of the number of times the entity mention is used as an anchor for a specific entity e, over the number of times it is used as an anchor to any entity in the KB [175]. Continuing with the previous example, the entity mention "Kendrick" is used as an anchor text of a hyperlink 24 times, of which in 8 cases the hyperlink points to the Wikipedia page Kendrick, Idaho, i.e., the probability that the mention "Kendrick" refers to Kendrick, Idaho is 8/24 ≈ 0.33; the probability that the mention refers to Kendrick Lamar is computed analogously, from the number of those anchors that point to that page.
Commonness thus favors the most common target entity for a mention, which makes it work well in practice, particularly in domains that refer to popular entities by nature (e.g., news streams). Because of the emphasis on single occurrences of entity mentions, local EL methods have been shown to be highly effective in domains in which context is limited and/or noisy, e.g., microblog posts [176]. Additional methods have been proposed to expand context when it is limited [33, 42].

There is an increasing interest in approaches that link multiple entities in a document simultaneously [130]. These so-called global methods are based on a notion of "coherence" in the set of entities in a document. There are several methods for defining the "coherence" of a set of entities. One is to rely on the structural information of the KB, e.g., the link or category structure [182], leveraging categories and contexts [47], or by using graph-based metrics [109]. The TAC Knowledge Base Population (KBP) track has seen many systems that incorporate global approaches to EL [41, 48, 81, 130, 211]. Also, several publicly available EL frameworks that aim to resolve each entity in a document apply global methods, e.g., 'DBpedia Spotlight' [177] and the 'GLOW' framework [217]. The most common global approach uses entity "relatedness" [182]. Relatedness is based on the overlap of the sets of "related" entities of a pair of entities. Related entities can be, e.g., all entities that are linked to or from the Wikipedia page that represents the entity of interest.

A known challenge of global methods is that entity mentions in the document can themselves be ambiguous, i.e., the number of pairwise comparisons between candidate entities for multiple mentions is exponential in their number. The problem of optimizing the global coherence function is NP-hard [47]. A common strategy to deal with this problem is to decrease the search space, e.g., by only considering non-ambiguous entity mentions (i.e., mentions that are only associated with a single target entity) [182]. While this evidently reduces complexity, the presence of such entities cannot always be assumed. Another approach is to resolve the ambiguous entity mentions to their most popular candidate [209]. Cucerzan [47] introduces a method for joint disambiguation that does not consider all combinations of assigning one disambiguation per surface form at a time. Instead, all possible disambiguations of a surface form (candidates) are considered simultaneously. Another method is to set a threshold on the number of entities to consider per mention (e.g., at most 20 candidate entities [217]). Ratinov et al. [217] train a local approach to resolve individual entity mentions, before applying the global approach for coherence.
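A sketch of commonness-based local disambiguation follows, assuming anchor-to-entity counts extracted from Wikipedia hyperlinks. Only the Kendrick, Idaho count (8 of 24) comes from the text; the other targets and their counts are hypothetical, added to complete the 24 anchor occurrences:

from collections import Counter

# Anchor-text statistics extracted from Wikipedia hyperlinks (illustrative;
# only Kendrick, Idaho's count of 8 is from the text, the other entries are
# hypothetical fillers for the remaining 16 of 24 anchor occurrences).
link_counts = {
    "Kendrick": Counter({
        "Kendrick, Idaho": 8,
        "Kendrick Lamar": 10,  # hypothetical
        "Anna Kendrick": 6,    # hypothetical
    }),
}

def commonness(mention: str, entity: str) -> float:
    """P(entity | mention): how often this mention's anchor text points to
    the entity, over all its uses as an anchor."""
    counts = link_counts[mention]
    return counts[entity] / sum(counts.values())

def disambiguate(mention: str) -> str:
    """Local disambiguation: pick the most common target entity."""
    return link_counts[mention].most_common(1)[0][0]

print(commonness("Kendrick", "Kendrick, Idaho"))  # 8 / 24 ~= 0.33
print(disambiguate("Kendrick"))  # favors the most common target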
Entity linking in social streams.

The domain of social streams brings additional challenges to the EL task. Whereas typical NER approaches perform in the upper 90% range in terms of accuracy on "clean," grammatically correct and correctly spelled news articles, the accuracy drops considerably when performing NER on misspelled, short, and noisy text, such as social media posts. The noisy character and unreliable capitalization of social streams degrade the effectiveness of NER methods [61, 62, 85]. One approach to addressing this is to tailor the linguistic annotation pipeline to Twitter, which relies heavily on large amounts of training data [27, 221]. Another approach to combat the lack of context, proposed by Cassidy et al. [42], is to expand a social media post with additional posts of the author, and additional posts found by a clustering algorithm, before optimizing the EL system's output for global coherence. Finally, incorporating spatiotemporal features [78] has been shown to improve EL effectiveness in social streams. The method we follow in this thesis, however, embraces the limited context and ignores the notion of global coherence, opting instead for machine learning methods with features that focus on learning mappings between individual n-grams and entities [176].

In Part I of this thesis, we focus on analyzing, predicting, and retrieving "emerging entities," i.e., entities that are not (yet) part of the KB. This falls into the domain of KB construction (KBC) or (cold start) knowledge base population (KBP), which encompasses the ambitious goal of creating a KB from raw input data (e.g., a collection of documents). KBP and KBC span many smaller subtasks, e.g., (new) entity mention detection, entity clustering, relation extraction, and slot filling [69, 193].

We narrow our focus, and address the analysis, prediction, and representation of emerging entities in three successive chapters. More specifically, in Chapter 3 we study whether there exist common temporal patterns in how entities emerge in online text streams, in Chapter 4 we propose a method that leverages EL to generate training data for predicting new mentions of emerging entities in social streams, and finally, in Chapter 5 we study a method for improving entity ranking by mining additional entity descriptions from different external sources, e.g., social media and search engine logs.

2.3. Emerging Entities
In the first chapter, we analyze entities as they "emerge," i.e., we study the time series of entity mentions in online text streams in the time span between their first mention (i.e., when the entity surfaces) and the moment at which the entity is incorporated into the KB.

Previous work on studying the expansion of Wikipedia through the addition of new pages studies the phenomenon from the perspective of Wikipedia itself, e.g., by analyzing how newly created articles fit in Wikipedia's semantic network, studying the relation between activity on talk pages and the addition of new content to articles, or by studying controversy and disagreement on new content through "edit wars" [134, 137, 272, 273].

Studying emerging entities has received considerable attention from the natural language processing and information retrieval communities. Most notably, different methods and systems for identifying and linking unknown or emerging entities have been proposed [120, 155, 192, 255]. More recently, Färber et al. [79] formalize and analyze the specific challenges and aspects that come with linking emerging entities, while Reinanda et al. [218] study the problem of identifying relevant documents for known and emerging entities as new information comes in.
Addressing the task of detecting entities that are not in the KB, i.e., identifying emerging entities, can be traced back to the introduction of the "NIL clustering" subtask in the TAC KBP Entity Linking track in 2011 [130]. The task was a straightforward EL task: participants were given a reference KB (derived from Wikipedia), a collection of documents (mostly news articles), and offsets for entity mentions that occur in the documents. The task was to decide for each mention whether it refers to an entity that exists in the reference KB (and if yes, which), or whether it refers to a "NIL entity": an entity that is not in the KB. In the latter case, the system needs to assign a new ID (NIL ID) to the mention, and assign other mentions that refer to the same underlying real-world entity to the same NIL ID (i.e., cluster mentions to NIL IDs). Since common EL methods rely on ranking candidate entities for mentions, the inclusion of mentions of out-of-KB entities is typically approached by learning a threshold (for the EL system's confidence or retrieval score) to determine whether or not the mention refers to an entity already described in the KB. This approach was chosen, e.g., by [35, 144, 192].

In the TAC KBP Entity Linking track the entity mention detection step, shown to be the bottleneck in state-of-the-art EL systems [111], was not part of the task. However, in a realistic scenario, when entity mentions are not given, Wikipedia lexicons and/or lexical matching-based approaches cannot detect unseen entity mentions of new entities, which means different approaches need to be leveraged, e.g., named-entity recognition (NER). The method for detecting emerging entities with ambiguous names, presented by Hoffart et al. [120], leverages lexical matching methods for detecting entity mentions. However, in this approach, only ambiguously named emerging entities can be detected, making it complementary to the method we present in Chapter 4, which focuses on detecting new (i.e., unseen) mentions of emerging entities. Lin et al. [155] learn to identify entity mentions of absent entities by leveraging Google Books' n-gram statistics. Their method relies on part-of-speech tagging to identify noun phrases in documents, and deciding for each noun phrase that cannot be linked to a Wikipedia article whether or not it refers to an (absent) entity. Another way to model the task of detecting emerging entities is entity set extraction, where given a seed list of entity mentions, the task is to retrieve more entities that are similar to the seed set [203]. However, the task differs as it does not rely on a KB, and consequently, the rich structure and metadata that comes with it.
Automatically generated pseudo-ground truth.
The approach we propose in Chapter 4 revolves around automatically generating training data using an EL system. Automatically generating training data for machine learning methods has been a longstanding and recurring theme in machine learning, dating back to the 1980s [241].

The idea of using EL systems for generating pseudo-ground truth, or silver standard training data, is not new either. Guo et al. [108] used an EL system to collect additional context words for entities, to aid the disambiguation step. Their mention detection step relies on the Wikipedia lexicon, meaning their system is unable to detect unseen entity mentions. Similarly, Zhang et al. [278] apply EL to improve the disambiguation step in EL, by replacing unambiguous mentions in entity-linked documents with ambiguous ones to generate artificial but realistic training data. Zhou et al. [279] generate training data from Wikipedia to improve the disambiguation step as well. More specifically, they consider Wikipedia links as positive examples, and generate additional negative samples by taking each other entity that may be referred to by the same anchor text.

Attempts have been made to simulate human annotations for generating training data for named-entity recognition, too. Kozareva [143] uses regular expressions for generating training data for NERC. Nothman et al. [195] leverage the anchor text of links between the same articles in different Wikipedia translations for training a multilingual NERC system. Wu et al. [269] investigate generating training material for NERC from one domain and testing on another. Becker et al. [21] study the effects of sampling automatically generated data for training NER.

What is new to the method for predicting mentions of emerging entities we propose in Chapter 4 is that it directly optimizes for the task at hand: we generate training data for a named-entity recognizer to detect a specific type of entity, i.e., entities that are worthy of being added to the KB.
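To give a flavor of this family of approaches (a sketch, not the exact pipeline of Chapter 4): run an EL system over unlabeled documents and keep only its confident links as token-level, silver-standard NER labels. The entity_link function and the confidence threshold are hypothetical:

# Sketch: turn the output of an entity linker into silver-standard NER
# training data. `entity_link` is a hypothetical EL system returning
# (start, end, entity, confidence) token spans for a document.

def make_silver_labels(doc_tokens, entity_link, min_conf=0.8):
    """Emit BIO labels from confident entity links (pseudo-ground truth)."""
    labels = ["O"] * len(doc_tokens)
    for start, end, entity, conf in entity_link(doc_tokens):
        if conf < min_conf:  # keep only confident links as training signal
            continue
        labels[start] = "B-ENT"
        for i in range(start + 1, end):
            labels[i] = "I-ENT"
    return list(zip(doc_tokens, labels))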
A task closely related to EL is entity ranking: in the end, most EL systems approach the disambiguation problem as an entity ranking problem. However, there is a huge difference between ranking entities for a candidate entity mention, and ranking entities for a user-issued query. In particular, there may be a vocabulary gap between how users refer to entities, and how entities are described in knowledge bases. The task of retrieving the correct entity for user-issued queries is of central importance in the general web search domain [36], but also in a discovery or exploratory search setting, where analysts may be interested in retrieving entities with partial names, or nicknames.

As with many NLP and information extraction tasks, research into entity ranking too has flourished in recent years, largely driven by benchmarking campaigns, such as the INEX Entity Ranking track [57–59], and the TREC Entity track [13, 14]. Entity ranking is commonly addressed, much like EL, by exploiting content and structure of knowledge bases, for example by including anchor texts and inter-entity links [230], category structure [12, 226], entity types [135], or internal link structure [252].

More recently, researchers have also started to focus on using "external" information for entity ranking, such as query logs, i.e., users' past search behavior. Billerbeck et al. [24] use query logs of a web search engine to build click and session graphs, and walk these graphs to answer entity queries. Mottin et al. [190] use query logs to train an entity ranking system using different sets of features. Hong et al. [123] follow a similar line of thought and enrich their knowledge base using linked web pages and queries from a query log.

The entity ranking task is usually approached in a static setting, in which web pages and queries are added to the knowledge base before any ranking is performed. One of the few initial attempts to bring time into entity ranking is a position paper by Balog and Nørvåg [11], who propose temporally-aware entity retrieval, in which temporal information from knowledge bases is required. Finally, closing the loop, EL can be applied to queries to improve entity ranking [114]. Entity ranking is inherently difficult due to the potential mismatch between the entity's description in a knowledge base and the way people refer to the same entity when searching for it.
Document expansion.
One way to bridge this gap is to expand (textual) entity representations with terms that people use to refer to entities. Collecting terms associated with documents, and adding them to the document (document expansion), is in itself not a novel idea. Previous work has shown that it improves retrieval effectiveness in a variety of retrieval tasks. Singhal and Pereira [232] are among the first to use document expansion in a (speech) retrieval setting, motivated by the vocabulary mismatch introduced by errors made by automatic speech transcription. Ever since, using external sources for expanding document representations has been a popular approach to improve retrieval effectiveness. In particular, it was shown that anchors can improve ad-hoc web search [73, 180, 259, 270]. Kemp and Ramamohanarao [138] state that document transformation using search history, i.e., adding queries that lead to the document being clicked, brings documents closer to queries and hence improves retrieval effectiveness. Similarly, Xue et al. [271] study the use of click-through data by adding queries to clicked document representations. In this case, the click-through data includes a score that is derived from the number of clicks the query yields for a single document. Gao et al. [92] follow a similar approach, but add smoothing to click-through data to counter sparsity issues. Amitay et al. [6] study the effectiveness of query reformulations for document expansion by appending all queries in a reformulation session to the top-k returned documents for the last query. Scholer et al. [226] propose a method to either add additional terms from associated queries to documents or replace the original content with these associated queries, all with the goal of providing more accurate document representations.

Looking at other sources for expansion, Bao et al. [15] improve web search using social annotations (tags). They use the annotations both as additional content and as a popularity measure. Lee and Croft [147] explore the use of social anchors (i.e., content of social media posts linking to a document) to improve ad hoc search. Noll and Meinel [194] investigate a variety of "metadata" sources, including anchors, social annotations, and search queries. They show that social annotations are concise references to entities and outperform anchors in several retrieval tasks. Efron et al. [71] show that document expansion can be beneficial when searching for very short documents (tweets).

As an alternative to expanding entity representations, more recently, advances in neural nets and deep learning have given rise to (continuous) entity representations, i.e., entity embeddings [28, 156, 234]. One of the drawbacks of these methods is that it is not trivial to employ them in an online, dynamic scenario, where representations may continually change.
Fielded retrieval.

A common approach to incorporate different document expansions into a single document representation is to create fielded documents [223, 277]. Based on fielded documents, a variety of retrieval methods have been proposed. Robertson et al. [223] introduce BM25F, the fielded version of BM25, which linearly combines query term frequencies over different field types. Broder et al. [32] propose an extension to BM25F, taking into account term dependencies. Svore and Burges [239] use a machine learning approach for learning BM25 over multiple fields, the original document fields (e.g., title and body) and so-called "popularity" fields (e.g., anchors, query-clicks). Macdonald et al. [162] compare the linear combination of term frequencies before computing retrieval scores to directly using retrieval scores in the learning to rank setting, and show that it is hard to determine a clear winner. Pérez-Agüera et al. [206] note how one of the challenges in structured IR is the fact that field importance differs among collections, and that different collections mean field importance needs to be optimized per collection.
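To make the fielded scoring concrete, one common formulation of BM25F (a sketch of the usual presentation, not necessarily the exact variant of [223]) first combines per-field term frequencies linearly using field weights, and then applies BM25-style saturation to the combined frequency:

\tilde{tf}(t, d) = \sum_{f} w_f \cdot \frac{tf(t, f_d)}{1 + b_f \left( \frac{|f_d|}{\mathit{avgl}_f} - 1 \right)}

\mathit{score}_{BM25F}(q, d) = \sum_{t \in q} \frac{\tilde{tf}(t, d)}{k_1 + \tilde{tf}(t, d)} \cdot \mathit{idf}(t)

Here tf(t, f_d) is the frequency of term t in field f of document d, |f_d| the field length with average avgl_f, and b_f and k_1 the usual BM25 parameters; the field weights w_f control how much, e.g., an anchor or query-click field contributes relative to the body.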
In Part II of this thesis, we study the link between people's digital traces and their activities. Mining digital traces has a rich history in exploratory analyses and improving search engines and online products, as discussed in Chapter 1. More specifically, large-scale user logs from many users have been used for a range of purposes to improve online services and advance our understanding of how people use systems. Search engine queries and search-result clicks have been used to understand how people seek information online [260], train search engine ranking algorithms to better serve user needs [4, 131], and more generally, teach us about how humans behave in the world [220]. Although large-scale log analysis of online behavior has typically focused on search and browsing activity, recent work has targeted the large-scale usage of communication tools too, such as instant messaging [148] and email [142].

In this thesis, we focus on mining digital traces with the aim to understand the underlying behavior that gives rise to the traces. We address an email prediction task in Chapter 6, and aim to gain insights into usage patterns of the reminder service of an intelligent personal assistant in Chapter 7.
Email has a rich history as an object of study in computer science, once more spurred by the availability of data, specifically, the acquisition and subsequent release of the "Enron corpus" by computer scientist Andrew McCallum [141], for a reported sum of $10,000 [168].
Research initially focused on information extraction tasks such as recognizing speech acts [38, 44]; people-related tasks, such as name recognition and entity resolution [66, 76, 183, 184], contact information extraction [10, 49], identity modeling and resolution [75], and discovery of people's roles [150]; and finally email-related tasks, such as automated classification of emails into folders [22], e.g., by leveraging the social network features of the email network [274]. The Enron email collection was also used very directly in an E-Discovery task in the TREC Legal tracks [46, 117], which revolved around finding responsive documents for a given production request (in litigation).

Studying email collections as communication graphs (or social networks) has yielded insights into how communication patterns change and relate to real-world events (e.g., how interpersonal communication intensified during Enron's crisis period [67]), how important nodes (= email addresses) or influential members of the network can be discovered by looking at network properties [231], and how one can discover shared interests among members of an email network [228].

More recently, research interest in email has been rekindled, spurred by publications from industry, where access to huge amounts of real email data and traffic has resulted in work on automated classification of email into folders [106], predicting recipients' follow-up actions for emails [51, 65], predicting an email's (personalized) "importance" [2], and inferring the activities that govern email communication [210].

Finally, work on recipient prediction (the task we address in Chapter 6) has generally included the additional constraint that one or more seed recipients are given, a task also known as CC-prediction [127, 202]. Previous attempts at the type of recipient prediction we address leverage either the communication network, constructed from previously sent emails [225], or the content of the current email [39]. Moreover, motivated by privacy concerns, previous work typically addresses recipient prediction by restricting prediction to a sender's ego network (i.e., only considering the local network of the sender, as opposed to the global communication network). In Chapter 6 we ignore these constraints, motivated by our offline, batch, discovery scenario, and instead attempt to gain insights into enterprise-wide communication patterns, and in particular the importance of communication graph properties and email content features.
As mentioned in Chapter 1, interaction logs of personal intelligent assistants may provide rich digital traces for inferring people's real-life activities. In the case of intelligent assistants, analyzing user logs may support inferring the users' intents [7] or current activities [204]. Where previous work addressed modeling long-term goals [18], the Cortana reminder logs we study in Chapter 7 may help us in understanding short-term goals. To study the relation between a user's reminders and real-life activities and tasks, several areas of research are relevant. We focus largely on research on memory and (completing planned) tasks, and review research in the following areas: (i) reminders, (ii) memory aids, and (iii) prospective memory.
Reminders.
A number of systems have been developed to help remind people of future actions [63, 64, 133, 146, 172], many of which leverage contextual signals for more accurate reminding. These systems can help generate reminders associated with a range of future actions, including location, events, activities, people, and time. Two of the most commonly supported types of reminders are location- and time-based (and combinations thereof [63, 169]). Location-based reminders fire when people are at or near locations of interest [160, 245]. Time-based reminders are set and triggered based on time [83, 119], including those based on elapsed time-on-task [43]. While time-based reminders can provide value to many users, particular groups may especially benefit from them. These include the elderly [173], those with memory impairments [136], and people seeking to comply with prescribed medications [113]. In Chapter 7 we study time-based reminders in the Cortana reminder service, and omit location- and person-based reminders, which are less common in our data, and more challenging to study across users as they rely heavily on personal context and relationships between the user and the locations and persons that trigger the reminders.
Memory aids.
Memory aids help people remember past events and information. Studies have shown that people leverage both their own memories, via recall strategies, and external memory aids to increase the likelihood of recall [129]. Aids can assume different forms, ranging from paper [163] to electronic alternatives [17, 82, 219]. One example of a computer-based memory aid is the Remembrance Agent [219], which uses words typed into the text processor to retrieve similar documents. People also leverage standard computer facilities to support future reminding (e.g., positioning documents in noticeable places on the computer desktop) [17], which is inadequate for a number of reasons, including that the reminder is not pushed to the user [83]. Other work has focused on the use of machine learning to predict forgetting, and the need for reminding about events [133]. Cortana is an example of an interactive and intelligent external memory aid. Studying usage patterns and user behavior lets us better understand users' needs, develop improved methods for system-user interaction and collaboration, and, more generally, enhance our understanding of the types of tasks for which memory aids are necessary.
Prospective memory.
Prospective memory (PM) refers to the ability to remember actions to be performed at a future time [30, 72]. Beyond simply remembering, successful prospective memory requires recall at the appropriate moment. PM failures have been an area of study [74, 229]. Studies have shown that failures can be linked to external factors such as interruptions [50, 198]. Prospective tasks are usually divided into time-based tasks and event-based tasks [72]. Time-based tasks are tasks targeted for execution at a specific future time, while event-based tasks are performed when a particular situation or event occurs, triggered by external cues, e.g., a person or a location [30]. Laboratory studies of PM have largely focused on retention and retrieval performance of event-based PM, as this is straightforward to operationalize in an experimental setting. Time-based PM is a largely overlooked type in PM studies [68], as this type of "self-generated" PM is difficult to model in a lab setting. The Cortana reminder logs we study in Chapter 7 represent a real-life collection of time-based PM instances. They provide insights into the type and nature of tasks users are likely to forget to execute.

2.5. What's next
The rest of this thesis is structured in two self-contained parts that can be read independently.

In Part I of this thesis, our entities of interest are emerging real-world entities, i.e., entities that are not (yet) described in publicly available KBs such as Wikipedia. First, we study how entities emerge in online text streams in Chapter 3. Next, we set out to predict newly emerging entities in social streams in Chapter 4. Finally, in Chapter 5 we address the dynamic nature in which entities of interest may appear in online text streams, and leverage collective intelligence to improve the retrieval effectiveness of real-world entities by capturing the changing contexts in which they may appear.

Next, in Part II we switch our focus, and our entities of interest are the producers of the digital traces. Here, we present two case studies. First, we analyze and predict email communication by studying the communication network and email content in Chapter 6, and then we analyze interaction data of users with an intelligent assistant, to predict when a user will perform an activity, in Chapter 7.

Finally, we summarize our findings, formulate implications and limitations of our work, and highlight some areas for potential future work in Chapter 8.
Part I

Analyzing, Predicting, and Retrieving Emerging Entities

3. Analyzing Emerging Entities

"The two most important days in your life are the day you are born and the day you find out why." —Mark Twain
In the first part of this thesis we focus on emerging real-world entities, i.e., entities that are not (yet) described in a knowledge base. In this first chapter, we take a closer look at how these entities emerge. This chapter serves as an introductory study into emerging entities, our object of study, which we study in subsequent chapters in the contexts of predicting them in social streams (Chapter 4), and enabling their effective retrieval (Chapter 5).

We study entities that emerge in online text streams, and are subsequently added to Wikipedia, the largest publicly available and collectively maintained KB. Wikipedia has been dubbed a global memory place [205], where collective memories are stored [112]. The moment when an entity is incorporated into Wikipedia, i.e., stored in collective memory, can thus be considered parallel to the collective's discovery of an entity. Studying emerging entities, hence, provides insights into what the collective considers entities of interest.

We take a macro view and aim to understand the properties of and circumstances under which entities of interest emerge. We answer the following research question: Are there common temporal patterns in how entities of interest emerge in online text streams? (RQ1).

Studying online text streams and Wikipedia allows us to make observations and gain insights into the process of emerging entities at a large scale. Every day, new content is being added to Wikipedia, with the knowledge base receiving over 6 million monthly edits at its peak [238]. These edits may appear under different circumstances. Domain experts may find information missing on Wikipedia, and take up the task of contributing this new information. Alternatively, as events unfold in the real world, new, previously unknown, and unheard-of entities (people, organizations, products, etc.) emerge into public discourse.
These new entities emerge online in news articles and social media postings that may describe or comment on events, e.g., the Olympics may introduce new athletes onto the world stage, a newly announced smartphone or video game console may generate a wave of activity on social media, or the opening of a new restaurant may be reported in local news media and pop up in social media.

In this chapter, we analyze a sample of online social media and news text streams, spanning over 18 months and comprising over 579 million documents. We focus on the emergence patterns of these entities, i.e., how a new entity's exposure develops and evolves in the timespan between the entity's first mention in online text streams, and when an article devoted to the entity is subsequently added to Wikipedia. More precisely, we define an entity's emergence pattern to be its "document mention time series," i.e., the time series that represents the number of documents that mention a specific entity over time, starting at the moment it is first mentioned, up to the moment it is incorporated into Wikipedia. (We use the English version of Wikipedia; see Section 3.3 for details on the data used in this chapter.)

An example of one such emergence pattern is shown in Figure 3.1.

Figure 3.1: Emergence pattern of the entity Curiosity (Rover), first mentioned in our text stream in October 2011. The Wikipedia page for Curiosity was created nine months later, on August 6, 2012. There are two distinct bursts: one late November 2011, the second shortly before the entity is added to Wikipedia. The two document bursts correspond to the Mars rover's launch date (November 26, 2011) and its subsequent Mars landing (August 6, 2012). The time series shows us that while the launch did generate publicity, yielding a burst of documents (and signaling increased attention for the entity), at that point it was not deemed important enough to be added to Wikipedia.

The time series shows the number of documents that mention a given entity per day on the y-axis (i.e., the emergence volume), with the x-axis representing the time between the first mention of the entity in the online text stream and the day it is added to Wikipedia (i.e., the emergence duration). Since emergence durations naturally differ between entities, our time series are of variable length. As we are interested in broad attention patterns, we study document mention time series, as opposed to total mention volume. And because we want to study global, broad, and long-term patterns, our time series are at a granularity of days, not hours.

The main findings of this chapter are as follows. By clustering time series of mentions of entities as these entities emerge in news and social media streams, we find broadly two different emergence patterns: entities that show a strong initial burst around the time of their introduction into public discourse, and late bursting entities that exhibit a more gradual emergence pattern. Furthermore, we find meaningful differences between how entities emerge in social media and news streams: entities that emerge in social media streams exhibit slower emergence than those that emerge in news streams. Finally, we show how different types of real-world entities exhibit different emergence patterns; we find that the entities that emerge fastest are entity types that know shorter life-cycles, such as devices (e.g., smartphones) and "cultural artifacts" (e.g., movies and music albums).

We proceed by formulating three sub-research questions in Section 3.2. The data that we use in our study is described in Section 3.3.
In Section 3.4 we detail the methods used to analyze the data. Results of our analysis are presented in Section 3.5. We conclude in Section 3.6.

3.2. Research Questions

In studying emergence patterns of entities, we apply three different methods for grouping entities, representing alternative views on emerging entities. In Section 3.5.1 below, we apply a burst-based unsupervised hierarchical clustering method to cluster similar entity emergence patterns, so as to discover groups of entities with broadly common emergence patterns. That is, we group entity time series by similarities in the peaks of comparatively higher activity (i.e., peaks of exposure in public discourse). This type of analysis is meant to help us answer the following research question:
RQ3.1
Can we discover distinct groups of differently emerging entities by clustering their emergence time series?

We show that entities emerge in two distinct patterns: so-called early bursting (EB) entities, that show a strong initial burst around the time of their introduction into public discourse, or late bursting (LB) entities, that exhibit a more gradual emergence pattern. EB entities are shown to exhibit shorter emergence durations than LB entities, i.e., they are deemed entities of interest more rapidly.

In Section 3.5.2, we adopt an alternative view of emerging entities, and examine their emergence patterns in different types of text stream, comparing patterns between entities that emerge in news versus social media streams. This view is motivated by perceived differences in the nature of the professionally curated, authoritative, and high-impact "mainstream media" versus the user-generated, unedited, social media streams. We apply two grouping methods. First, we group the time series by type of text stream, and provide their descriptive statistics, hypothesizing that news streams will exhibit shorter emergence durations than social media streams, due to their reach and impact. We also analyze the cross-pollination between the two types of stream, i.e., we study whether entities appear first in either of the streams, or whether they appear in both at the same time, etc. These analyses help us answer the following research question:
RQ3.2
Do news and social media text streams exhibit different emergence patterns?

We find that news and social media streams show broadly similar emergence patterns for entities. However, entities that first emerge in social media streams are on average incorporated into the KB comparatively faster than entities that first emerge in news streams.

Finally, in Section 3.5.3, we study the similarities and differences between different types of entities as they emerge in public discourse. Specifically, we leverage DBpedia, the structured counterpart of Wikipedia, to group emerging entities by their underlying entity types, such as companies, athletes, and video games. This allows us to gain insights into whether different types of entities exhibit different emergence patterns. In addition, by considering entity types we study whether the news stream and the social stream exhibit different focal points, i.e., whether professionally curated news streams exhibit a focus on the mainstream and user-generated social media streams surface more niche entities. This analysis helps us answer the following research question:
RQ3.3
Do different types of entities exhibit different emergence patterns?

We find that different entity types exhibit substantially different emergence patterns. Furthermore, we find that specific entity types are more or less likely to emerge in either the news or social media streams, which can be largely attributed to the different nature of both streams and their authors. Finally, we find that entity types that have shorter emergence durations remain popular over time. These findings suggest that the entity types that are incorporated relatively fast into the KB play a more central role in public discourse.

3.3. Data
As described in the introduction, we study the emergence patterns of entities by looking at the lead time between the first mention of an entity, i.e., its first appearance in online text streams, up to the moment it is incorporated into Wikipedia. To study these emergence patterns, we generate a custom dataset of timestamped documents that are annotated retrospectively with links to Wikipedia, including for each link (i) the creation date of the associated Wikipedia page, and (ii) whether the Wikipedia page existed at the time the document was created. Our dataset spans 7.3 million documents with 36.2 million references to 79,482 unique emerging entities. We create this dataset by extending an existing document-stream dataset (TREC KBA's StreamCorpus 2014, available at http://trec-kba.org/kba-stream-corpus-2014.shtml) with an additional set of annotations to Freebase (FAKBA1, available at http://trec-kba.org/data/fakba1/). We enrich the dataset by including Wikipedia creation dates of the Freebase entities, and the relative age of the entity with respect to the document. We subsequently reduce this dataset by filtering out documents that have "future leaks," i.e., that contain references to entities whose Wikipedia pages were created after the dataset's timespan. Below, we describe our data acquisition and preparation processes in detail.
Our dataset is derived from the TREC KBA StreamCorpus 2014, a dataset of roughly 1.2 billion timestamped documents from various sources, e.g., blog and forum posts and newswires. It spans 572 days (or roughly 18 months), from October 7th, 2011 to May 1st, 2013 (line 1 in Table 3.1). The StreamCorpus is composed of multiple subsets, with slightly different periods and volumes (i.e., numbers of documents). All documents in the corpus that were automatically classified as being written in English have been automatically tagged for named entities with the Serif tagger [29], yielding roughly 580M tagged documents. This annotated subset of documents was the official collection for the TREC Knowledge Base Acceleration (KBA) task in 2014 [89].

Dalton et al. [53] further annotated these 580M documents with Freebase entities, resulting in the Freebase Annotations of TREC KBA 2014 Stream Corpus, v1 (FAKBA1) dataset, which spans over 394M documents (line 2 in Table 3.1). It is essential to our study that the Freebase dump used for linking the entities is dated after the span of the StreamCorpus. Because of this, we can isolate entities that are mentioned (i.e., linked) in documents before their corresponding Wikipedia page was created.

To extract Wikipedia page creation dates for the entities present in the TREC KBA StreamCorpus 2014, we first map the Freebase entities linked in the FAKBA1 collection to their corresponding Wikipedia pages, using the available mappings in Freebase. We extract Wikipedia page creation dates from a dump of Wikipedia with the full revision history of all pages (enwiki-latest-page-meta-history.xml). We append the Wikipedia page creation date or entity timestamp (denoted e_T) to each entity in FAKBA1. In addition, we include the entity's "age" relative to the document timestamp (doc_T): the difference between the Wikipedia page creation date and the document timestamp, i.e., e_age = e_T − doc_T. The resulting dataset, FAKBA1 extended with the entity age and entity timestamp, is denoted FAKBAT (Freebase Annotations of TREC KBA 2014 Stream Corpus with Timestamps); see line 3 in Table 3.1.

As a next step, we filter to retain only documents that contain emerging entities. Emerging entities are entities with e_age < 0, i.e., entities mentioned in documents dated before the entity's Wikipedia creation date. We denote the resulting subset of FAKBAT documents with emerging entities OOKBAT (Out of Knowledge Base Annotations (with) Timestamps). This yields a set of nearly 24M documents (line 4 in Table 3.1).

To be able to study an emerging entity's complete emergence pattern, we take two additional filtering steps. First, we prune entities with creation dates later than the last document in our stream, to ensure the entities emerged in the timespan of our document stream, i.e., we remove all entities whose Wikipedia page has a creation date later than May 1st, 2013. Next, we prune all entities that are mentioned in fewer than 5 documents, to be able to visualize and study their time series. This yields our final dataset, which comprises 79,482 emerging entities (line 5 in Table 3.1).
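A minimal sketch of the entity age computation and the emerging-entity filter described above, assuming the Wikipedia page creation date e_T and the document timestamp doc_T are available as datetime objects:

from datetime import datetime

def entity_age(e_T: datetime, doc_T: datetime) -> float:
    """e_age = e_T - doc_T, in days; negative means the document mentions
    the entity before its Wikipedia page was created."""
    return (e_T - doc_T).total_seconds() / 86400.0

def is_emerging(e_T: datetime, doc_T: datetime) -> bool:
    """Emerging entities are those mentioned before their page creation."""
    return entity_age(e_T, doc_T) < 0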
Entity Types

In the analysis in Section 3.5.3, we leverage an entity's "class" from the DBpedia ontology (see http://mappings.dbpedia.org/server/ontology/classes/). Freebase provides mappings to Wikipedia and its structured counterpart DBpedia.

Table 3.1: Descriptive statistics of our dataset acquisition. Coverage over the preceding dataset in brackets. Looking at the second and third rows in the table, we note that roughly two-thirds of the FAKBA1 entities can be mapped to Wikipedia. However, this portion represents 98% of the mentions. Closer inspection showed that the missing one-third, i.e., the entities we could not map to Wikipedia creation dates, were Freebase entities that were not linked to their Wikipedia counterparts, most notably WordNet concepts and entities from the "MusicBrainz" knowledge base (e.g., artists and albums). The last two rows show that one in ten of the entities emerge during the span of the dataset; however, they constitute a mere 1% of the mentions.
Dataset                 Entities             Mentions                  Documents
1. TREC KBA [89]        N/A                  N/A                       579,838,246
2. FAKBA1 [53]          3,272,980            9,423,901,114             394,051,027 (68.0%)
3. FAKBAT               2,254,177 (68.9%)    9,221,204,641 (97.8%)     394,051,027 (100%)
4. OOKBAT               225,291 (10.0%)      94,929,292 (1.0%)         23,896,922 (6.1%)
5. Emerging entities    79,482 (35.3%)       36,242,096 (38.2%)        7,291,700 (30.5%)

In the DBpedia ontology, an entity is assigned to one or more classes in a tree-like class structure. We map each of our emerging entities to the classes assigned in DBpedia, e.g., the entity Barack Obama is mapped to the Person, Politician, Author, and Award Winner classes. We extract these mappings by extracting all triples that have the rdf:type property (e.g., a triple assigning Barack Obama to one of these DBpedia classes).
Entity Popularity
Finally, as a proxy for an entity's popularity, which we use to study the composition of clusters and the different substreams in Section 3.5.3, we extract Wikipedia pageview statistics. More specifically, we extract the total number of pageviews each entity received during 2015. We choose to use the pageview counts of a year that falls outside of the timespan of our dataset so as to minimize the effects of timeliness (i.e., we want to separate the true "head" entities from the ones that have a shorter lifespan).
We use the Emerging Entities dataset that we have created (line 5 in Table 3.1) to generate time series that describe their emergence. Figure 3.2 shows the document volume over time of the different streams in our dataset.

Figure 3.2: Total document volume (i.e., count) for the news (green line) and social media (grey line) text streams over time (n = 23,896,922). The streams are composed of multiple separate datasets, explaining the gap in May through mid-July of the news stream. (Best viewed in color.)

It shows a larger number of news documents compared to social media documents in the first half of the data, and a larger number of social media compared to news documents in the second half. In total, the news stream comprises 1,836,022 documents, making it substantially smaller than the social media stream, at 5,357,014 documents. The news and social media streams themselves are composed of multiple datasets from different sources [88], which explains the gap seen from around May 2012 up to somewhere in July 2012.

We note that 79,482 entities emerge during the 18+ month (572 days) timespan of the TREC KBA StreamCorpus 2014 dataset, i.e., on average over 138 new entities emerge per day. Looking at the distribution at which these entities emerge over time, we observe in Figure 3.3 that they do not emerge uniformly. In particular, Figure 3.3 shows a gradual increase of emerging entities between the start of our dataset and May 2012, at which point it peaks. The subsequent gap can be explained by the absence of the news stream during that time (see also Figure 3.2).

The core units in our analysis are so-called entity document mention time series, i.e., time series that represent the number of documents that mention an entity over time (see, e.g., Mars Curiosity's document mention time series in Figure 3.1). These time series are characterized by several properties. First, the time series are of variable length: each entity's time series starts at the first mention of the entity in our dataset, and ends at the day the Wikipedia page for that entity was created. For some entities, the time series may span several days, whereas others may span months. Second, the time series in our dataset are not temporally aligned. They exhibit different absolute timings, where the date of the first mention (i.e., the start of the time series) and last mention (i.e., the end of the time series) varies between emerging entities.
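Constructing these per-entity daily series from the annotated stream could look as follows (a sketch, assuming the annotations are available as (entity, document date) pairs, one per mentioning document, already restricted to the span between first mention and page creation):

from collections import defaultdict, Counter

def mention_time_series(annotations):
    """annotations: iterable of (entity_id, doc_date) pairs, one per
    mentioning document. Returns {entity_id: Counter({date: n_docs})},
    i.e., the number of mentioning documents per day, per entity."""
    series = defaultdict(Counter)
    for entity, doc_date in annotations:
        series[entity][doc_date] += 1
    return series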
Figure 3.3: Distribution of emerging entities in our dataset (n = 79,482). The x-axis shows the timespan of our dataset in days; the y-axis shows the number of entities that emerge in our dataset on that day (i.e., have their Wikipedia pages created).

3.4. Methods

Our first research question, Can we discover distinct groups of differently emerging entities by clustering their emergence time series? (RQ1), revolves around discovering common emergence patterns. We apply a time series clustering method for discovering groups of entities with similar emergence patterns. In this section, we first describe our time series clustering method, and explain and motivate the choices for representing the emerging entities' time series. After that, we describe the general time series analysis methods that we use.

Clustering time series consists of three steps. First, we need to normalize our time series, as they might span very different periods of time. Next, we measure the similarity between time series by applying a similarity metric. And third, we apply a hierarchical agglomerative clustering method to group entities into groups with similar emergence patterns.
Normalization
As described in Section 3.3.2, the time series in our dataset might span different periods and are not temporally aligned. For these reasons, we cannot rely on time series analysis and modeling methods that leverage aligned time series or seasonal patterns. Because of the variable lengths, we cannot leverage similarity methods that assume a correspondence between the data points of two time series, such as, e.g., Euclidean distance. Furthermore, to be able to visualize clusters and groups of similar time series, we linearly interpolate the time series to have equal length [214]. Finally, as we are interested in the similarity of emergence patterns, not in individual differences between popular and long-tail entities (i.e., absolute numbers of mentions), we standardize all time series by subtracting the mean and dividing by the standard deviation, to account for the differences in volume/popularity [253].
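A minimal sketch of these two normalization steps (linear interpolation to a common length, then z-standardization), using numpy; the target length is an illustrative parameter:

import numpy as np

def normalize(series, target_len=572):
    """Linearly interpolate a variable-length series (of at least two
    points) to target_len points, then standardize to zero mean and unit
    variance."""
    series = np.asarray(series, dtype=float)
    x_old = np.linspace(0.0, 1.0, num=len(series))
    x_new = np.linspace(0.0, 1.0, num=target_len)
    stretched = np.interp(x_new, x_old, series)
    std = stretched.std() or 1.0  # guard against constant series
    return (stretched - stretched.mean()) / std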
Figure 3.4: Emergence pattern of Curiosity (rover). (Best viewed in color.)
Burst Similarity
Typically, time series similarity metrics rely on fixed-length time series, and leverage seasonal or repetitive patterns [154]. But as noted above, our time series are of variable length, and not temporally aligned. For this reason, common time series similarity metrics such as Dynamic Time Warping (DTW) are not applicable [23]. We are interested in the moments at which the attention or focus around an entity in public discourse increases, i.e., we are looking for periods of higher activity. These so-called time series "bursts" may be correlated with real-world activity and events around the entity.

To address the nature of our time series as well as our focus on bursts, we turn to BSim (Burst Similarity) [253] as the similarity metric we leverage to compare time series. It relies on detecting bursts, and uses the overlap in bursts between time series as the notion of similarity.

For burst detection, we compute a moving average for each (raw) entity document mention time series (T_e), denoted T_e^MA. We set the parameter w, indicating the size of the rolling window, to 7 days. Bursts are the points in T_e^MA that surpass a cutoff value c, which we set to a multiple of σ_MA, the standard deviation of T_e^MA. These parameter choices for w and c are in line with previous work [253]. Figure 3.4 shows an example time series (T_e), with the bursts detected for the emerging entity Curiosity (rover). The detected bursts correspond to the earlier described launch and landing of the Mars rover.
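A sketch of the moving-average burst detection, with the window w = 7 days from the text; the exact cutoff multiplier follows [253] and is left as a parameter here:

import numpy as np

def detect_bursts(series, w=7, c_mult=1.5):
    """Return indices where the w-day moving average exceeds the cutoff
    c = c_mult * std(moving average). c_mult is a tunable assumption."""
    series = np.asarray(series, dtype=float)
    ma = np.convolve(series, np.ones(w) / w, mode="same")  # moving average
    cutoff = c_mult * ma.std()
    return np.where(ma > cutoff)[0]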
Hierarchical Agglomerative Clustering
Now that we can measure similarity between time series, we need to identify clusters of similar time series. To this end, we compute the similarity matrix SM with all pair-wise burst similarities. More specifically, to cluster emerging entities, we first normalize SM, and then convert it to a distance matrix DM (DM = 1 − SM). Next, we apply hierarchical agglomerative clustering (HAC) on DM using the fastcluster package [191] to discover groups of similar time series at different levels of granularity. We apply Ward's method as our linkage criterion. Ward's method is an iterative approach, where one starts with singleton clusters, and aims to merge the pair of samples that maximally decreases the within-cluster variance at each successive iteration [257].
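A sketch of this clustering step with fastcluster and Ward linkage; the matrix normalization and the level at which the tree is cut are illustrative choices:

import numpy as np
import fastcluster
from scipy.cluster.hierarchy import fcluster
from scipy.spatial.distance import squareform

def cluster_entities(sim_matrix, n_clusters=2):
    """HAC (Ward linkage) on the distance matrix DM = 1 - SM.
    sim_matrix is assumed square and symmetric."""
    sim = sim_matrix / np.linalg.norm(sim_matrix)  # normalize SM (illustrative)
    dist = 1.0 - sim                               # DM = 1 - SM
    np.fill_diagonal(dist, 0.0)
    # fastcluster mirrors scipy's linkage API on a condensed distance matrix.
    linkage = fastcluster.linkage(squareform(dist, checks=False), method="ward")
    return fcluster(linkage, t=n_clusters, criterion="maxclust")

Cutting the resulting tree at two and at six clusters corresponds to the Level 1 and Level 2 views analyzed in Section 3.5.1.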
Throughout this chapter, we take an exploratory approach to analyzing, visualizing, and comparing patterns of groups of time series to discover meaningful clusterings [254]. We apply different grouping strategies, e.g., implicit groups, such as the clusters we find by using our clustering method, or explicit groups, e.g., entities that emerge either in social media or in news streams.

We apply two analysis methods to study different groups of time series: (i) visualization of group signatures that represent common emergence patterns within a group of time series, and (ii) analysis of descriptive statistics that reflect properties of the underlying time series.
Visualization
To compare patterns and trends across different groups of emerging entities, we visualize and compare the so-called group signatures, i.e., the average of all time series that belong to a group. See Figure 3.6 below for an example of the group signature of all emerging entities in our dataset (n = 79,482).

Two challenges specific to the time series that we study in this chapter arise when visualizing them. First, their duration: the time series we study in this chapter (may) differ in length, as emergence durations differ between entities. Second, their alignment: the time series are not temporally aligned, as their start (i.e., x = 0) is marked when the entity is first mentioned in the online text stream, and their end at the time at which the entity is incorporated into Wikipedia. For these reasons, we linearly interpolate the time series to the (overall) highest emergence duration, effectively "stretching" them to have equal length. Next, we align them in relative duration, i.e., the first and last mentions for each entity are set at the start and end of the x-axis, respectively. This allows us to visualize both the clusters themselves and the corresponding cluster signatures.

Descriptive Statistics
While studying group signatures of time series allows us to discover similar patterns and study and compare broad patterns and trends, they do not paint the full picture. More fine-grained properties of emergence patterns, e.g., the average emergence duration (the time between a new entity's first mention in the text stream and its subsequent incorporation into Wikipedia) and the emergence volume (the total number of documents that mention the entity as it emerges), are difficult to convey through visualization alone. In order to study these aspects, we represent the groups of time series through different descriptive features that reflect the emergence and burst behavior of the time series that belong to a group. For an overview of the emergence and burst statistics that we consider, see Table 3.2.
Table 3.2: Descriptive statistics used for analyzing and comparing different groupings of emerging entities. We distinguish between time series statistics (top three) and burst statistics (bottom three). All statistics are computed for the period ranging from the emerging entity's first mention in the corpus to the creation date of the Wikipedia page devoted to it. The burst durations and values are computed over the normalized time series (see Section 3.4.1).

Emergence volume     Number of documents that mention the entity
Emergence duration   Number of days from first mention to incorporation
Emergence velocity   Volume / Duration (average number of documents per day)
Burst number         Total number of bursts
Burst duration       Normalized average duration of bursts (i.e., burst widths)
Burst value          Normalized average height of bursts (i.e., burst heights)
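To make the statistics in Table 3.2 concrete, here is a minimal sketch of their computation for one entity. It assumes daily mention counts and a burst detector whose output is represented in a hypothetical (start, end, height) format; the chapter's own burst detection (Section 3.4.1) is not reproduced here.

```python
import numpy as np

def emergence_statistics(daily_counts, bursts):
    """Compute the Table 3.2 statistics for one emerging entity.

    `daily_counts`: documents mentioning the entity per day, from first
    mention to Wikipedia incorporation.
    `bursts`: list of (start, end, height) tuples from a burst detector
    applied to the normalized time series (hypothetical format).
    """
    duration = len(daily_counts)            # emergence duration (days)
    volume = int(np.sum(daily_counts))      # emergence volume (documents)
    velocity = volume / duration            # emergence velocity (docs/day)
    n_bursts = len(bursts)
    # Burst widths are expressed relative to the length of the series.
    burst_duration = (np.mean([(e - s) / duration for s, e, _ in bursts])
                      if bursts else 0.0)
    burst_value = np.mean([h for _, _, h in bursts]) if bursts else 0.0
    return dict(volume=volume, duration=duration, velocity=velocity,
                n_bursts=n_bursts, burst_duration=burst_duration,
                burst_value=burst_value)
```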
Chi-square Goodness of Fit Test
In Section 3.5.3 we compare the distribution over entity classes per online text stream (i.e., comparing the entity classes in the social media stream to those in the news stream). To assess whether the differences between these class distributions are statistically significant, we apply a chi-square goodness of fit test to both distributions. In addition, we rank the classes by their contribution to the difference, using chi-grams, i.e., we compute for each class: (observed − expected) / √expected, where observed corresponds to the number of entities that belong to a particular class in one set (e.g., the entities that emerge in social media) and expected corresponds to the number of entities that belong to that class in the global population (i.e., all entities that emerge in our dataset).
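A minimal sketch of this test and ranking, using scipy.stats.chisquare; the helper name and input arrays are hypothetical:

```python
import numpy as np
from scipy.stats import chisquare

def chi_gram_ranking(observed, expected):
    """Chi-square goodness of fit plus per-class chi-gram contributions.

    `observed`: entity counts per class in one stream (e.g., social media).
    `expected`: counts per class under the global distribution.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    expected *= observed.sum() / expected.sum()   # chisquare requires equal totals
    stat, p_value = chisquare(observed, expected)
    # Chi-grams: (observed - expected) / sqrt(expected), per class.
    chi_grams = (observed - expected) / np.sqrt(expected)
    return stat, p_value, np.argsort(-chi_grams)  # classes by contribution
```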
Results

In this section, we present the analyses that answer our three research questions. First, we study the time series clusters that result from our clustering method in Section 3.5.1. Next, in Section 3.5.2, we study similarities and differences between emergence patterns in social media and newswire streams. In Section 3.5.3, we study the underlying entity types and their emergence patterns.

The first research question we aim to answer, Can we discover distinct groups of differently emerging entities by clustering their emergence time series? (RQ1), is at the core of our study into emerging entities. Finding similar patterns and studying how an entity appears in online text streams before it is added to Wikipedia allows us to gain insights into the circumstances in which entities emerge, i.e., in which they are deemed entities of interest.
Table 3.3: Global time series and burst descriptive statistics.

Emergence duration   245 ± 153 days (med. 221)
Emergence volume     183 ± 1,180 documents
Emergence velocity   0.87 documents/day
Burst number         3.8
Burst duration       0.03 (normalized)

We first study the global emergence patterns, by taking the time series that are at the root node of the tree (i.e., the first level of the tree, denoted Top level in Figure 3.5). Then we study the main two clusters, at the next level of the tree (denoted Level 1 in Figure 3.5). Finally, we move down another level in the tree and study the set of six clusters at Level 2 in Figure 3.5.
Global Emergence Pattern
First, we examine the emergence pattern of the global average, by taking the group at the root node of the cluster tree (i.e., all time series in our dataset, where n = 79,482). Figure 3.6 shows the global emergence signature, and Table 3.3 shows the associated descriptive statistics.

Figure 3.5: Dendrogram resulting from applying hierarchical agglomerative clustering using BSim [ ] similarity, on our corpus of emerging entity time series (n = 79,482). The three cutoff-points at which we analyze the clusters are denoted Top level, Level 1 (2 clusters), and Level 2 (6 clusters). For clarity, the tree is truncated by showing no more than 7 levels of the hierarchy.

Figure 3.6: Global cluster signature (of all emerging entities), where n = 79,482, i.e., the top level in the dendrogram in Figure 3.5. The axes are not labeled since all time series values (i.e., document counts) are standardized, and the series are linearly interpolated to have equal length. The solid line represents the cluster signature (i.e., the average time series), the lighter band represents the standard deviation.

Figure 3.6 shows how both the emerging entities' introduction into public discourse (the first mention, at the left-most side of the plot) and subsequent incorporation into Wikipedia (the right-most side of the plot) occur in bursts of documents; i.e., overall, the largest numbers of documents that mention a newly emerging entity appear either at the start or at the end of the time series. This can be explained as follows. The entrance into public discourse represents the first emergence of an entity, whereas being added to Wikipedia is likewise likely to happen in a period of increased attention, e.g., around a real-world event that puts the entity in public discourse. Between these two bursts, the number of documents that mention the entity seems to increase gradually as time progresses, suggesting that, on average, the attention an entity receives in public discourse increases over time before it reaches "critical mass," i.e., before the entity is deemed important enough to be incorporated into the KB.

Next, we look at the global emergence pattern's descriptive statistics; see Table 3.3. Here, we see how on average it takes 245 days between an emerging entity's first mention and its addition to the KB. In these 245 days, an emerging entity is mentioned in 183 documents on average. However, both the emergence durations and emergence volumes show large standard deviations (of 153 days and 1,180 documents, respectively), indicating that they differ substantially between entities, further motivating our clustering approach to zoom into the different underlying patterns. The emergence velocity shows that, on average, an entity is mentioned in less than one document a day.

To better understand how the documents that mention an emerging entity are distributed over time, e.g., whether they are concentrated in a few bursts or spread out more uniformly over the timeline, we turn to the burst statistics in Table 3.3. On average, an entity is associated with 3.8 bursts, indicating that entities are likely to resurface multiple times in public discourse before being deemed important enough to be added to the KB.
The average burst durations span a mere 3% of an emerging entity's time series, indicating that emerging entities spend the majority of their time "under the radar." The heights of these bursts (i.e., burst values) show a comparatively large standard deviation, suggesting that the heights differ substantially between entities and bursts.

In summary, globally, entities experience a long time span between surfacing in online text streams and being incorporated into the KB; they are associated with multiple bursts and thus display resurfacing behavior. Finally, the large standard deviations seen in the descriptive statistics suggest that entities show large variations in terms of their emergence patterns. Below, we study whether grouping the time series of new entities by similar burst patterns allows us to find groups of broadly similarly emerging entities in terms of their group signatures and their emergence and burst features.

Clusters at Level 1 in Figure 3.5: Early vs. Late Bursts
In our first attempt at uncovering distinct patterns in which entities emerge, we look at the first two main clusters that appear at Level 1 of the cluster tree in Figure 3.5. The resulting cluster signatures are shown in Figure 3.7.

Figure 3.7: Cluster signatures of the early bursting entities (left plot, EB, n = 31,589, 39.7%) and late bursting entities (right plot, LB, n = 47,893, 60.3%) clusters, denoted Level 1 in Figure 3.5. The solid lines represent the cluster signature (average), the lighter bands represent the standard deviation.

Much like the global cluster signature we studied in the previous section, the clusters at Level 1 show two main bursts: the initial burst around the first mention, and the final burst around the last mention before an entity is added to Wikipedia. However, as we will show, the left cluster, which we call early bursting (EB) entities, is characterized by a stronger initial burst, with the majority of the documents that mention the entity concentrated around the first mention, when the entity surfaces in the online text streams. This suggests that the cluster contains new entities that suddenly emerge and experience a (brief) period of lessened attention before being incorporated into the KB. The right cluster, which we denote late bursting (LB) entities, shows a more gradual increase in activity towards the point at which the entity is added to the KB, much like we saw in the global signature (Figure 3.6).

Looking at Figure 3.7, we note two main differences between the group signatures of the early bursting and late bursting entities. First, the way documents are distributed between the initial and final burst. The EB entities cluster shows a more "abrupt" final burst: the signature shows the majority of the documents in the wake of the initial burst, i.e., at the left-hand side of the plot. In the next phase, the volume of documents gradually winds down into a relatively quiet period, which finally transitions abruptly into the final burst at the right-hand side of the plot. In contrast, the LB entities cluster shows a relatively subtle initial burst, which, like in the EB cluster, appears to quiet down, followed by a gradual increase of document volume that leads up to the final burst. A second difference is the height difference between the initial and final bursts. The EB cluster shows roughly equally high initial and final bursts. The LB cluster shows a substantially smaller initial burst, which suggests the introduction into public discourse is comparatively subtler than the addition to the KB, i.e., these entities may emerge "silently," suggesting they are less central, more niche, and less widely supported. In summary, the first split separates the entities that are associated with a strong "initial" emergence in online text streams from the entities that more gradually build up to the moment at which they are incorporated into the KB.

Next, to better understand the different characteristics of the two clusters, we study the descriptive time series and burst statistics of the time series in the early bursting (EB) and late bursting (LB) entity clusters in Table 3.4. Before we proceed, we determine whether the differences between the statistics of the two clusters are statistically significant. To do so, we perform a Kruskal-Wallis one-way analysis of variance test, also known as a non-parametric ANOVA. Following this omnibus test, we perform a post-hoc test using Dunn's multiple comparison test (with p-values corrected for family-wise errors using Holm-Bonferroni correction). Comparing all descriptive statistics from the two clusters, we find that all differences are statistically significant at the α = 0.05 level.
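A minimal sketch of this test procedure; the chapter does not name an implementation, so the use of scipy and the scikit-posthocs package (which provides Dunn's test with Holm correction) is an assumption:

```python
from scipy.stats import kruskal
import scikit_posthocs as sp  # assumed implementation of Dunn's test

def compare_clusters(eb_values, lb_values):
    """Omnibus + post-hoc comparison of one statistic across two clusters.

    `eb_values`, `lb_values`: per-entity values of a descriptive statistic
    (e.g., emergence duration) in the EB and LB clusters.
    """
    h_stat, p_omnibus = kruskal(eb_values, lb_values)
    # Dunn's multiple-comparison test, Holm-Bonferroni corrected p-values.
    p_posthoc = sp.posthoc_dunn([eb_values, lb_values], p_adjust="holm")
    return h_stat, p_omnibus, p_posthoc
```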
Table 3.4: Comparison of early bursting (EB) and late bursting (LB) entity cluster statistics (mean ± std, median; lost values marked with –).

     proportion   duration (days)        volume (docs)         velocity   bursts   burst value
EB   0.40         224 ± 146 (med. 195)   118 ± 804 (med. 22)   0.70       3.22     0.05
LB   0.60         259 ± 156 (med. 238)   225 ± – (med. –)      –          4.12     0.03

The LB entities emerge more slowly (with an average emergence duration of 259 days, versus 224 for EB entities), and receive more attention (with an emergence volume of 225 documents, versus 118 documents for EB entities) before being deemed important enough to be incorporated into the KB. The EB entities' shorter emergence durations and lower emergence volumes suggest a higher "urgency" or timeliness, suggesting this cluster may contain entities that represent sudden events, e.g., large-scale natural catastrophes or societal events, that will typically be incorporated into Wikipedia soon after they first emerge in public discourse. The descriptive statistics of the LB entities in turn indicate that they comprise less timely or urgent entities, e.g., recurring events, such as sports events, that may appear in public discourse long before being part of the KB.

This view of the slower, less timely LB entities, and the more urgent, fast, and timely EB entities, is supported by the burst statistics. First, the average burst heights of EB entities are higher (at 0.05 on average, versus 0.03 for LB entities), suggesting LB entities see a more evenly spread volume of documents that mention them. Next, EB entities show a lower number of bursts (3.22 on average, versus 4.12 for LB entities). Fewer and higher bursts, together with shorter emergence durations and lower emergence volumes, support the view of more urgent or timely EB entities, i.e., entities that are more quickly (in terms of time and number of documents) incorporated into the KB, while exhibiting burstier patterns. In the next section, we show that entities that emerge in news likewise exhibit higher burst heights and fewer bursts on average, further exploring the notion of higher urgency or importance.

In summary, we find that the two clusters at Level 1 in Figure 3.5 differ substantially and significantly in terms of their emergence patterns and burst properties. Our visual inspection of the cluster signatures and the analysis of burst and emergence features suggest that LB entities emerge more slowly, i.e., build up attention more gradually before being added to the KB, whereas EB entities are associated with more sudden and higher bursts of activity prior to being incorporated into the KB.

Clusters at Level 2 in Figure 3.5
Next, we look at the clusters at Level 2 in Figure 3.5, i.e., the different clusters that make up the LB and EB clusters. For brevity, we refer to these clusters as subclusters. Figure 3.8 shows the signatures of the six subclusters. In general, the distinction between faster EB entities and comparatively slower LB entities remains in the subclusters.
Using the procedure described in the previous section, we test whether the differences between the subclusters are statistically significant. Overall, the properties of time series within each subcluster differ significantly, except for early bursting 1 and late bursting 2a, which may not differ statistically significantly in terms of emergence volume. In terms of burst statistics, some clusters show inconclusive differences: specifically, early bursting 1 and 2a do not differ statistically significantly in terms of burst values, and early bursting 2a and late bursting 2b do not differ statistically significantly in terms of burst durations.

The cluster signatures in Figure 3.8 show that the entities that belong to the LB subclusters (the top three plots) exhibit an increase in document volume in the period leading up to the final burst before entering the KB. This suggests that even when an entity first appears in online text streams with a relatively big burst, its subsequent incorporation into Wikipedia is not instantaneous. Consider, e.g., the previously shown example of the Mars Curiosity (in Figure 3.4); whereas its first burst happens around October 2011, it is added to Wikipedia after a relatively quiet period, 9 months later. Whereas the LB subclusters differ both in the "steepness" of the final burst and in the relative moment of the increase in activity, the three EB subclusters (the bottom three plots) differ mainly in the moment of the increased activity, i.e., the "bump" in the plot.

One outlier exists in the LB entities cluster: late bursting 2a (second plot from the top). Its cluster signature shows a high and sudden burst of activity just before the entity is incorporated into Wikipedia, seemingly omitting the period of gradually increasing attention between the initial and final bursts seen in the other two subclusters. Its emergence features in Table 3.5 show that, with an average duration of 146 days before an entity is added to the KB, it is almost twice as fast as the global average. Both the cluster signature and the average number of bursts (1.6 on average) show that these entities exhibit a single, steep, high-volume burst before being incorporated into the KB. Compared to the other two subclusters in the same LB group, the number of bursts is substantially lower, and the average burst values are substantially higher. This suggests that the entities in the LB 2a cluster comprise abruptly or suddenly emerging entities, like events; and indeed, upon closer inspection, events such as the World Music Festival Chicago, the 2013 British Grand Prix, the 2012 Volvo World Match Play Championship, and the 2012 Sundance Film Festival belong to this cluster.

With the exception of this outlier, the entities in the LB subclusters show longer emergence durations and substantially higher emergence volumes on average, further supporting our distinction of slowly emerging entities. In particular, late bursting 1 and late bursting 2b (the third and first plot from the top, respectively), with 242 and 227 documents on average and relatively long emergence durations (255 and 294 days on average, respectively), suggest that these clusters contain slowly emerging niche entities that are not as widely supported by "the masses" as, e.g., the entities that are part of the LB 2a cluster.
Manual inspection of the subclusters confirms this, revealing entities such as Summerland Secondary School, but also a long tail of person entities (e.g., AFL athlete Rodney Filer, jazz musician Kjetil Møster) in LB 1, whereas entities that represent more "substantive" events are in LB 2a, e.g., the 2012 Benghazi attack, the Greek withdrawal from the Eurozone, and the Incarceration of Daniel Chong.
Figure 3.8: Signatures of the clusters at Level 2 in Figure 3.5: EB 1 (n = 16,469, 20.7%), EB 2a (n = 4,060, 5.1%), EB 2b (n = 11,060, 13.9%), LB 1 (n = 21,336, 26.8%), LB 2a (n = 5,507, 6.9%), and LB 2b (n = 21,050, 26.5%). Solid lines represent the cluster signatures (average of all samples in the cluster), lighter bands represent the standard deviation.
Table 3.5: Comparison of the entity cluster statistics for clusters at Level 2 in Figure 3.5 (mean ± std, median; lost values marked with –).

        proportion   duration (days)        volume (docs)         velocity
EB 1    0.21         238 ± 148 (med. 223)   145 ± 868 (med. 28)   0.83
EB 2a   0.05         – ± 116 (med. 191)     74 ± – (med. –)       –
EB 2b   0.14         – ± 150 (med. 164)     94 ± 363 (med. 19)    0.65
LB 1    0.27         255 ± 148 (med. 230)   242 ± – (med. –)      –
LB 2a   0.07         146 ± 121 (med. 110)   155 ± 557 (med. 29)   1.07
LB 2b   0.27         294 ± 158 (med. 297)   227 ± – (med. –)      –

Summary
In this section, we answer our first research question: "Can we discover distinct groups of differently emerging entities by clustering their emergence time series?" We performed hierarchical clustering using a burst-similarity metric on the emerging entity time series, and discovered two distinct emergence patterns: early bursting entities and late bursting entities. Both the visual inspection of the cluster signatures and the analysis of the descriptive statistics of the time series in the clusters support the same findings: the EB entities are characterized by fewer but higher bursts, with shorter emergence durations and lower emergence volumes. The LB entities, on the other hand, emerge more slowly, with a more gradual increase of exposure in the online text streams before reaching the point at which the entity is added to the KB. This can be seen both in the cluster signatures and in the descriptive statistics, which show longer durations, higher volumes, and a larger number of bursts.
In this section, we answer our second research question: "Do news and social media text streams exhibit different emergence patterns?" The news and social media document streams represent content from different sources: the news stream consists of traditional online news sources, where the content is mostly written by professional journalists; the social media stream contains mostly user-generated content, and consists of, e.g., forums and blog posts, but also content that was shared through other social media platforms such as Twitter and Facebook (through the bit.ly URL-shortener service). Studying whether different sources surface different entities, and exhibit different emergence patterns, can provide insights into how the nature of the source may affect the discovery process.

In the previous section, we have shown that 79,482 entities emerge in the combined news and social media streams. Taking a closer look by splitting out these entities by stream, we find that 51,095 of these entities emerge in the news stream (i.e., are mentioned in the news stream), similar to the number of entities that emerge in the social media stream, at 51,356. Finally, 30,148 of the emerging entities are mentioned in both streams before being incorporated into Wikipedia.

Figure 3.9: News vs. social stream cluster signatures. The top row shows the global cluster signature of the news (left, n = 51,095) and social (right, n = 51,356) streams. The bottom two rows show the signatures of the late bursting and early bursting entity clusters for each stream (news left: EB n = 25,532, 50.0%, LB n = 25,563, 50.0%; social right: EB n = 24,960, 48.6%, LB n = 26,396, 51.4%).

In order to answer our second research question, we consider (i) the emergence patterns of entities that emerge in social media and news streams, (ii) entities that appear in both streams, and (iii) entities that appear in only one of the two streams.
Global: News vs. Social
First, we compare the emergence patterns of entities in news and social streams. We apply the same hierarchical clustering method as in Section 3.5.1 on the two subsets of entities that emerge in the news and social media streams (where n_news = 51,095 and n_social = 51,356).

Unsurprisingly, the emergence patterns are largely the same in the two streams and highly similar to the global patterns studied in Section 3.5.1. Both the news and social streams exhibit the same general global emergence pattern, witnessed by the largely similar clusterings we yield. Both streams exhibit groups that are similar to the early bursting and late bursting entities discovered in the previous section (shown in Figure 3.7).

Table 3.6: Global time series and burst statistics per type of text stream (mean ± std, median; lost values marked with –).

         duration (days)        volume (docs)         velocity
news     216 ± 147 (med. 170)   123 ± 579 (med. 28)   0.75
social   – ± 153 (med. 211)     65 ± 239 (med. 19)    0.48

The same testing procedure (with corrected p-values) used in Section 3.5.1 shows that the differences in the descriptive statistics between both streams are statistically significant at the α ≤ 0.05 level. With 216 days on average, entities emerge in news streams more quickly than entities in the social media stream. These shorter emergence durations come with higher emergence volumes on average: an entity that emerges in news is mentioned on average in 123 documents between its initial and final mention, nearly double the number of documents in social media (65). Recall that the total number of documents in the social media stream is larger, at 5.3M documents (versus 1.8M documents for the news stream; see also Figure 3.2). The higher number of documents with comparatively shorter emergence durations further supports the observation that emerging entities in news are picked up quicker than those that emerge in social media.

Furthermore, in the previous section, we have seen how early bursting entities exhibit fewer but higher bursts, and reasoned that they represent more "urgent" or timely entities. The emergence features of entities emerging in news support this notion of timeliness or urgency: they exhibit higher and fewer bursts on average.

In summary, we have shown that entities emerging in news and social media show broadly similar patterns, with both the cluster signatures and descriptive statistics being similar to the emergence patterns in the combined text streams in Section 3.5.1. We have also shown that news streams seem to surface entities more quickly than social media streams, which we attribute to the different nature of the streams (professional and authoritative versus unedited and user-generated). In Section 3.5.3 we revisit this hypothesis by studying how the types of emerging entities differ between streams.

Who's First?
Of the 79,482 entities that emerge in the 18-month period our dataset spans, 30,148 appear in both the news and social media streams; 20,947 entities are mentioned exclusively in the news stream, never appearing in social media (news-only) between surfacing in online text streams and being incorporated into the KB. Finally, 21,208 appear only in the social media stream (social-only). See also Table 3.7.

Of the 30,148 entities that emerge in both streams, the majority appears in the social media stream before they appear in the news stream. This may be explained by the nature of the publishing cycles of the two streams: whereas traditional newswire has a more thorough publishing cycle, where stories need to be checked and edited before being published, social media (in particular forums and blog posts) follows a more unedited and direct publishing cycle.

The entities that appear in the social media stream first, which we denote social-first, cover 62.9% (n = 18,967) of the entities that emerge in both streams. The opposite pattern, where entities appear in news before they appear in social media (news-first), comprises 29.1% of the entities that emerge in both streams (n = 8,794). Entities that emerge in social media first are subsequently picked up by the news stream faster than vice versa: it takes a news-first entity on average 66 days to appear in the social media stream after surfacing in the news stream, whereas the other way around takes 49 days. The remaining, comparatively small number of entities are mentioned in both streams on the same day (same-time): 7.9% (n = 2,387). The latter group of entities is expected to emerge more quickly, by virtue of appearing more widely in public discourse, and hence being more urgent and central.

The burst and emergence descriptive statistics, as recorded in Table 3.7, support this view: the same-time entities show comparatively short emergence durations. With an average of 197 days between being first mentioned and being added to Wikipedia, same-time entities emerge substantially quicker than entities that appear in both streams at different times (at 298 days and 281 days on average for news-first and social-first entities, respectively). Furthermore, the same-time entities exhibit higher emergence volumes too, accounting for the highest overall velocity at 2.87 documents per day, further supporting the hypothesis that these entities are more urgent and central.

Entities that emerge in only one of the two media streams show shorter emergence durations than those that appear in both but at different times (at 250 and 214 days for news and social media, respectively). This can be explained by the fact that a longer duration makes it more likely for an entity to cross from one stream to the other. Similarly, the longer average emergence durations of the news-first and social-first entities are paired with a larger number of bursts (around 4 on average, versus 3 on average for entities that emerge in either the news or social media stream). Much like what we saw in the previous section between late bursting and early bursting entities, the shorter durations and smaller number of bursts suggest that entities that emerge in a single stream behave more like the early bursting entities, i.e., they are comparatively more urgent or timely.
Table 3.7: Time series and burst statistics for entities that emerge in either or both streams (first). Burst and emergence features for our five groups of entities: entities that emerge in both streams but are first mentioned in the news stream (news-first), entities that emerge in both streams but are first mentioned in the social media stream (social-first), entities that emerge in both streams and appear in both on the same day (same-time), entities that emerge only in the news stream (news-only), and finally, entities that emerge only in the social media stream (social-only). Values are mean ± std and median; lost values are marked with –.

               duration (days)        volume (docs)         velocity
news-first     298 ± 139 (med. 305)   123 ± 291 (med. 53)   0.58
social-first   281 ± 157 (med. 276)   182 ± 445 (med. 74)   0.95
same-time      197 ± 147 (med. 163)   192 ± 662 (med. 67)   2.87
news-only      250 ± 152 (med. 216)   415 ± – (med. –)      –
social-only    214 ± 148 (med. 190)   33 ± 134 (med. 12)    0.41

Summary
News and social media streams show broadly similar emergence patterns for entities. However, while the patterns may be similar, the population and the behavior of entities emerging in news and social media differ significantly. More specifically, entities emerging in the social media stream seem to do so more slowly on average than in news. Looking in more detail at the interactions between the two streams, we notice that entities that appear in both streams on the same day are the fastest to be incorporated into the KB. Furthermore, we find that entities that first emerge in social media are more quickly picked up in news streams than vice versa.
In this section, we answer our third research question: "Do different types of entities exhibit different emergence patterns?" First, we analyze the descriptive statistics of each entity type in our dataset, to assess whether different types of entities exhibit different behavior in terms of how they emerge in online text streams. Next, we study whether entity types are distributed differently over the news and social media text streams.
Figure 3.10: Type signatures of the Person (n = 21,295, 26.8%), Organization (n = 5,606, 7.1%), VideoGame (n = 505, 0.6%), and Building (n = 1,067, 1.3%) types. Even though the numbers of entities of each type differ substantially, the signatures show roughly similar patterns.
Entity Types: Temporal Patterns
First, we study each entity type in isolation, i.e., we study the descriptive statistics per entity type. Table 3.8 shows all entity types with a frequency of ≥ in our dataset. We find that the entity type signatures (i.e., the average over all time series for each entity type) are highly similar to the global pattern (visualized in Figure 3.6), suggesting that the time series of mentions are highly variable within an entity type. To illustrate this, see Figure 3.10 for an example of two common entity types (top row) and two less frequently emerging types (bottom row). Whereas the signature becomes smoother as the number of entities increases, the overall pattern is highly similar across the four types.

In contrast, the descriptive statistics per entity type do yield clear patterns, as we describe next. First, the null class, i.e., the entities that are not assigned an entity type in DBpedia, exhibits very low emergence volumes, with an average of 98 documents over 225 days. This may be explained by their nature: as they are not assigned a class in the DBpedia ontology, they are likely to be very long-tail, or unpopular, entities.

Second, we note a group of "fast" emerging entity types, i.e., those with short emergence durations and/or high emergence velocities, e.g., DesignedArtifact, CreativeWork, MusicalWork, and VideoGame. In particular, the DesignedArtifact type shows high emergence velocities: it takes entities of this type on average 217 days to be incorporated into the KB, with an average volume of over 7 documents a day (versus 0.87 for the global average). The DesignedArtifact type includes entities such as devices and products, e.g., smartphones, tablets, and laptops. The relatively fast transition of the entities of this type may be explained by their nature: they have short "life-cycles" and may be superseded or replaced at high frequencies. Consider, e.g., the release or announcement of a new smartphone: this event typically generates a lot of attention in a short timeframe, which may result in a fast emergence.
Table 3.8: Descriptive statistics per entity type (for types that occur ≥ in our dataset).

Similar to devices of the DesignedArtifact type, creative works (CreativeWork, including MusicalWork, WrittenWork, Movie, etc.) share this characteristic: they play a central but short-lived role in public discourse.

Third, the "slower" entities, i.e., those with longer emergence durations and lower emergence volumes, are largely person types such as writers (Writer), artists (Artist), and political figures (OfficeHolder), but also schools (School and EducationalInstitution) and geographical entities (e.g., Building, ArchitecturalStructure, Place, and PopulatedPlace). These entities by their nature may have longer life-cycles, with a more gradual "rise to fame" (politicians, artists), and play a less central role in public discourse (schools, buildings). The opening of a new school may appear briefly in regional and local news sources, but is unlikely to be globally and widely reported. Politicians generally have long and gradual careers, surfacing, e.g., in regional media, and do not suddenly "burst" into existence.

To better understand the difference between "fast" and "slow" entities, we examine the popularity of entities. Table 3.9 lists the average number of pageviews received per entity in 2015, grouped by type.

Table 3.9: "Popularity," i.e., the average total number of pageviews in 2015 of each entity in our dataset, aggregated per entity type and ranked in descending order (the top-ranked type is Movie, at 98,387 pageviews on average).

Entity types that exhibit short emergence durations and high velocities are all in the top 10 (ranks 3, 4, and 9 for VideoGame, CreativeWork, and DesignedArtifact, respectively), whereas the slower entity types all reside towards the lower ranks of the table, e.g., ranks 19, 22, and 24 for Building, EducationalInstitution, and School, respectively. This suggests that entity types that emerge more quickly remain more popular over time.
Entity Types per Stream
To determine whether the entity types observed in news and social media streams differ significantly, we apply Pearson's chi-squared test. This allows us to identify entity types that are observed more frequently than expected (w.r.t. the global distribution) in the news and social media streams; see Table 3.10.

Table 3.10: Entity types ranked by how much their observed frequency exceeds their expected frequency, using chi-grams. The popularity rank from Table 3.9 is shown in brackets.

More in news than social          More in social than news
Person (12)                       DesignedArtifact (9)
Athlete (14)                      VideoGame (3)
Organization (15)                 Infrastructure (20)
Place (21)                        Book (10)
InformationEntity (5)             TelevisionShow (2)
CreativeWork (4)                  Software (6)
Company (16)                      School (24)
PopulatedPlace (23)               MusicalWork (13)
OfficeHolder (17)                 WrittenWork (11)
ArchitecturalStructure (18)       Movie (1)

The majority of the entity types observed more frequently in social media streams (right column of Table 3.10) are the ones we identified as comparatively fast to emerge in the previous section, e.g., DesignedArtifact, VideoGame, TelevisionShow, and Software. The (average) popularity rankings from Table 3.9 show that 6 out of 10 of the entity types observed more frequently in social media are among the top 10 most popular entity types of Table 3.9; the average popularity rank of the social media entity types is 9.9. At the other end, the types of entities seen more frequently in news are both slower in transition on average (e.g., Person, Organization, and Company) and characterized by being more "general" or less niche than the entity types seen more frequently in the social media stream. Video games or smartphones are likely to see more exposure in social media streams, e.g., in blog and forum posts, whereas more general entity types such as people, places, and organizations are more natural subjects for (traditional) news media. Merely 2 entity types that are more often seen in news streams are in the top 10 most popular entity types in Table 3.9; the average popularity rank of news-specific entity types is 14.5. These observations suggest that entities that remain popular over time are more likely to emerge in the social media stream.
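A minimal sketch of this per-stream test, using scipy's chi-squared test on a streams-by-types contingency table; the counts below are illustrative placeholders, not the thesis's actual values:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: streams (news, social); columns: entity types (placeholder counts).
counts = np.array([[1200, 300, 150],    # news
                   [900, 700, 400]])    # social
chi2, p, dof, expected = chi2_contingency(counts)
# Types observed more often than expected in a stream have positive
# standardized residuals (the chi-gram idea applied per cell).
residuals = (counts - expected) / np.sqrt(expected)
```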
Summary
In summary, we have shown that different entity types exhibit substantially different emergence patterns, while entities that belong to the same type show broadly similar emergence patterns. Furthermore, we have shown that different entity types are distributed distinctly over different online text streams, which can be intuitively explained by looking at both the nature of the entity types and the nature of the streams. Next, we have seen that entities that emerge fast are more likely to remain popular over time. Finally, we have seen that, compared to news streams, social media streams surface entity types that emerge faster and more often remain popular over time.
Conclusion

In this chapter, we studied entities as they emerge in online text streams. We did so by studying a large set of time series of mentions of entities in online text streams before they are added to Wikipedia. We studied implicit groups of similarly emerging entities by applying a burst-based agglomerative hierarchical clustering method, and explicit groups by isolating entities by whether they emerge in news or social media streams. Next, we summarize the findings, implications, and limitations of our study into the nature of emerging entities in online text streams.
In Section 3.5.1, we applied a clustering method to the time series of mentions of emerging entities to find implicit groups of similarly emerging entities. We found that, globally, entities have a long time span between surfacing in online text streams and being incorporated into the KB. During this time span, an emerging entity is associated with multiple bursts (i.e., it resurfaces in public discourse); however, both the emerging entity's introduction into public discourse and its subsequent incorporation into Wikipedia coincide with the largest document bursts. Emergence durations and volumes show large standard deviations, indicating that they differ substantially between entities. For this reason, we turned to time series clustering to uncover distinct groups of entities. We discovered two distinct emergence patterns: early bursting (EB) entities and late bursting (LB) entities. Analysis suggests that EB entities comprise mostly "head" or popular entities; they exhibit fewer and higher bursts, with shorter emergence durations and lower emergence volumes. The LB entities emerge more slowly on average, and witness a more gradual increase of exposure in online text streams. The emergence patterns we visualized differ substantially from the global average and from, e.g., the type signatures studied in Section 3.5.3, suggesting that the entities in each of the underlying clusters exhibit substantially different and distinct emergence patterns from entities in the other clusters.

In Section 3.5.2, we showed that entities emerging in news and social media streams display very similar emergence patterns, but that, on average, entities that emerge in social media have a longer period between surfacing and being incorporated into the KB. We hypothesize that this can be attributed to the nature of the underlying sources: traditional news media is more mainstream and professional, with a larger audience and reach, and more authority than social media streams. Our findings are in line with those of Petrovic et al. [207], who compare breaking news on traditional media with that on social media. Their findings suggest that reported events overlap largely between both media; however, social media additionally exhibits a long tail of minor events, which may explain the longer uptake on average. Leskovec et al. [149] find that the "attention span" for news events on social media both increases and decays at a slower rate than for traditional news sources, which may explain the comparatively slower uptake on social media.

Finally, we studied entity types in Section 3.5.3. We showed that different entity types exhibit substantially different patterns, but entities of the same type show similar patterns. Some entity types, e.g., devices or creative works, on average emerge faster than entities such as buildings, locations, and people. At the same time, the former, "faster" entity types remain more popular over time (as seen through their pageview counts). One aspect that distinguishes "fast" from "slow" entity types is that the former are more likely to appear in so-called "soft news" (i.e., news that covers sensational or human-interest events and topics, e.g., news related to celebrities and cultural artifacts), whereas the slower entity types are more likely to be associated with more substantive "hard news" (i.e., news that encompasses more pressing or urgent events and topics, e.g., reports related to political elections) [246]. Granka [96] studied the differences in "attention span" of the
public (as measured through search engine query volumes) and the traditional news media (as measured through coverage volume) for "hard" and "soft" news, and found that, in line with the findings of Leskovec et al. [149], hard news is associated with a relatively short period of attention from the public (as measured by query volume). Soft news exhibits a slower decrease of the public's attention (as seen through slower declines in query volumes), which supports our finding that faster entity types (entities more likely to be associated with soft news) tend to remain more popular over time. Furthermore, the relatively longer attention for soft news may explain the quicker uptake of the "fast" entity types: "cultural" artifacts (e.g., movies, TV shows, artists) may emerge more quickly, as they are more widely supported, followed, and more strongly represented in our online public discourse.

Finally, we showed how entity types are distributed differently over news and social media streams. This difference in entity types may be explained by the nature of the streams. Partly because of the open, democratic, and user-generated nature of the Web 2.0, and blogging in particular [178], blogs no longer simply pick up news stories from the traditional media; the agenda-setting power of traditional media is diluting [179]. Moreover, there are situations in which breaking news emerges on social media, e.g., the death of Osama Bin Laden [126]. And finally, as mentioned before, the events reported on social media and traditional news overlap, but social media has a long tail of its "own" events [207], which may account for the different distribution of emerging entity types in the social media stream.

Taking a step back, our findings can be summarized in the observation that emerging entities are not "born equal," i.e., the patterns and features under which an entity emerges differ depending on source and type.
The findings in this chapter have implications for designing systems to detect emerging entities, and more generally for studying and understanding how entities emerge in public discourse before they are deemed important enough to be incorporated into the KB. Here, we list observations and findings that have implications for the discovery process of emerging entities. First, we have shown that entities are likely to resurface multiple times in public discourse (i.e., online text streams) before being incorporated into the KB. This means that, on average, it takes multiple moments of "exposure" for an entity to emerge. This suggests that monitoring bursts of newly emerging entities could serve as an effective method for predicting when an entity is about to be incorporated into the KB, i.e., after observing the initial burst, it is not too late. Furthermore, we have shown that the type of stream in which entities emerge (i.e., news or social media) carries different signals that allow us to model the process of entity emergence. More specifically, we have shown how entities that emerge, e.g., in both streams at the same time tend to be added to the KB comparatively quicker than, e.g., those entities that never move from one stream to another. These findings suggest that taking different streams into account separately can be beneficial for detecting emerging entities. Finally, we have shown that the different types of online text streams also surface different types of entities. This provides insights into the differences between "mainstream media" and users of the world-wide web, in terms of preferences and interests, through their "agenda setting."
The work presented in this chapter also has several limitations.
Clusters.
Part of our findings are derived from the clusters that serve as a starting point for discovering common patterns, similarities, and differences in emergence patterns. Clustering and studying cluster signatures is by design a subjective matter [254]. Unsupervised clustering, which entails different hyperparameters, is by definition hard to "evaluate," i.e., to decide whether the resulting clusters show meaningful differences. The large (standard) deviations seen in the descriptive statistics within clusters may suggest there exists a wider variety of different entities. Moreover, the linear interpolation step we took to cluster time series with variable lengths and different relative timespans (i.e., not temporally aligned) will result in varied time series. In our defense, the cluster signatures that result from the clustering method yielded visually discernible and statistically different patterns between clusters, which was not the case for the signatures of the groups of time series from Sections 3.5.2 and 3.5.3 (see, e.g., Figure 3.10). Finally, the dendrogram that results from the clustering suggests there are distinct and meaningful groups of substantially different time series, as its structure shows symmetry and clear separations.
Data.
Next, the fragmented nature of the dataset that serves as the starting point of our study, the TREC-KBA StreamCorpus 2014, means the coverage, and hence representativeness and comprehensiveness, of the data cannot be guaranteed. As can be seen in Figure 3.2, the dataset contains a blind spot around May 2012, which may affect the time series of all emerging entities. To minimize the adverse effects, we normalize all entity document mention time series by the total document volume. Furthermore, the choice of underlying sources (that represent the social media and news streams) is limited, e.g., popular social media channels such as Tumblr, Twitter, and Facebook are not part of the dataset. There may be sampling bias in the types of sources, resulting in a similar bias in the entities. That is to say, with another set of sources, we may have had different findings. This is unavoidable.

The entity annotations that represent the starting point of our study into emerging entities cannot be assumed to be 100% accurate. So-called "cascading errors" [85] cause the overall accuracy to suffer, as imperfect tagging by SERIF is used as input for the imperfect FAKBA1 annotations. The latter annotations are estimated (from manual inspection) to contain around 9% incorrectly linked Freebase entities, with around 8% of SERIF mentions being wrongfully not linked. Moreover, the "difficult" entity links are long-tail entities, which are likely to be included in our filtered set, meaning the accuracy may be relatively worse in our subset of entities. However, manually correcting the annotations was beyond the scope of this study, and the large scale of the dataset makes it less likely that wrongfully linked entities are a major issue.

Finally, there may be a cultural bias inherent in our choice of datasets: we used English-language news sources and social media, as well as the English version of Wikipedia. As an illustration of the cultural bias inherent to the data sources, see Table 3.11 for the top 10 most frequently mentioned entities in the FAKBA1 dataset. Hence, one could claim that we studied the emergence patterns of entities for the
English-speaking part of the world.
Table 3.11: Top 10 most frequently occurring entities in the FAKBA1 dataset.

1. United States
2. United Kingdom
3. Barack Obama
4. China
5. Yahoo!
6. New York City
7. India
8. Europe
9. Canada
Following our study into the nature of entities as they emerge in online text streams, and understanding that knowledge bases are dynamic in nature, in the next two chapters we propose automated methods for discovering and searching emerging entities. More specifically, in the next chapter we address the task of predicting emerging entities in social streams (Chapter 4). Next, we turn to improving their retrieval effectiveness by constructing dynamic entity representations; these dynamic representations aim to capture the different and dynamic ways in which people may refer to or search for entities (Chapter 5).
Predicting Emerging Entities

"Nec scire fas est omnia." —Horace, Carmina, IV.
Entity linking owes a large part of its success to the extensive coverage of today's knowledge bases; they span the majority of popular and well-established entities and concepts. For most domains this broad coverage is sufficient. However, it does not provide a solid basis in domains that refer to "long-tail" entities, such as the E-Discovery domain, or in domains where new entities are constantly born, e.g., the news and social media domain. As we have seen in the previous chapter, entities may emerge (and sometimes disappear) before editors of a knowledge base reach consensus on whether an entity should be included in the knowledge base. Knowledge bases are never complete: new entities may emerge as events unfold, but at the same time, long-tail, relatively unknown entities may also be added to a KB.

As a follow-up to Chapter 3, in this chapter we turn our attention to predicting newly emerging entities, i.e., to discovering entities that appear in social media streams before they are incorporated into the KB. Identifying newly emerging entities that will be incorporated into a knowledge base is important for knowledge base construction, population, and acceleration, and finds applications in complex filtering tasks and search scenarios, where users are not just interested in finding any entity, but also in entity attributes like impact, salience, or importance. Identifying emerging entities is closely related to named-entity recognition and classification (NERC), and named-entity normalization (NEN) or disambiguation, with the additional constraint that an entity should have "impact" or be important enough to be incorporated into the knowledge base. Although impact and importance are hard to model because they depend on the context of a task or domain, we argue that entities that are included in a knowledge base are more important than those that are not, and use this signal for modeling the importance of an entity. Through this approach, our method leverages prior knowledge (as encoded in the knowledge base) for discovering newly emerging entities.

Named-entity recognition is a natural approach for identifying these newly emerging entities that are not in the knowledge base. However, current models fall short, as they do not account for the aforementioned importance or impact of the entity. In this chapter, we present an unsupervised method for generating pseudo-ground truth for training a named-entity recognizer to steer its predictions towards newly emerging entities. Our method is applicable to any trainable model for named-entity recognition. In addition, our method is not restricted to a particular class of entities, but can be trained to predict any type of entity that is in the knowledge base.

The challenge of discovering newly emerging entities is two-fold: (i) how to model the attribute of importance, and (ii) the streaming and dynamic nature of the task, where both the input data (social media) and the updates to the knowledge base (i.e., when new entities are added) come in a stream; content that is eligible for addition to Wikipedia today may no longer be in the future [273], which renders static training annotations unusable.

Our approach to discovering newly emerging entities in social streams answers both challenges. For the first challenge, we carefully craft the training set of a named-entity recognizer to steer it towards identifying newly emerging entities.
That is, we leverage prior knowledge of the entities already in the knowledge base to identify new entities that are likely to share the same attributes, and thus be candidates for inclusion in the knowledge base. Just as a named-entity recognizer trained solely on English person-type entities will recognize only such entities, a named-entity recognizer trained on knowledge base entities can be expected to recognize only this type of entity. For the second challenge, we provide an unsupervised method for generating pseudo-ground truth from the input stream. With this automated method, we are not dependent on human annotations, which are necessarily limited and domain- and language-specific, and newly added knowledge will be automatically included. We focus on social media streams because of the fast-paced evolution of their content and its unedited nature, which make them a challenging setting for predicting which entities will feature in a knowledge base. The main research question we seek to answer in this chapter is:
RQ2
Can we leverage prior knowledge of entities of interest to bootstrap the discovery of new entities of interest?

To answer this question, we propose our novel method and formulate two subquestions. The first subquestion addresses the challenge of the noisy, unedited nature of social media streams. In order to generate high-quality training data for our named-entity recognizer, we propose and experiment with two sampling methods, and answer the following subquestion:

RQ2.1 What is the utility of our sampling methods for generating pseudo-ground truth for a named-entity recognizer?

We measure utility within the task of predicting new knowledge base entities from social streams as the prediction effectiveness of a named-entity recognizer trained using our method. Next, we study the impact of the amount of prior knowledge on the effectiveness of identifying newly emerging entities. Our second subquestion is:

RQ2.2 What is the impact of the size of prior knowledge on predicting new knowledge base entities?
Figure 4.1: Our approach for generating pseudo-ground truth for training a NERC method, and for predicting emerging entities.

Approach

We view the task of identifying new KB entities in social streams as a combination of an entity linking (EL) problem and a named-entity recognition and classification (NERC) problem. We visualize our method in Figure 4.1. Starting from a document (social media post) in a document stream (social media stream), we extract sentences, and use an EL system to identify referent KB entities in each sentence. If any is identified, the sentence is pooled as a candidate training example for our NERC method (we refer to this type of sentence as a linkable sentence); otherwise, it is routed to our NERC method for identifying newly emerging entities (unlinkable sentences). An underlying assumption behind our method is that the first place to look for emerging entities is the set of unlinkable sentences; a minimal sketch of this routing step is shown below.

Most of our attention in this chapter is devoted to training the NERC method. Two ideas are important here. First, we extend the distributional hypothesis [86] (i.e., words that occur in the same contexts tend to have similar meaning) from words to entities; we hypothesize that emerging entities that should be included in the knowledge base occur in similar contexts as current knowledge base entities. Second, we apply EL on the input stream and transform its output into pseudo-ground truth for NERC; this results in an unsupervised way of generating pseudo-ground truth, with the flexibility of choosing any type of entity or concept described in the KB.
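The routing sketch below illustrates the split from Figure 4.1; `entity_linker` and `nerc_model` are hypothetical callables standing in for the EL and NERC systems, and the sentence splitting is deliberately naive:

```python
def route_sentences(documents, entity_linker, nerc_model):
    """Split a document stream into NERC training material and NERC input.

    `entity_linker(sentence)` is assumed to return a (possibly empty)
    list of (mention, entity) pairs; `nerc_model(sentence)` predicts
    emerging entities on sentences the linker cannot resolve.
    """
    training_pool, predictions = [], []
    for doc in documents:
        for sentence in doc.split("."):            # naive sentence splitting
            links = entity_linker(sentence)
            if links:                               # linkable sentence:
                training_pool.append((sentence, links))  # pseudo-ground truth candidate
            else:                                   # unlinkable sentence:
                predictions.extend(nerc_model(sentence))  # look for emerging entities
    return training_pool, predictions
```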
We start with the output of an entity linking method [176, 199] (http://semanticize.uva.nl), applied to a sentence of a document in the stream. This output is our source for generating training material. The output consists of tuples of entity mentions and entities (m, e pairs in Figure 4.1). Since we are allowed to use generic corpora from any domain, e.g., news, microblog posts, emails, chat logs, we may expect to have noise in our pseudo-ground truth. We apply various sampling methods to select sentences that make up a high-quality training corpus. These sampling methods are described in Section 4.4.

After sampling, we convert the remaining sentences into a format suitable as input for our NERC method. This format consists of the entity span, i.e., the sequence of tokens that refers to an entity (the entity mention), and the entity class for each linked entity (m, c pairs in Figure 4.1). To denote the entity span, we apply the BIO-tagging scheme [213], where each token is tagged with whether it is the Beginning of an entity mention, Inside an entity mention, or Outside of an entity mention, so that a document like:

Kendrick Lamar and A$AP Rocky. That's when I started listening again. Thanks to Brendan.

becomes:

Kendrick/B Lamar/I and/O A$AP/B Rocky/I ./O That's/O when/O I/O started/O listening/O again/O ./O Thanks/O to/O Brendan/B ./O

The final step is to assign a class label to an entity. As not all knowledge bases associate classes with their entities, we use DBpedia for looking up the entity and extracting the entity's DBpedia ontology class, if any; see Section 4.5 for details. Our example then becomes:

Kendrick/B-PER Lamar/I-PER and/O A$AP/B-PER Rocky/I-PER ./O That's/O when/O I/O started/O listening/O again/O ./O Thanks/O to/O Brendan/B-PER ./O

Now we can proceed and train our NERC method with our generated pseudo-ground truth. We do so using a two-stage approach [34], where both the first stage of recognizing the entity span and the second stage of classifying the entity type are implemented using the fast structured perceptron algorithm [45] (we use the implementation from https://github.com/larsmans/seqlearn), which treats both stages as a sequence labeling problem.
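The conversion of a linked sentence into this training format can be sketched as follows; the token-level mention spans and class labels are assumed to be given by the EL output and the DBpedia lookup.

def to_bio(tokens, mentions):
    """tokens: list of words; mentions: list of (start, end, cls)
    token spans, e.g., (0, 2, "PER") for "Kendrick Lamar"."""
    tags = ["O"] * len(tokens)
    for start, end, cls in mentions:
        tags[start] = "B-" + cls           # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I-" + cls           # tokens inside the mention
    return list(zip(tokens, tags))

tokens = "Kendrick Lamar and A$AP Rocky .".split()
print(to_bio(tokens, [(0, 2, "PER"), (3, 5, "PER")]))
# [('Kendrick', 'B-PER'), ('Lamar', 'I-PER'), ('and', 'O'),
#  ('A$AP', 'B-PER'), ('Rocky', 'I-PER'), ('.', 'O')]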
4.4 Sampling Pseudo-ground Truth

To craft a high-quality set of pseudo-ground truth for training our NERC method, we present two sampling methods in this section: (a) sampling based on the entity linking system's confidence score for a linked entity, and (b) sampling based on the textual quality of an input document.

Typically, entity linking systems provide confidence scores for each entity mention (n-gram) they are able to link to a knowledge base entity. These confidence scores can be used to rank possible entities for a mention, but also to prune mention–entity pairs for which the linker is not confident. Although the scale of the confidence score depends on the model behind the entity linking system, the scores can be normalized over the candidates for an entity mention, e.g., using linear or z-score normalization. We use the "SenseProbability" metric introduced by Odijk et al. [199] as our confidence score. It relates the probability of an n-gram being used as anchor text pointing to a specific entity e to the prior probability of the n-gram n being used as an anchor at all (also known as the commonness score [182]); see Equation 4.1:

\text{SenseProbability}(n, e) = \frac{|L_{n,e}|}{\sum_{e' \in KB} c(n, e')},    (4.1)

where L_{n,e} denotes the set of all links with n-gram n and target entity e, n denotes the n-gram to be linked (i.e., the anchor text), e denotes the candidate entity, and c(n, e) denotes the number of times that n-gram n links to candidate entity e.

Taking the textual quality of content into account has proved helpful in a range of tasks. Based on [170, 258], we consider nine features indicative of textual quality; see Table 4.1. While not exhaustive, our feature set is primarily aimed at social streams as our target document stream (see Section 4.5) and suffices for providing evidence on whether this type of sampling is helpful for our purposes. Based on these features, we compute a final score for each document d as

\text{score}(d) = \frac{1}{|F|} \sum_{f \in F} \frac{f(d)}{\max_f},    (4.2)

where F is our set of feature functions (Table 4.1) and \max_f is the maximum value of f we have seen so far in the stream of documents. Since all features are normalized in [0, 1], score(d) has this same range.

Table 4.1: Features used for sampling documents from which we train a NERC system.

Feature         Description
n mentions      No. of usernames (@)
n hashtags      No. of hashtags
avg token len   Average token length
…               …

As a qualitative check, we rank documents from the MSM2013 [19] dataset using our quality sampling method and list the top-5 and bottom-5 scoring documents in Table 4.2. Top-scoring documents are longer and denser in information than low-scoring documents. We assume that these documents are better examples for training a NERC system.

In the next section, we follow a linear search approach to sampling training examples as input for NERC. First, we find an optimal threshold for confidence scores, and fix it. For sampling based on textual quality, we turn to the MSM2013 dataset to determine sampling thresholds. We calculate the scores for each tweet, and scale them to fall within [0, 1]. We then plot the distribution of scores, and bin this distribution into three parts: tweets that fall within a single standard deviation of the mean are considered normal, tweets to the left of this bin are considered noisy, whilst the remaining tweets to the right of the distribution are considered nice. We repeat this process for our tweet corpus, using the bin thresholds gleaned from the MSM2013 set.
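Both sampling signals are straightforward to compute. The sketch below, under the assumption that anchor statistics and per-document feature values are available, implements the SenseProbability score of Equation 4.1 and the quality score of Equation 4.2 with the three-way binning just described.

import statistics
from collections import defaultdict

def sense_probability(links):
    """links: iterable of (ngram, entity) anchor pairs (Eq. 4.1)."""
    c = defaultdict(int)       # c(n, e)
    total = defaultdict(int)   # sum over e' of c(n, e')
    for n, e in links:
        c[(n, e)] += 1
        total[n] += 1
    return {(n, e): cnt / total[n] for (n, e), cnt in c.items()}

def quality_score(doc, features, running_max):
    """Eq. 4.2; running_max is initialized to [0.0] * len(features)
    and tracks the maximum feature value seen so far in the stream."""
    score = 0.0
    for i, f in enumerate(features):
        value = f(doc)
        running_max[i] = max(running_max[i], value)
        score += value / running_max[i] if running_max[i] else 0.0
    return score / len(features)

def bin_documents(scores):
    """Bin documents into noisy/normal/nice by mean +/- one stddev."""
    mean, std = statistics.mean(scores), statistics.pstdev(scores)
    bins = {"noisy": [], "normal": [], "nice": []}
    for i, s in enumerate(scores):
        if s < mean - std:
            bins["noisy"].append(i)
        elif s > mean + std:
            bins["nice"].append(i)
        else:
            bins["normal"].append(i)
    return bins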
Table 4.2: Ranking of documents in the MSM2013 dataset based on our quality sampling method. Top-ranking documents appear longer and denser in information than low-ranking documents.

Top-5 quality documents

"Watching the History channel, Hitler's Family. Hitler hid his true family heritage, while others had to measure up to Aryan purity."
"When you sense yourself becoming negative, stop and consider what it would mean to apply that negative energy in the opposite direction."
"So. After school tomorrow, french revision class. Tuesday, Drama rehearsal and then at 8, cricket training. Wednesday, Drama. Thursday ... (c)"
"These late spectacles were about as representative of the real West as porn movies are of the pizza delivery business Que LOL"
"Sudan's split and emergence of an independent nation has politico-strategic significance. No African watcher should ignore this."

Bottom-5 quality documents

"Toni Braxton ~ He Wasnt Man Enough for Me HASHTAG HASHTAG ? URL RT Mention"
"tell me what u think The GetMore Girls, Part One URL"
"this girl better not go off on me rt"
"you done know its funky! – Bill Withers" "Kissing My Love"" URL via Mention"
"This is great: URL via URL"
4.5 Experimental Setup

In addressing the problem of predicting emerging entities in social media streams, we concentrate on developing an unsupervised method for generating pseudo-ground truth for a NERC method, and predicting newly emerging entities. In particular, we want to know the effectiveness of our unsupervised pseudo-ground truth (UPGT) method over a random baseline and a lexical matching baseline, and the impact on effectiveness of our two sampling methods. To answer these questions, we conduct both optimization and prediction experiments.
As our document stream, we use tweets from the TREC 2011 Microblog dataset [233], a collection of 4,832,838 unique English Twitter posts. This choice is motivated by the unedited and noisy nature of tweets, which can be challenging for prediction. Our knowledge base (KB) is a subset of Wikipedia from January 4, 2012, restricted to entities that correspond to the NERC classes person (PER), location (LOC), or organization (ORG). We use DBpedia to perform this selection (http://mappings.dbpedia.org/server/ontology/classes/), and map the DBpedia classes Organisation, Company, and Non-ProfitOrganisation to ORG; Place, PopulatedPlace, City, and Country to LOC; and Person to PER. Our final KB contains 1,530,501 entities.
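A sketch of this class selection, with the mapping table as described above (the lookup of an entity's DBpedia classes is assumed, and class names for a concrete DBpedia dump may differ):

DBPEDIA_TO_NERC = {
    "Organisation": "ORG", "Company": "ORG",
    "Non-ProfitOrganisation": "ORG",
    "Place": "LOC", "PopulatedPlace": "LOC",
    "City": "LOC", "Country": "LOC",
    "Person": "PER",
}

def nerc_class(dbpedia_classes):
    """Map an entity's DBpedia ontology classes to PER/LOC/ORG;
    entities without a mapped class are excluded from the KB."""
    for cls in dbpedia_classes:
        if cls in DBPEDIA_TO_NERC:
            return DBPEDIA_TO_NERC[cls]
    return None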
To study the utility of our sampling methods, we turn to the impact of setting a threshold on the entity linking system's confidence score (Experiment Ia.) and the effectiveness of our textual quality sampling (Experiment Ib.).

In Experiment Ia., we perform a sweep over thresholds between 0 and 1, in steps of 0.1, using the same threshold for both the generation of pseudo-ground truth and evaluating the prediction effectiveness of emerging entities. Lower thresholds allow low-confidence entities into the pseudo-ground truth, and likely generate more data at the expense of noisy output. We emphasize that we are not interested in the correlation between noise and confidence score, but rather in the performance of finding emerging entities given the entity linking system's configuration. In Experiment Ib., we compare our method's performance with differently sampled pseudo-ground truths, i.e., by sampling tweets that belong to either the nice, normal, or noisy bins, following our textual quality sampling method (cf. Section 4.4.2).

To answer our second research question, and study the impact of prior knowledge on detecting emerging entities, we compare the performance of our method (UPGT) to two baselines: a random baseline (RB) that extracts all n-grams from test tweets and considers them emerging entities, and a lexical-matching baseline (NB) that follows our approach, but generates pseudo-ground truth by applying lexical matching of entity titles instead of an EL system, and refrains from sampling based on textual quality. For this experiment, we use the optimal sampling parameters for generating pseudo-ground truth from our previous experiment, i.e., we include linked entities with a confidence score higher than 0.7, and use only normal tweets. As we will show in Section 4.6, the threshold of 0.7 balances performance with high recall of entities in the pseudo-ground truth.

We evaluate the quality of the generated pseudo-ground truth on the effectiveness of a NERC system trained to predict newly emerging entities. As measuring the addition of new entities to the knowledge base is non-trivial, we consider a retrospective scenario: given our KB, we randomly sample entities to yield a smaller KB (KB_s). This KB_s simulates the available knowledge at the present point in time, whilst the full KB represents the future state. By measuring how many entities we are able to detect in our corpus that feature in KB, but not in KB_s, we can approximate the newly emerging entity prediction task. We create KB_s by taking random samples of 20–90% of the size of KB (measured in number of entities), in steps of 10%. We repeat each sampling step ten times to avoid bias.

For each KB_s, we generate pseudo-ground truth for training, and test sets for evaluation. We use KB_s to link the corpus of tweets, and yield two sets of tweets: (a) tweets with linked entities, and (b) tweets that may contain emerging entities, analogous to the linked and unlinked sentences from Figure 4.1. The size of these two sets depends on the size of KB_s: a smaller KB will yield a larger set of unlinked tweets. This makes comparisons of results across different KB_s difficult. We cater for this bias by randomly sampling 10,000 tweets from both the test set and the pseudo-ground truth, and repeating our experiments ten times. Using the smallest KB_s (20%) results in about 15,000 tweets in the pseudo-ground truth. Ground truth is assembled by linking the corpus of tweets using the full KB. The ground truth consists of 82,305 tweets, with 12,488 unique entities.

We evaluate the effectiveness of our method in two ways: (a) the ability of NERC to generalize from our pseudo-ground truth, and (b) the accuracy of our predictions. For the first, we compare the predicted entity mentions to those in our ground truth, akin to traditional NERC evaluation. For the second, we take the set of correctly predicted entity mentions (true positives), and link them to their referent entities in the ground truth. This allows us to measure what we are actually interested in: the fraction of newly discovered entities. For both types of evaluation we report average precision and recall over 100 runs per KB_s. Statistical significance is tested using a two-tailed paired t-test and is marked ▲ for significant differences at α = 0.05.
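The retrospective setup can be sketched as follows; this is a simplified sketch in which the KB is a list of entity identifiers and the linking step is assumed.

import random

def retrospective_samples(kb, fractions=(0.2, 0.3, 0.4, 0.5,
                                         0.6, 0.7, 0.8, 0.9),
                          repeats=10, seed=42):
    """Yield (fraction, KB_s, emerging) triples: KB_s simulates the
    present state of the KB, and entities in KB but not in KB_s play
    the role of newly emerging entities."""
    rng = random.Random(seed)
    for fraction in fractions:
        size = int(fraction * len(kb))
        for _ in range(repeats):  # repeated sampling to avoid bias
            kb_s = set(rng.sample(kb, size))
            yield fraction, kb_s, set(kb) - kb_s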
4.6 Results

Our first experiment aims to answer RQ2.1:

RQ2.1 What is the utility of our sampling methods for generating pseudo-ground truth for a named-entity recognizer?

For this experiment, we fix the size of KB_s at 50%. We start by looking at the ability of NERC to generalize from our pseudo-ground truth, measured on two aspects: (i) effectiveness for identifying mentions of newly emerging entities (i.e., entity mentions), and (ii) predicting newly emerging entities (i.e., entities).

In Experiment Ia., we look at our confidence-based sampling method. For identifying mentions of newly emerging entities, we find that effectiveness peaks at the 0.1 confidence threshold with a precision of 38.84%, dips at 0.2, and slowly picks up to 35.75% as the threshold increases (Figure 4.2, left). For identifying the newly emerging entities that are referred to by the mentions, effectiveness positively correlates with the threshold. Effectiveness peaks at the 0.8 confidence threshold, statistically significantly different from 0.7 but not from 0.9 (Figure 4.2, right).
Figure 4.2: Experiment Ia. Impact of confidence score on UPGT. Effectiveness of identifying entity mentions is shown in the left plot, effectiveness of identifying newly emerging entities in the right. The threshold on the confidence score is on the x-axis; precision (solid lines) and recall (dotted lines) are shown on the y-axis.

Table 4.3: Number of predicted emerging entity mentions (P) per threshold on the confidence score (T). GT = ground truth.
T    0.1    0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
P    6,653  …    …    …    …    …    …    …    …
GT   …      …    …    …    …    …    …    …    …

Interestingly, besides precision, recall also shows a positive correlation with the threshold. This suggests that in identifying newly emerging entities, missing training labels are likely to have less impact on performance than generating incorrect, noisy labels. This is an interesting finding, as it sets the emerging entity prediction task apart from traditional NERC, where low recall due to incomplete labeling is a well-understood challenge.

Next, we turn to the characteristics of the pseudo-ground truth that results from each of these thresholds, and provide an analysis of their potential impact on effectiveness. We find that more data through a larger pseudo-ground truth allows NERC to better generalize and predict a larger number of emerging entities. This claim is supported by the number of predicted entity mentions per threshold in Table 4.3. We find a similar trend as in the precision and recall graph above: the number of predicted emerging entity mentions peaks for the threshold at 0.1 (6,653 entity mentions), drops between 0.2 and 0.4, and picks up again from 0.5, reaching another local maximum at 0.8. The increasing number of predicted emerging entity mentions with stricter thresholds indicates that the NERC model is more successful in learning patterns for separating entities from noisy labels. This may be due to the entity linking system linking only those entities it is most confident about, providing a clearer training signal for NERC.

For the rest of our experiments we use a threshold of 0.7 on the confidence score, because it balances performance with high recall of entities in the pseudo-ground truth.

In Experiment Ib., we compare three sampling strategies based on the textual quality of documents: training on tweets that fall in the normal bin (i), tweets that fall in the nice bin (ii), or tweets that fall in both the normal and nice bins (iii); see Section 4.4. For reference, we also report on the performance achieved when no textual quality sampling is used. We keep KB_s fixed at 50%, and use 0.7 as the confidence threshold.

Table 4.4: Experiment Ib. Precision and recall for three sampling strategies based on textual quality of documents: nice, normal, normal+nice (mixed). We also report on the effectiveness of not using sampling by textual quality for reference. Boldface indicates best performance. Statistical significance is tested against the previous sampling method, e.g., nice to normal.

                      Mention                 Entity
Sampling              Precision   Recall      Precision   Recall
No sampling           34.26       …           …           …
Normal+nice (mixed)   …▲          …▲          …▲          …▲
Normal                66.09▲      …▲          …▲          …▲
Nice                  70.36▲      …           …▲          …

Textual quality-based sampling turns out to be twice as effective as no sampling for both identifying mentions of emerging entities and identifying newly emerging entities. Among our sampling strategies, nice proves to be the most effective, with a precision of 70.36% for entity mention identification. In emerging entity prediction, the performance of the nice and normal strategies hovers around the same levels. In terms of recall, the nice and normal methods are on par, outperforming both other strategies. The success of the nice and normal sampling methods can be attributed to the fact that a more coherent and homogeneous training corpus allows the NERC model to more easily learn patterns.

Next, we seek to study the impact of the size of the prior knowledge that is available to our emerging entity prediction method. We do so by answering RQ2.2:
RQ2.2 What is the impact of the size of prior knowledge on predicting newly emerging entities?

We use the optimal combination of our sampling methods from the previous experiments, i.e., a confidence threshold of 0.7, and the normal textual quality sampling. We again look at the effectiveness of our methods in both identifying emerging mentions and identifying emerging entities.

Figure 4.3 shows the effectiveness of our methods as a function of the size of the knowledge base (i.e., the size of KB_s). For identifying mentions of emerging entities, our method (UPGT, blue line) consistently and statistically significantly outperforms both the lexical (red line) and random baselines (not shown). In terms of recall, the lexical baseline is on par with UPGT for KB sizes up to 30%. The random baseline shows very low precision for both identifying mentions of emerging entities and predicting emerging entities over all KB_s, but almost perfect recall, which is expected given that it assigns all possible n-grams as emerging entity mentions and newly emerging entities (0.69% precision and 65% recall for entity mention identification, 1.82% precision and 94.95% recall for emerging entity prediction).

Figure 4.3: Our method (UPGT, blue line) versus the lexical baseline (red) for both identifying mentions of emerging entities (left) and newly emerging entity detection (right). Knowledge base size is on the x-axis; precision (solid lines) and recall (dotted lines) are marked on the y-axis.

Next, we take a closer look at the results. The lexical baseline's recall increases slightly when more prior knowledge is added to the knowledge base. This is expected behavior because, as we saw in our previous experiment, the pseudo-ground truth gets more diverse labels, which helps the NERC to generalize. The number of unique entities in the pseudo-ground truth increases with the size of KB_s. UPGT assigns labels to 2,500 unique entities at 20% KB_s, which tops out at 11,000 unique entities at 90%. These numbers are lower for the lexical baseline (1,800 at 20% KB_s, and 7,000 at 90% KB_s). However, for both methods, the number of entities in the ground truth stays around the same. The gradual improvement in precision and recall of UPGT for increasing KB_s can be attributed to a broader coverage for labeling (observed by looking at the prior entities), and to the main distinction between the lexical baseline and UPGT: stricter labeling through leveraging the entity linking system's confidence score.

Finally, to better understand the performance of our method, we have looked at the set of correct predictions (emerging entities), and at false positives, i.e., incorrectly identified emerging entities. Our analysis revealed examples of actual emerging entities, i.e., entities that were not included in the initial KB, but did exist in the Wikipedia dump. This may be attributed to, e.g., missing DBpedia class labels, which highlights the challenging setting of evaluating the task. On the whole, however, our method is able to deal with missing labels and incomplete data, as observed through its consistent and stable precision, justifying our assumption that data is incomplete by design.

4.7 Conclusion

In this chapter, we tackled the problem of predicting emerging entities in social streams, and presented an effective method of leveraging prior knowledge to bootstrap the discovery of new entities of interest. We set out to answer the following question:
RQ2 Can we leverage prior knowledge of entities of interest to bootstrap the discovery of new entities of interest?

We presented an unsupervised method for generating pseudo-ground truth using an EL method with a reference KB. The pseudo-ground truth serves as training data for a NERC method for detecting emerging entities that are likely to be incorporated in a KB. To answer RQ2 we formulated and answered two subquestions. In Section 4.6.1 we answer our first subquestion:
RQ2.1 What is the utility of our sampling methods for generating pseudo-ground truth for a named-entity recognizer?

We do so by introducing and studying two different sampling methods. The first sampling method is based on the entity linking system's confidence score, where we hypothesize that keeping only high-confidence links will yield less training data, but of higher quality, resulting in better predictions. The second sampling method is based on the textual quality of the input documents. We find that sampling by textual quality improves the performance of NERC and consequently our method's performance in predicting emerging entities. As setting a higher threshold on the entity linking system's confidence score for generating pseudo-ground truth results in fewer labels but better performance, we show that the NERC is better able to separate noise from entities that are worth including in a KB. The entity linker's confidence score is an effective signal for this separation. Both our sampling methods significantly improve emerging entity prediction.

Next, in Section 4.6.2, we answer our second subquestion:
RQ2.2 What is the impact of the size of prior knowledge on predicting new KB entities?

We do so by studying the impact of differently sized (seed) knowledge bases, i.e., different amounts of prior knowledge on the entities of interest. We find that in the case of a small amount of prior knowledge, i.e., a limited size of the available initial knowledge, our method is able to cope with missing labels and incomplete data, as observed through its consistent and stable precision. This finding justifies our proposed method, which assumes incomplete data by design. Furthermore, this finding suggests the scenario of an increasing rate of emerging entity prediction, as more data is fed back to the KB. Additionally, we found that a larger number of entities in the KB allows for setting a desirably stricter threshold on the confidence scores, and leads to improvements in both precision and recall. This finding suggests that an adaptive threshold that takes prior knowledge into account could prove effective.
In summary, we have shown that we can effectively leverage prior knowledge (through a reference KB) to detect similar entities that are not in the KB. An implication of this finding in the discovery context is that this method allows us to effectively support the exploratory search process, as it successfully identifies entities that are similar to a set of seed entities of interest.

Our work has several limitations. We studied the emerging entity prediction task in a streaming scenario. However, due to the limited availability of suitable datasets, we employ a retrospective scenario in Experiment II (Section 4.5.3), where we randomly sample entities from the KB for measuring the impact of the amount of prior knowledge on prediction performance. Taking this retrospective scenario makes it impossible to measure the impact of the novelty of entities as they emerge. By repeating each experiment ten-fold, we alleviate the problem of keeping or removing popular head entities, which impact performance more than tail entities, but a more realistic scenario, e.g., using a comprehensive archive of tweets spanning multiple months, would allow us to sample the KB by time. This scenario, similar to the scenario we explored in Chapter 3, would be able to provide more insights into the relation between social media dynamics and the prediction accuracy of emerging entities.

A natural extension to the work presented in this chapter would be to include a subsequent entity clustering or disambiguation step, and ultimately to "feed back" emerging entities to the KB. One approach could be to adapt the work of creating keyphrase-based representations as seed representations [120]. However, as we have seen in Chapter 3, entities are not static, and may appear in sudden bursts of documents, e.g., through a peak in interest as real-world events unfold. In the next chapter, we address this dynamic nature of entities, in the context of entity retrieval. We propose a method for dynamically constructing entity representations by leveraging the way in which people refer to entities online, to improve the retrieval effectiveness of entities.
5 Retrieving Emerging Entities

"And what are my words worth if I no longer weigh them?"
— Sticks, Waar Wacht Je Op?
In the third and final chapter of this first part of the thesis we focus on retrieving entities of interest from the knowledge base. As entities play a central role in exploratory search and in answering the 5 W's [268], ranking entities in response to user-issued search queries is an important, but challenging, task. In this chapter, we study entity ranking, where the goal is to position a relevant entity from the knowledge base at the top of the ranking for a given query.

We have seen in Chapter 3 that entities may suddenly appear in public discourse, e.g., through events as they unfold in the real world. Due to the dynamic nature of entities in public discourse, the way in which people address or refer to entities may suddenly change, providing challenges for traditional retrieval methods that rely on static representations. Here, we address the dynamic nature of entities in the general web search scenario, in which a searcher enters a query that can be satisfied by returning an entity from Wikipedia. We propose a novel method that collects information from external sources and dynamically constructs entity representations, by optimally combining different entity descriptions from external sources into a single entity representation. The method learns directly from users' past interactions (i.e., searches) to adapt the entity representation towards improved retrieval effectiveness of entities.

The general web search engine scenario is motivated by the fact that many queries issued to general web search engines are related to entities [145]. Entity ranking is therefore becoming an ever more important task [13, 14, 57–59]. Entity ranking is inherently difficult due to the potential mismatch between the entity's description in a knowledge base and the way people refer to the same entity when searching for it. When we look at how entities are described, two aspects, context and time, are of particular interest and pose challenges to any solution to entity ranking. Here we explain both aspects.
Context dependency: Consider the entity Germany. A history student could expect this entity to show up when searching for entities related to World War II. In contrast, a sports fan searching for World Cup 2014 soccer results is also expecting to find the same entity. The challenge then becomes how to capture these different contexts for one single entity.
Time dependency: Entities are not static in how they are perceived. Consider Ferguson, Missouri, which had a fairly standard city description before the shooting of Michael Brown happened in August 2014. After this event, the entity description of Ferguson changed substantially, reflecting people's interest in the event, its aftermath, and its impact on the city.

We propose a method that addresses both challenges raised above. First, we use the collective intelligence offered by a wide range of entity "description sources" (e.g., tweets and tags that mention entities), and we combine these into a "collective entity representation," i.e., a representation that encapsulates the different ways in which people refer to or talk about the entity. Consider the example in Figure 5.1, in which a tweet offers a very different way to refer to the entity Anthropornis than the original knowledge base description does.

Figure 5.1: Entity description of Anthropornis in Wikipedia and a tweet with an alternative description of the same entity.

Second, our method takes care of the time dependency by incorporating dynamic entity description sources, which in turn affect the entity descriptions in near real time. Dynamics is part of our method in two ways: (i) we leverage dynamic description sources to expand entity representations, and (ii) we learn how to combine the different entity descriptions for optimal retrieval at specific time intervals. The resulting dynamic collective entity representations capture both the different contexts of an entity and its changes over time. We refer to the collection of descriptions from different sources that are associated with an entity as the entity's representation. Our method is meant to construct the optimal representation for retrieval, by assigning weights to the descriptions from the different sources.

Collecting terms associated with documents and adding them to the document (document expansion) is in itself not a novel idea. Previous work has shown that it improves retrieval effectiveness in a variety of retrieval tasks such as speech retrieval [232] and ad-hoc web search [73, 180, 259, 270]. However, while information retrieval on the open web is inherently dynamic, document expansion for search has mainly been studied in static, context-independent settings, in which expansion terms from one source are aggregated and added to the collection before indexing [147, 180, 185, 259]. In contrast, our method leverages different dynamic description sources (e.g., queries, social media, web pages), and in this way uses collective intelligence to bridge the gap between the terms used in a knowledge base's entity descriptions and the terms that people use to refer to entities.

To achieve this, we represent entities as fielded documents [162], where each field contains content that comes from a single description source. In a dynamic setting such as our entity ranking setting, where new entity descriptions come in as a stream, learning weights for the various fields in batch is not optimal. Ideally, the ranker continuously updates its ranking model to successfully rank the entities and incorporate newly incoming descriptions. Hence, constructing a dynamic entity representation for optimal retrieval effectiveness boils down to dynamically learning to optimally weight the entity's fields that hold content from the different description sources. To this end we exploit implicit user feedback (i.e., clicks) to retrain our model and continually adjust the weights associated with the entity's fields, much like online learning to rank [121].

Our dynamic collective entity representations generate one additional challenge, which is related to the heterogeneity that exists among entities and among description sources. Popular head entities are likely to receive a larger number of external descriptions than tail entities. At the same time, the description sources differ along several dimensions (e.g., volume, quality, novelty). Given this heterogeneity, linearly combining retrieval scores (as is commonly done in structured retrieval models) proves to be suboptimal. We therefore extend our method to include features that enable the ranker to distinguish between different types of entities and stages of entity representations. The main research question we seek to answer in this chapter is:
RQ3 Can we leverage collective intelligence to construct entity representations for increased retrieval effectiveness of entities of interest?

To answer this question, we formulate and seek to answer the following three subquestions. First, we check the underlying assumption of our method, and answer our first subquestion:

RQ3.1 Does entity ranking effectiveness increase by using dynamic collective entity representations?

To answer this question, we compare a baseline entity ranking method based on information in the knowledge base only to our method that incorporates additional description sources (web anchors, queries, tags, and tweets).

Next, we extend our method, and study the contribution of informing the entity ranker of the entity's description state. We seek to answer the second subquestion:

RQ3.2 Does entity ranking effectiveness increase when employing additional field and entity importance features?

We answer this question by adding field and entity importance features to our knowledge base-only baseline ranker and to our proposed method, and comparing their performance. Finally, we answer this chapter's third research question, by studying the dynamic aspect of entity ranking:

RQ3.3 Does entity ranking effectiveness increase when we continuously learn the optimal way to combine the content from different description sources?

We compare a static entity ranking baseline that is trained once at the start to our proposed method that is retrained at regular intervals.

The main contribution of this chapter is a novel approach to constructing dynamic collective entity representations, which takes the temporal and contextual dependencies of entity descriptions into account. We show that dynamic collective entity representations better capture how people search for entities than their original knowledge base descriptions do. In addition, we show how field importance features better inform the ranker, thus increasing retrieval effectiveness. Furthermore, we show how continuously updating the ranker enables higher ranking effectiveness. Finally, we perform extensive analyses of our results and show that incorporating dynamic signals into the dynamic collective entity representation enables a better matching of users' queries to entities.
5.2 Dynamic Collective Entity Representations

The problem of entity ranking is: given a query q and a knowledge base KB populated with entities e ∈ KB, find the best matching e that satisfies q. Both e and q are represented in some (individual or joint) feature space that captures a range of dimensions which characterize them individually (e.g., content, quality) and jointly (e.g., relationships through click logs).

The entity ranking problem itself is a standard information retrieval problem, where the system needs to bridge the gap between the vocabulary used by users in queries and the vocabulary of the entity descriptions. This is a long-standing but still open problem that has been tackled from many perspectives. One is to design better similarity functions; another is to develop methods for enhancing the feature spaces. Our method shares characteristics with both perspectives, as we will now explain.

Our approach to the entity ranking problem consists of two interleaved steps. First, we use external description sources (described in Section 5.2.3) to expand entity representations and reduce the vocabulary gap between queries and entities. We do so by representing entities as fielded documents, where each field corresponds to content that comes from one description source. Second, we train a classification-based entity ranker that employs different types of features to learn to weight and combine the content from each field of the entity for optimal retrieval (Section 5.2.4).

External description sources may continually update and change the content in the entity's fields, through user feedback (i.e., clicks) following an issued query, or when users generate content on external description sources that is linked to a KB entity (e.g., Twitter, Delicious). Consequently, the feature values that represent the entities change, which may invalidate previously learned optimal feature weights and asks for continuously updating the ranking model.
To construct dynamic collective entity representations we use two types of description sources: (i) a knowledge base (KB) from which we extract the initial entity representations, and (ii) a set of external description sources that we use to expand the aforementioned entity representation.

Although in theory all types of external sources are allowed in the model, choosing sources that provide short text descriptions for an entity is favorable, as they inherently "summarize" entities in a few keywords and do not require an additional step for filtering salient keywords. Figure 5.2 provides an example of expanding the representation of the entity Tupac Shakur (https://en.wikipedia.org/wiki/Tupac_Shakur) from several sources, each contributing somewhat different keywords.

We differentiate between external description sources that are non-timestamped (static) and ones that are timestamped (dynamic). Non-timestamped sources are those where no time information is available, and sources that are not inherently dynamic, e.g., web archives and aggregates over archives like web anchors. Timestamped external description sources are sources whose content is associated with a timestamp, and where the nature of the source is inherently dynamic or time-dependent, e.g., tweets or query logs. We describe each type of external description source below, with the total number of descriptions and affected entities, and we provide a summary in Table 5.1.

Initial Entity Representation
Knowledge base.
The knowledge base that we use as our initial index of entities, and which we use to construct the initial entity representations, is a snapshot of Wikipedia from August 3, 2014 with 14,753,852 pages. We filter out non-entity pages ("special" pages such as category, file, and discussion pages), yielding 4,898,356 unique entities. The initial entity representations consist of the title and body (i.e., article content) of the Wikipedia page.
Static Description Sources
Knowledge base.
Knowledge base entities have rich metadata that can be leveraged for improving retrieval [12, 226, 230]. We consider four types of metadata to construct the KB entity representations: (i) anchor text of inter-knowledge base hyperlinks, (ii) redirects, (iii) category titles, and (iv) titles of entities that are linked from and to each entity. Editorial conventions and Wikipedia's quality control ensure that these expansions are of high quality.

Figure 5.2: Example expansions for the entity Tupac Shakur, drawn from knowledge base descriptions (title, text, categories, links), static descriptions (Wikipedia anchors, Wikipedia redirects, web anchors), and dynamic descriptions (tags, tweets, queries).
Web anchors.
Moving away from the knowledge base itself, the web provides rich information on how people refer to entities, leading to tangible improvements in retrieval [259]. We extract anchor texts of links to Wikipedia pages from the Google Wikilinks corpus (https://code.google.com/p/wiki-links/). We collect 9,818,004 anchor texts for 876,063 entities. Web anchors differ from KB anchors, as they can be of lower quality (due to the absence of editorial conventions) but also of much larger volume. While in theory web anchors could be associated with timestamps, in a typical scenario they are aggregated over large archives, where extracting timestamps for diverse web pages is non-trivial.

Table 5.1: Summary of the nine description sources we consider: knowledge base entity descriptions (KB), KB anchors, KB redirects, KB category titles, KB inter-hyperlinks, queries, web anchors, tweets, and tags from Delicious.

Data source     Size        Period       Affected entities

Static expansion sources
KB              4,898,356   August 2014  –
KB anchors      15,485,915  August 2014  4,361,608
KB redirects    6,256,912   August 2014  N/A
KB categories   1,100,723   August 2014  N/A
KB inter-links  28,825,849  August 2014  4,322,703

Dynamic expansion sources
Queries         47,002      May 2006     18,724
Web anchors     9,818,004   2012         876,063
Twitter         52,631      2011–2014    38,269
Delicious       4,429,692   2003–2011    289,015
Dynamic Description Sources
Twitter.
Mishne and Lin [185] show how leveraging terms from tweets that do not occur in the pages those tweets link to can improve the retrieval effectiveness of those pages. We follow a similar approach and mine all English tweets that contain links to Wikipedia pages that represent the entities in our KB. These are extracted from an archive of Twitter's sample stream, spanning four years (2011–2014), resulting in 52,631 tweets for 38,269 entities.
Delicious.
Social tags are concise references to entities and have been shown to outperform anchors in several retrieval tasks [194]. We extract tags associated with Wikipedia pages from SocialBM0311, resulting in 4,429,692 timestamped tags for 289,015 entities.

Queries.
We use a publicly available query log from MSN sampled between May 1 and May 31, 2006, consisting of 15M queries and their metadata: timestamps and URLs of clicked documents. We keep only queries that result in clicks on Wikipedia pages that exist in our snapshot of Wikipedia, resulting in 47,002 queries associated with 18,724 uniquely clicked entities. We hold out 30% of the queries for development (e.g., parameter tuning, feature engineering; 14,101 queries) and use 70% (32,901 queries) for testing. Our use of queries is twofold: (i) queries are used as input to evaluate the performance of our entity ranking approach, and (ii) as an external description source, to expand the entity description with the terms from a query that yields a click on an entity. While this dual role of queries may promote head entities that are often searched, we note that the "headness" of entities differs across description sources, and even tail entities may benefit from external descriptions (as illustrated by the Anthropornis example in Fig. 5.1).
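As an illustration of how such dynamic descriptions can be mined, the sketch below pairs tweets with the entities whose Wikipedia pages they link to; the URL pattern and the plain-string tweet format are simplifying assumptions, not the exact pipeline used here.

import re

WIKI_URL = re.compile(r"https?://en\.wikipedia\.org/wiki/(\S+)")

def tweet_descriptions(tweets):
    """Yield (entity, description) pairs for tweets that link to a
    Wikipedia page; the tweet text (minus the link) serves as a
    timestamped description of the linked entity."""
    for tweet in tweets:
        for entity in WIKI_URL.findall(tweet):
            yield entity, WIKI_URL.sub("", tweet).strip()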
The second step in our method is to employ a supervised entity ranker that learns to weight the fields that hold content from the different description sources for optimal retrieval effectiveness. Two challenges arise in constructing collective dynamic entity representations:
Heterogeneity.
External description sources exhibit different dynamics in terms of volume and quality of content [152], and differences in the number and type of entities to which they link (see, e.g., Table 5.1). This heterogeneity causes issues both within entities, since different description sources contribute different amounts and types of content to the entity's fields, and between entities, since popular entities may receive overall more content from external description sources than tail entities.
Dynamicness.
Dynamic external description sources cause the entity's descriptions to change in near real-time. Consequently, a static ranking model cannot capture the evolving and continually changing index that follows from our dynamic scenario; hence we employ an adaptive ranking model that is continuously updated.
In the following section, we describe our supervised ranking approach, by first explaining the entity representation, then the set of features we employ for learning an optimal representation for retrieval, and finally the supervised method.
5.3 Model

To deal with dynamic entity representations, which are composed of content from different external description sources, we model entities as fielded documents:

e = \{f^e_{\text{title}}, f^e_{\text{text}}, f^e_{\text{anchors}}, \ldots, f^e_{\text{query}}\}.    (5.1)

Here, f^e corresponds to the field term vector that represents e's content from a single source (denoted in the subscript). We refer to this collection of field term vectors as the entity's representation.

The fields with content from dynamic description sources may change over time. We refer to the process of adding an external description source's term vector to the entity's corresponding field term vector as an update. To model these dynamically changing fields, we discretize time (i.e., T = \{t_0, t_1, t_2, \ldots, t_n\}), and define the updating fields as:

f^e_{\text{query}}(t_i) = f^e_{\text{query}}(t_{i-1}) + \begin{cases} \mathbf{q} & \text{if } e \text{ clicked} \\ 0 & \text{otherwise} \end{cases}    (5.2)

f^e_{\text{tweets}}(t_i) = f^e_{\text{tweets}}(t_{i-1}) + \mathbf{tweet}_e    (5.3)

f^e_{\text{tags}}(t_i) = f^e_{\text{tags}}(t_{i-1}) + \mathbf{tag}_e    (5.4)

In Equation 5.2, \mathbf{q} represents the term vector of query q, which is summed element-wise with e's query field term vector (f^e_{\text{query}}) at time t_i if e is clicked by a user who issues q. In Equation 5.3, \mathbf{tweet}_e represents the field term vector of a tweet that contains a link to the Wikipedia page of e, which also gets added element-wise to the corresponding field (f^e_{\text{tweets}}). Finally, in Equation 5.4, \mathbf{tag}_e is the term vector of a tag that a user assigns to the Wikipedia page of e.

To estimate e's relevance to a query q given the above-described representation, one could, e.g., linearly combine retrieval scores between q's term vector \mathbf{q} and each f ∈ e. However, due to the heterogeneity that exists both between the different fields that make up the entity representation and between different entities (described in Section 5.2.4), linearly combining similarity scores may be sub-optimal [223]; hence we employ a supervised single-field weighting model [162]. Here, each field's contribution towards the final score is individually weighted, through field weights learned from implicit user feedback.

To learn the optimal entity representation for retrieval, we employ three types of features that express field and entity importance. First, field similarity features are computed per field and boil down to query–field similarity scores. Next, field importance features, likewise computed per field, aim to inform the ranker of the status of the field at that point in time (i.e., to favor fields with more and novel content). Finally, we employ entity importance features, which operate at the entity level and aim to favor recently updated entities.
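A minimal sketch of such a fielded, dynamically updating entity representation, with term vectors modeled as count dictionaries; the field set is reduced for brevity (no anchors field) and whitespace tokenization is a simplification:

from collections import Counter

class Entity:
    def __init__(self, title, text):
        self.fields = {
            "title": Counter(title.split()),
            "text": Counter(text.split()),
            "query": Counter(),   # updated per Eq. 5.2
            "tweets": Counter(),  # updated per Eq. 5.3
            "tags": Counter(),    # updated per Eq. 5.4
        }

    def update(self, field, description):
        # Element-wise addition of the new description's term vector.
        self.fields[field] += Counter(description.split())

e = Entity("Tupac Shakur", "Tupac Amaru Shakur was an American rapper")
e.update("query", "tupac dead rappers")      # query that led to a click
e.update("tweets", "Happy Birthday Tupac")   # tweet linking to the entity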
Field Similarity
The first set of features models the similarity between a query and a field, which we denote as \varphi_{\text{sim}}. For query–field similarity, we compute TF × IDF cosine similarity. We define

\varphi_{\text{sim}}(q, f, t_i) = \sum_{w \in q} n(w, f(t_i)) \cdot \log \frac{|C_{t_i}|}{|\{f(t_i) \in C_{t_i} : w \in f(t_i)\}|},    (5.5)

where w corresponds to a query term, n(w, f(t_i)) is the frequency of term w in field f at time t_i, C_{t_i} is the collection of fields at time t_i, and |·| indicates set cardinality. More elaborate similarity functions can be used, e.g., BM25(F); however, we choose a parameter-less similarity function that requires no tuning. This allows us to directly compare the contribution of the different expansion fields without having additional factors play a role, such as length normalization parameters, which affect different fields in non-trivial ways.
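A sketch of this query–field similarity, computed over the Counter-based fields introduced above; the collection C_{t_i} is assumed to be the list of all field term vectors at time t_i:

import math

def field_similarity(query_terms, field, collection):
    """Eq. 5.5: TF x IDF similarity between a query and one field."""
    score = 0.0
    for w in query_terms:
        df = sum(1 for f in collection if w in f)  # fields containing w
        if df and w in field:
            score += field[w] * math.log(len(collection) / df)
    return score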
Field Importance

The next set of features is also computed per field; \varphi_p is meant to capture a field's importance at time t_i:

\varphi_p(f(t_i), e) = S(f(t_i), e).    (5.6)

We instantiate four different field importance features. First, we consider two ways to compute a field's length, either in terms (5.7) or in characters (5.8):

\varphi_{p_1}(f(t_i), e) = |f(t_i)|    (5.7)

\varphi_{p_2}(f(t_i), e) = \sum_{w \in f(t_i)} |w|    (5.8)

The third field importance scoring function captures a field's novelty at time t_i, to favor fields that have been updated with previously unseen terms, newly associated with the entity (i.e., terms that were not in the original entity representation at t_0):

\varphi_{p_3}(f(t_i), e) = |\{w \in f(t_i) : w \notin f(t_0)\}|.    (5.9)

The fourth field importance scoring function expresses whether a field has undergone an update at time t_i:

\varphi_{p_4}(f, t_i, e) = \sum_{j=0}^{i} \begin{cases} 1 & \text{if } \mathit{update}(f(t_j)) \\ 0 & \text{otherwise,} \end{cases}    (5.10)

where \mathit{update}(f(t_j)) is a Boolean function indicating whether field f was updated at time j, i.e., we sum from t_0 through t_i and accumulate the updates to the fields.
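The four field importance features can be sketched as follows, again over Counter-based fields; the update log is an assumed list of (timestamp, field name) events:

def length_in_terms(field):                    # Eq. 5.7
    return sum(field.values())

def length_in_chars(field):                    # Eq. 5.8
    return sum(len(w) * n for w, n in field.items())

def novelty(field, field_t0):                  # Eq. 5.9
    return len(set(field) - set(field_t0))

def n_updates(update_log, field_name, t_i):    # Eq. 5.10
    return sum(1 for t, f in update_log if f == field_name and t <= t_i)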
Entity Importance

The feature \varphi_I models the entity's importance. We compute the time since the entity last received an update, to favor recently updated entities:

\varphi_I(e, t_i) = t_i - \max_{f \in e} \mathit{time}_f,    (5.11)

where t_i is the current time and \mathit{time}_f is the update time of field f; \max_{f \in e} \mathit{time}_f corresponds to the timestamp of the entity's field that was most recently updated.

Given the features explained in the previous section, we employ a supervised ranker to learn the optimal feature weights for retrieval:
\Omega = (\omega_e, \omega_{f_{\text{title}}}, \omega_{p_1^{\text{title}}}, \omega_{p_2^{\text{title}}}, \ldots, \omega_{f_{\text{text}}}, \omega_{p_1^{\text{text}}}, \ldots).    (5.12)

Here, Ω corresponds to the weight vector, which is composed of individual weights (ω) for each of the field features (similarity and importance) and the entity importance feature. To find the optimal Ω, we train a classification-based re-ranking model, and learn from user interactions (i.e., clicks). The model employs the features detailed in the previous section, and the classification's confidence score is used as a ranking signal. As input, the ranker receives a feature vector (x), extracted for each entity–query pair. The associated label y is positive (1) for the entities that were clicked by a user who issued query q. We define x as

x = \{\varphi_{\text{sim}}^1, \varphi_{p_1}^1, \varphi_{p_2}^1, \varphi_{p_3}^1, \varphi_{p_4}^1, \ldots, \varphi_{\text{sim}}^{|e|}, \varphi_{p_1}^{|e|}, \varphi_{p_2}^{|e|}, \varphi_{p_3}^{|e|}, \varphi_{p_4}^{|e|}, \varphi_I\},    (5.13)

where |e| corresponds to the number of fields that make up the entity representation (f ∈ e).

See Algorithm 1 for an overview of our machine learning method in pseudo-code. As input, our supervised classifier is given a set of ⟨q, e, L⟩ tuples (see l. 4). These tuples consist of a query (q), a candidate entity (e), and a (binary) label (L): positive (1) or negative (0). Given an incoming query, we first perform top-k retrieval (see Section 5.4.4 for details) to yield our initial set of candidate entities: E_candidate (l. 8). For each entity, we extract the features detailed in Section 5.3.2 (l. 12) and have the classification-based ranker R output a confidence score for e belonging to the positive class, which is used to rank the candidate entities (l. 13). Labels are acquired through user interactions, i.e., entities that are in the set of candidate entities and clicked after issuing a query are labeled as positive instances (l. 16–22), and used to retrain R (l. 24). Finally, after each query, we allow entities to be updated by dynamic description sources: tweets and tags (l. 26); we provide more details in Section 5.4.3.

5.4 Experimental Setup

In this section, we start by describing our experiments and how they allow us to answer the three subquestions raised in Section 5.1; then we present our baselines and machine learning setting, and describe our evaluation methodology.
Experiment 1. To answer our first subquestion (RQ3.1), does entity ranking effectiveness increase by using dynamic collective entity representations?, we compare our proposed dynamic collective entity ranking method to a baseline that only incorporates KB fields for entity ranking: KBER (Knowledge Base Entity Representations). We restrict both the baseline and our Dynamic Collective Entity Representations (DCER) method to the set of field similarity features (Section 5.3.2), which we denote as KBER_sim and DCER_sim. This allows us to provide clear insights into the contribution of the fields' content to ranking effectiveness. In addition, we perform an ablation study and compare the similarity-only baseline to several approaches that each incorporate content from a single external description source (denoted KB+source_sim).
Algorithm 1 Pseudo-algorithm for learning the optimal Ω_T.

Require: Ranker R, Knowledge Base KB, Entities E
1:  E ← {e_1, e_2, ..., e_|KB|}
2:  e ← {f_title, f_anchors, ..., f_text}
3:  ▷ Initial training set of query–entity–label tuples
4:  L = {⟨q_1, e_1, {0,1}⟩, ⟨q_2, e_2, {0,1}⟩, ..., ⟨q_n, e_m, {0,1}⟩}
5:  R ← Train(L)
6:  ▷ Process the incoming query stream
7:  while q do
8:      E_candidate ← Top-k-retrieval(q)
9:      E_ranked ← []
10:     for e ∈ E_candidate do
11:         ▷ Rank candidates by classifier confidence
12:         φ_e ← Extract-features(e)
13:         E_ranked ← Classify(R, φ_e)
14:     end for
15:     ▷ Acquire labels from user interactions
16:     e_clicked ← Observe-click(q, E)
17:     if e_clicked ∈ E_candidate then
18:         L ← L ∪ {⟨q, e_clicked, 1⟩}
19:         e_clicked ← e_clicked ∪ {text_q}
20:     else
21:         L ← L ∪ {⟨q, e_clicked, 0⟩}
22:     end if
23:     ▷ Retrain the ranker on the expanded training set
24:     R ← Train(L)
25:     ▷ Update entities with dynamic description sources
26:     for e ∈ E do
27:         e ← e ∪ {f_e,tweet_1, f_e,tag_1, ..., f_e,tweet_i, f_e,tag_j}
28:     end for
29: end while

Experiment 2. We address RQ3.2, does entity ranking effectiveness increase when employing field and entity importance features?, by comparing the KBER_sim baseline that only incorporates field similarity features, to the KBER baseline that incorporates the entity and field importance features, and to our DCER method.
Experiment 3. Finally, to answer RQ3.3, does entity ranking effectiveness increase when we continuously learn the optimal entity representations?, we compare our proposed entity ranking method DCER to its non-adaptive counterpart (DCER_na), which we do not periodically retrain. Contrasting the performance of this non-adaptive system with our adaptive system allows us to tease apart the effect of adding more training data and the effect of the additional content that comes from dynamic external description sources. Here too, we include an ablation study, and compare to non-adaptive approaches that incorporate content from a single external description source (denoted KB+source_na).

Due to our focus on dynamic entity representations and adaptive rankers, running our method on datasets from seemingly related evaluation campaigns such as those in TREC [13, 14] and INEX [57–59] is not feasible. We are constrained by the size of datasets, i.e., we need datasets that are sufficiently large (thousands of queries) and timestamped, which excludes the aforementioned evaluation campaigns, and hence direct comparison to results obtained there. In a scenario with dynamic fields and continually changing content, our method is required to reweigh fields continuously to reflect changes in the fields' content. Continually (re-)tuning the parameters of existing fielded retrieval methods such as BM25F [223] when documents in the index change is exceedingly expensive, rendering these methods unsuitable for this scenario.

For these reasons we consider the following supervised baselines in our experiments. KBER_sim is an online learning classification-based entity ranker that employs field similarity features on entity representations composed of KB description sources (i.e., title, text, categories, anchors, redirects, and links fields). KBER is the same baseline system, extended with the full set of features (described in Section 5.3.2). Finally, DCER_na is a non-adaptive baseline: it incorporates all external description sources and all features, as our proposed DCER method does, but it is not periodically retrained.
In our experiments we update the fields that jointly represent an entity with externaldescriptions that come in a streaming manner, from a range of heterogeneous externaldescription sources. This assumes that all data sources run in parallel in terms of time.In a real-world setting this assumption may not always hold as systems need to inte-grate historical data sources that span different time periods and are of different size forbootstrapping. We simulate this scenario by choosing data sources that do not originatefrom the same time period nor span the same amount of time (see also Table 5.1). Toremedy this, we introduce a method for time-aligning all data sources (which range from2009 to 2014) to the timeline of the query log (which dates from 2006). We apply a source-time transformation to mitigate the dependence of content popularity on time andwhen it was created [240, 244]. Each query is treated as a time unit, and we distributethe expansions from the different sources over the queries, as opposed to obeying themisaligned timestamps from the query log and expansion sources.To illustrate: given n queries and a total of k items for a given expansion source,after each query we update the entity representations with nk expansions of that particularexpansion source. In our dataset we have 32,901 queries, 52,631 tweets, and 4,429,692tags. After each query we distribute 1 query, 2 tweets, and 135 tags over the entitiesin the index. Mapping real time to source time evenly spreads the content of each datasource within the timespan of the query log. This smooths out bursts but retains thesame distribution of content over entities (in terms of, e.g., entity popularity). Thediminishing effect on burstiness is desirable in the case of misaligned corpora, as burstsare informative of news events, which would not co-occur in the different data sources.Although our re-aligned corpora cannot match the quality of real parallel corpora, our89 . Retrieving Emerging Entities method offers a robust lower bound to understand the utility of collective dynamic entityrepresentations of our method. We reiterate that our goal in this chapter is to study theeffect of dynamic entity representations and adapting rankers, not to leverage temporalfeatures for improving entity retrieval. We apply machine learning for learning to weight the different fields that make up an entityrepresentation for optimal retrieval effectiveness. In response to a query, we first generatean initial set of candidate entities by retrieving the top- k entities to limit the requiredcomputational resources for extracting all features for all documents in the collection [159].Our top- k retrieval method involves ranking all entities using the similarity functiondescribed in Section 5.3.2, where we collapse the fielded entity representation into a singledocument. We set k = 20 as it has shown a fair tradeoff between high recall (80.1% onour development set) and low computational expense. We choose Random Forests asour machine learning algorithm because it has proven robust in a range of diverse tasks(e.g., [187]), and can produce confidence scores that we employ as ranking signal. In ourexperiments, we set the number of trees to 500 and the number of features each decisiontree considers for the best split to (cid:112) | Ω | . For evaluating our method’s adaptivity and performance over time, we create a set ofincremental time-based train/test splits as in [22]. We first split the query log into K chunks of size N : { C , C , C , . . . 
For evaluating our method's adaptivity and performance over time, we create a set of incremental time-based train/test splits, as in [22]. We first split the query log into K chunks of size N: {C_1, C_2, C_3, ..., C_K}. We then allocate the first chunk (C_1) for training the classifier and start iteratively evaluating each succeeding query. Once the second chunk of queries (C_2) has been evaluated, we expand the training set with it and retrain the classifier. We then continue evaluating the next chunk of queries (C_3). This procedure is repeated, continually expanding the training set and retraining the classifier with N queries (we set N = 500 in our experiments). In this scenario, users' clicks are treated as ground truth, and the classifier's goal is to rank clicked entities at position 1. We do not distinguish between clicks (e.g., satisfied clicks and non-satisfied clicks), and we leave more advanced user models, e.g., models which incorporate skips, as future work.

To show the robustness of our method we apply five-fold cross-validation over each run, i.e., we generate five alternatively ordered query logs by shuffling the queries. We keep the order of the dynamic description sources fixed to avoid conflating the effect of the queries' order with that of the description sources.

Since we are interested in how our method behaves over time, we plot the MAP at each query over the five folds, as opposed to averaging the scores over all chunks across folds (as in [22]) and losing this temporal dimension. In addition to reporting MAP, we report on P@1, as there is only a single relevant entity per query in our experimental setup. We test for statistical significance using a two-tailed paired t-test. Significant differences are marked ▲ for α = 0.05.
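The incremental protocol itself can be summarized as follows. This is a simplified illustration that reuses the hypothetical `extract_features` and `rank_entities` helpers from the previous sketch; the construction of training examples from clicks is reduced to its essentials.

```python
def incremental_evaluation(log, ranker, entity_docs, chunk_size=500):
    """Train on chunk C_1, evaluate the next chunk query by query, then
    fold it into the training set and retrain. `log` is a list of
    (query, clicked_entity) pairs in temporal order."""
    chunks = [log[i:i + chunk_size] for i in range(0, len(log), chunk_size)]

    def to_xy(pairs):
        # One positive example per clicked entity; all other entities act
        # as negatives (a simplification of the real setup).
        X, y = [], []
        for query, clicked in pairs:
            for e in entity_docs:
                X.append(extract_features(query, e))
                y.append(1 if e == clicked else 0)
        return np.array(X), np.array(y)

    train, p_at_1 = list(chunks[0]), []
    for chunk in chunks[1:]:
        ranker.fit(*to_xy(train))
        for query, clicked in chunk:
            ranked = rank_entities(query, entity_docs, ranker)
            p_at_1.append(float(ranked[0] == clicked))
        train += chunk  # expand the training set; retrain next iteration
    return p_at_1  # per-query scores, plotted over time in the chapter
```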
5.5 Results and Analysis

Table 5.2: Performance of field similarity-based entity ranking methods using KB entity representations (KBER_sim) and dynamic collective entity representations (DCER_sim) in terms of MAP and P@1. Significance tested against KBER_sim. Oracle marks upper-bound performance given our top-k scenario.

Run         MAP       P@1
KBER_sim
DCER_sim       ▲         ▲
Oracle      0.6653    0.6653
We report on the experimental results for each of our three experiments in turn, and provide an analysis of the results to better understand the behavior of our method.
In our first experiment, we explore the impact of the description sources we use for constructing dynamic collective entity representations, and aim to answer RQ3.1. We compare the KBER_sim baseline, which incorporates field similarity on KB descriptions, to our DCER_sim method, which incorporates field similarity features on all description sources (web anchors, tweets, tags, and queries). Table 5.2 shows the performance in terms of MAP and P@1 of the baseline (KBER_sim) and DCER_sim after observing all queries in our dataset. We include an oracle run as a performance upper bound given our top-k retrieval scenario.

The results show that the dynamic collective entity representations significantly outperform the KB entity representations on both metrics, and that DCER_sim presents the correct entity at the top of the ranking for over 55% of the queries.

Next, we look into the impact on performance of individual description sources for our dynamic collective entity representations. We add each source individually to the KBER_sim baseline. Figure 5.3 shows how each individual description source contributes to a more effective ranking, with KB+tags narrowly outperforming KB+web as the best single source. Combining all sources into one yields the best results, outperforming KB+tags by more than 3%. We observe that after about 18,000 queries, KB+tags overtakes the (static) KB+web method, suggesting that newly incoming tags yield higher ranking effectiveness. All runs show an upward trend and seem to level out around the 30,000th query. This pattern is seen across ranking methods, which indicates that the upward trend can be attributed to the addition of more training data (queries).

Table 5.3 lists the results of all methods along with the improvement rate (the relative improvement when going from 10,000 queries to all queries). The runs that incorporate dynamic description sources (i.e., KB+tags, KB+tweets, and KB+queries) show the highest relative improvements (at 6.4%, 6.1%, and 6.9%, respectively). Interestingly, for P@1, the KB+queries method yields a substantial relative improvement (+8.9%), indicating that the added queries provide a strong signal for ranking the clicked entities at the top.
Figure 5.3: Impact on performance of individual description sources. MAP on the y-axis, number of queries on the x-axis. The line represents MAP, averaged over 5 folds. Standard deviation is shown as a shade around the line. This plot is best viewed in color.

The comparatively lower learning rates of methods that incorporate only static description sources (KBER_sim and KB+web yield relative improvements of +5.8% and +5.5%, respectively) suggest that the entity content from dynamic description sources effectively contributes to higher ranking performance: as the entity representations change, the ranker is able to reach a new optimum. DCER_sim, the method that incorporates all available description sources, shows a comparatively lower relative improvement, likely because it is hitting a ceiling, leaving little relative improvement to be gained.
Feature Weights over Time
A unique property of our scenario is that, over time, more training data is added to the system, and more descriptions from dynamic sources come in, both of which are expected to improve the performance of our method. To better understand the behavior of our method, we look into the learned feature weights, for both a static approach and our dynamic one, at each retraining interval. We consider as feature importance the average height of a feature when it is used as a split node in one of the trees that make up the random forest. The underlying intuition is that features higher up a decision tree affect a larger fraction of the samples, and hence can be considered more important.

Table 5.3: Comparison of relative improvement between runs with different field similarity features. We report on MAP and P@1 at query 10,000 and the last query. Rate corresponds to the percentage of improvement between the 10,000th and final query. Significance tested against KBER_sim.

Run              MAP(10k)  MAP(end)   Rate     P@1(10k)  P@1(end)   Rate
KBER_sim                              +5.8%
KB+Web_sim                        ▲   +5.5%    0.4965    0.5282 ▲   +6.4%
KB+Tags_sim                       ▲   +6.4%    0.4930    0.5317 ▲   +7.8%
KB+Tweets_sim                     ▲   +6.1%    0.4673    0.5021 ▲   +7.5%
KB+Queries_sim                    ▲   +6.9%                      ▲   +8.9%
DCER_sim                          ▲   +6.2%    0.5178           ▲   +7.6%
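The chapter does not spell out how this is computed; the sketch below shows one plausible way to derive such a depth-based importance from a fitted scikit-learn random forest.

```python
import numpy as np

def mean_split_depth(forest, n_features):
    """Depth-based feature importance: the average depth at which each
    feature appears as a split node, over all trees in the forest.
    Lower values (closer to the root) affect more samples, and hence
    indicate a more important feature."""
    depth_sum = np.zeros(n_features)
    n_splits = np.zeros(n_features)
    for est in forest.estimators_:
        t = est.tree_
        stack = [(0, 0)]  # (node id, depth), starting at the root
        while stack:
            node, d = stack.pop()
            if t.children_left[node] != -1:  # internal (split) node
                stack.append((t.children_left[node], d + 1))
                stack.append((t.children_right[node], d + 1))
                f = t.feature[node]
                depth_sum[f] += d
                n_splits[f] += 1
    return depth_sum / np.maximum(n_splits, 1)
```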
Figure 5.4 shows the weights for six static fields (categories, title, anchors, links, text, redirects) and three dynamic fields (queries, tweets, tags).

The KBER_sim baseline shows little change in the feature weights as more training data is added. The anchors' weight increases slightly, while the text and categories fields' weights steadily decline over time. The latter two features show a steadily decreasing standard deviation, indicating that the ranker becomes more confident in the assigned weights.

For KB+queries it is apparent that the queries field weight increases substantially, as the ranking method receives more training data and the entity representations receive more content from the queries description source. At the same time, the anchors, redirects, and title field weights, which were assigned consistently high weights in KBER_sim, seem to pay for the increase of the importance of queries. Text, category, and links show similar patterns to the baseline run: steadily declining weights, converging in terms of a decreasing standard deviation.

The KB+tags run shows a similar pattern to KB+queries: we observe an increase over time of the weight assigned to the field that holds content from the tag description source, at the cost of the original KB fields, resulting in improved ranking effectiveness. Looking at KB+tweets, however, we observe a different pattern. The tweets field starts out with a very low weight, and although the weight steadily increases, it remains low. Here, too, the ranker becomes more confident in this source's weight, with the standard deviation dissolving over time. Given the higher performance of KB+tweets in comparison to the KBER_sim baseline, together with the field's weight increasing over time, we conclude that the tweets added to the entity representations provide added value.
Figure 5.4: Feature weights over time: the y-axis shows (relative) feature weights, the x-axis shows each chunk of 500 queries at which the ranker is retrained. Starting from the top left in a clockwise direction we show the following runs: KBER_sim (baseline), KB+Queries, KB+Tags, KB+Tweets. The black line shows the dynamic description source's weight. This plot is best viewed in color.
In our second experiment, which aims to answer RQ3.2, we turn to the contribution of the additional field and entity importance features. Table 5.4 lists the results of the best performing run with only field similarity features (DCER_sim), the KBER baseline that incorporates the full set of features, and our proposed DCER method, which likewise incorporates the full set of features.

Table 5.4: Comparison of relative improvement between the KBER baseline with field importance features for KB fields, and our DCER method with these features for all fields. We also show the best performing run without field and entity importance features. Significance tested against KBER.

Run         MAP(10k)  MAP(end)   Rate     P@1(10k)  P@1(end)   Rate
DCER_sim                     ▲   +4.7%              0.5655 ▲   +4.8%
KBER
DCER
Oracle      –         0.6653     –        –         0.6653     –

The results show that modeling field and entity importance significantly improves the effectiveness of both the KBER_sim baseline and the DCER_sim runs. After running through the entire dataset, the performance of
DCER approaches that of the oracle run, which also explains why the difference between the two approaches is rather small here: we are very close to the maximum achievable score.

Figure 5.5 shows the performance of the two approaches over time. The pattern is similar to the one in Section 5.5.1: both lines show a steady increase, while DCER remains the best performing method.

Figure 5.5: Runs with field similarity, field importance, and entity importance features. MAP on the y-axis, number of queries on the x-axis. The line represents MAP, averaged over 5 folds. Standard deviation is shown as a shade around the line. This plot is best viewed in color.
In our third experiment, where we aim to answer RQ3.3, we compare our adaptive ranker (DCER), which is continuously retrained, to a non-adaptive baseline (DCER_na), which is only trained once at the start and is not retrained.

Table 5.5: Comparing relative improvement between runs with and without an adaptive ranker. Statistical significance tested between the KBER and DCER approaches and their non-adaptive counterparts.

Run        Adaptive   MAP         P@1
KBER_na    no         0.5198      0.4392
KBER       yes        0.5579 ▲           ▲
DCER_na    no         0.5872      0.5408
DCER       yes               ▲           ▲

The results in Table 5.5 show that incrementally retraining the ranker is beneficial to entity ranking effectiveness. For both KBER and DCER we see a substantial improvement when moving from a single batch training to continuously retraining the ranker.

To better understand this behavior, we plot the performance of all runs we consider over time in Figure 5.6; Table 5.6 provides the detailed scores. Broadly speaking, we observe similar patterns between adaptive and non-adaptive methods, and we identify three interesting points. First, for the non-adaptive methods, the absolute performance is substantially lower across the board. Second, for the adaptive methods,
the standard deviation (as shown in Figures 5.3 and 5.6) is substantially lower, which indicates that retraining increases the ranking method's confidence in optimally combining the different descriptions for the entity representation. Third, the learning rates in the adaptive setting are substantially higher, reaffirming our observation that learning a single global feature weight vector is not optimal.

Table 5.6 shows that the difference in learning rates between methods that only incorporate static description sources (KBER, KB+web) and methods that incorporate dynamic sources is pronounced, in particular for the tags and queries sources. This indicates that, even with fixed feature weights, the content that comes in from the dynamic description sources yields an improvement in entity ranking effectiveness. Finally, the methods that only incorporate static description sources also show lower learning rates than their adaptive counterparts, which indicates that retraining and adapting feature weights is desirable even with static entity representations.

Table 5.6: Comparing relative improvement between non-adaptive runs. Significance tested against KBER_na.

Run             MAP(10k)  MAP(end)   Rate     P@1(10k)  P@1(end)   Rate
KBER_na
KB+Web_na                        ▲   +3.3%    0.4698    0.4829 ▲   +2.8%
KB+Tags_na                       ▲   +4.7%    0.4671    0.4904 ▲   +5.0%
KB+Tweets_na                     ▲   +3.8%    0.4334    0.4490 ▲   +3.6%
KB+Queries_na                    ▲   +7.1%    0.4659    0.5090 ▲   +9.2%
DCER_na                          ▲   +5.8%    0.5063    0.5408 ▲   +6.8%

Figure 5.6: Individual description sources with non-adaptive ranker. MAP on the y-axis, number of queries on the x-axis. The line represents MAP, averaged over 5 folds. Standard deviation is shown as a shade around the line. This plot is best viewed in color.
5.6 Conclusion

In this chapter, we addressed the entity ranking task, aiming to improve it by enriching entity representations with entity descriptions collected from a wide variety of sources. We set out to answer the following question:
RQ3 Can we leverage collective intelligence to construct entity representations for increased retrieval effectiveness of entities of interest?

We have shown an effective way of leveraging collective intelligence to construct entity representations for increased retrieval effectiveness of entities of interest. To answer RQ3 we formulated and answered (in Section 5.5) three subquestions. First, we aim to answer:
RQ3.1 Does entity ranking effectiveness increase by using dynamic collective entity representations?

We demonstrate that incorporating dynamic description sources into dynamic collective entity representations enables a better matching of users' queries to entities, resulting in an increase in entity ranking effectiveness. More specifically, we show how each individual description source contributes to a more effective ranking (see Figure 5.3). Looking at the contribution of individual description sources, we see that the highest relative improvements come from dynamic description sources, and social tags prove to be the strongest individual signal. In addition, we study how the ranker adapts the individual weights of fields (reflecting the relative importance of expansion sources) in Section 5.5.1, and see how dynamic description sources gain higher importance as new descriptions come in.

Next, we answer subquestion RQ3.2:
RQ3.2 Does entity ranking effectiveness increase when employing additional field and entity importance features?

We show that informing the ranker about the expansion state of the entities, i.e., employing field and entity importance features, further increases ranking effectiveness.

Finally, we answer:
RQ3.3 Does entity ranking effectiveness increase when we continuously learn the optimal way to combine the content from different description sources?

We compare an adaptive ranker (i.e., one that we periodically retrain as descriptions come in) to a static one. The results show how retraining the ranker leads to improved ranking effectiveness with dynamic collective entity representations. We also show that even the static ranker improves over time, suggesting that it too benefits from newly incoming descriptions, if not as much as the adaptive method.

The findings in this work have several implications in the discovery context. We have shown that we can effectively leverage the feedback of searchers (i.e., their queries and clicks) to improve retrieval effectiveness, which may be beneficial in the discovery process. Furthermore, we have shown that we can effectively combine heterogeneous information related to an entity of interest into a single representation, which suggests our method can be useful in scenarios where data is available from many different sources. Consider, e.g., the scenario where emails and attachments in various formats need to be considered, which is not uncommon in E-Discovery [197]. By combining content that is explicitly linked (e.g., attachments to email addresses), we have shown that, regardless of the nature of the data, our ranking method can effectively incorporate the additional data for improved retrieval effectiveness.

The work presented in this chapter is not without limitations, however. Our study of the impact of dynamic collective entity representations was performed in a controlled scenario, where we collect and leverage different data sources. One restriction of working with these freely available resources is that it proves hard to find aligned and sizeable datasets. In particular, the temporal misalignment between different corpora prevents the analysis of temporal patterns that may span across sources (e.g., queries and tweets showing similar activity around entities when news events unfold). In part, these restrictions can be circumvented in future work, e.g., by increasing the (comparatively) low number of tweets through enrichment with, e.g., entity linking methods [176], where entities identified in tweets could be expanded. Additional challenges and opportunities may arise when increasing the scale of the data collections. Opportunities may lie in exploiting session or user information for more effective use of user interaction signals. Challenges include so-called "swamping" [223] or "document vector saturation" [138], i.e., entity drift, which is more prone to happen as the size of the data collections increases.

A natural extension to this work is to study this problem in a realistic scenario, with aligned datasets of sufficient size, which would allow us to study the impact of "swamping" entities in more detail, as well as provide a more fine-grained analysis of the impact of the importance features. More importantly, it would allow us to study the link between the temporal patterns of entities (as studied in Chapter 3) and their retrieval effectiveness. Novel features that take into account the novelty in fields (e.g., the number of terms that were not in the original entity description) and their diversity could prove useful in adapting to entity drift.
Furthermore, whilst we here consider the changing entity representations by their impact on retrieval effectiveness, a follow-up study may investigate the changes in entity representations on a qualitative level, e.g., by studying per field and/or entity the type of additional descriptions (original vs. novel), the rate at which representations change, etc.

Part II

Analyzing and Predicting Activity from Digital Traces
Analyzing and Predicting Email Communication

"I speak to everyone in the same way, whether he is the garbage man or the president of the university."
—Albert Einstein
In this first chapter of this part of the thesis, we present our first case study into predicting activities of entities of interest from digital traces. More specifically, we propose a method for predicting the recipient of an email, leveraging both the contexts in which the digital traces are created and their content.

Knowing the likely recipients of an email can be helpful from a user perspective (e.g., for proactively suggesting recipients), but also in the context of discovery. Reliably predicting communication flows may find applications in discovering entities of interest: e.g., by comparing actual communication flows to "expected" (i.e., predicted) flows, one can identify communication that is "unexpected," which can be a valuable signal in the context of digital forensics [200].

Despite the huge increase in the use of social media channels and online collaboration platforms, email remains one of the most popular ways of (online) communication. Email traffic has been increasing steadily over the past few years, and recent market studies have projected its continued growth for the years to come [212]. Recipient recommendation methods aim at providing the sender of an email with the appropriate predicted recipients of the email that is currently being written. These methods furthermore allow us to gain a better understanding of communication patterns in enterprises, potentially revealing underlying structures of communication networks.

In this chapter, we focus on recipient prediction in an enterprise setting, as it allows us to leverage the full content and structure of the communication network, as opposed to strictly local (ego network) approaches that are restricted to only the sender's direct network. Furthermore, enterprise email databases are common objects of study in the E-Discovery scenario, as can be seen in the TREC Legal track, which aims to closely resemble a real-world E-Discovery task [46, 117]. Henseler [118] likewise studied email in an E-Discovery setting, and found that combining keyword search with communication pattern analysis reduces the amount of information reviewers need to process.

We propose a novel hybrid generative model, which incorporates both the communication graph structure and email content signals for recipient prediction. Our model predicts recipients from scratch, i.e., without assuming seed recipients for prediction. Furthermore, our model can quickly deal with updates in both the communication graph and the profiles of recipients when new emails are sent around.

Our main research question revolves around the contribution of the context of digital sources (i.e., social context or organizational structure) and the content of digital sources (i.e., the actual email content) towards the effectiveness of predicting communication. More specifically, this chapter aims to answer RQ4:
RQ4 Can we predict email communication through modeling email content and communication graph properties?

On top of that, we investigate how we can optimally estimate the various components of our model. We train our recipient prediction model on one email collection (the Enron email collection) and test it on another (the Avocado email collection).

Our main finding is that a combination of the communication graph and email content outperforms the individual components. We obtain optimal performance when we incorporate in our model the number of emails a recipient has received so far, and the number of emails a given sender sent to a recipient at that point in time. Other options, like using PageRank as a recipient prior, do not improve performance. Our findings show that both the context and content of digital sources contribute to prediction effectiveness.
To model the context in which communication happens, we construct a communication graph from all emails sent by users in our email collections. This graph implicitly captures the (social or professional) relations between users in an email network, and hence models an aspect of the context of communication.

We consider the email traffic as a directed graph G, consisting of a set of vertices and arcs (directed edges) ⟨V, A⟩. Each vertex v ∈ V represents a distinct email address in the corpus (i.e., a sender S or recipient R in terms of our modeling in Section 6.3), and the arcs a ∈ A that connect them represent the communication between the two corresponding addresses (i.e., emails exchanged), directed from the sender node to the recipient node(s). The arcs are weighted by the number of emails sent from one user to the other.

The communication graph allows us to model the network and interactions by considering several well-established graph-based metrics that represent contextual signals, such as the proximity between emailers (i.e., connection strength), and larger structures within the network of individuals. One example is to measure a user's relative importance in the communication graph through her PageRank score. The PageRank algorithm measures a user's (node's) relative importance through its connected arcs and their corresponding weights. In this model, each arc is a "vote of support," and thus users with a larger number of interactions receive a higher PageRank score [201].

We update the communication graph after each email that is sent, i.e., each email either adds a new arc to the graph, or updates the weight of an existing one. We describe the utility of our communication graph for recipient prediction in Section 6.3.
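As an illustration, such a weighted, directed communication graph can be maintained with networkx (an assumed library choice; the chapter does not name its implementation):

```python
import networkx as nx

G = nx.DiGraph()

def record_email(sender, recipients):
    """Update the communication graph for one sent email: add an arc per
    recipient, or increment the weight of an existing arc."""
    for r in recipients:
        if G.has_edge(sender, r):
            G[sender][r]["weight"] += 1
        else:
            G.add_edge(sender, r, weight=1)

record_email("alice@example.com", ["bob@example.com", "carol@example.com"])
record_email("bob@example.com", ["alice@example.com"])

# Edge-weighted PageRank as a node-importance signal (the `pr` prior below).
pagerank = nx.pagerank(G, weight="weight")
```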
6.3 Modeling

We propose a generative model that is aimed at calculating the probabilities of recipients given the sender and the content of the email. Instead of predicting a single recipient, we cast the task as a ranking problem in which we try to rank the appropriate recipients as high as possible.

More formally, let R be a candidate recipient, S the sender of an email, and E the email itself. Our final ranking will be based on the probability of observing R given S and E: P(R | S, E). We use Bayes' Theorem to rewrite this probability:

    P(R | S, E) = [P(R) · P(S | R) · P(E | R, S)] / [P(S) · P(E | S)].    (6.1)

We can explain equation 6.1 as follows: the "relevance" of a recipient is determined by (i) her prior probability (how likely this person is to receive email in general, P(R) in equation 6.1), (ii) the probability of observing the sender with this particular recipient (P(S | R) in equation 6.1), and (iii) the likelihood of this email being generated from communication between the recipient and the sender (P(E | R, S) in equation 6.1). To obtain the final probability, we normalize using the prior probability of the sender and the likelihood of observing this email given its sender (P(S) · P(E | S) in equation 6.1).

For ranking purposes we can ignore P(S) and P(E | S), which will be the same for all recipients. Our final ranking function is displayed in equation 6.2:

    P(R | S, E) ∝ P(E | R, S) · P(S | R) · P(R).    (6.2)

In the next three sections we explain how we estimate the three components of the model: the email likelihood (P(E | R, S)), the sender likelihood (P(S | R)), and the recipient prior (P(R)).

We have several options when it comes to estimating P(E | R, S). We could, for example, incorporate individual emails as latent variables. However, in this chapter we opt to directly estimate the email likelihood using the terms in the email (viz. equation 6.3):

    P(E | R, S) = ∏_{w ∈ E} [λ P(w | R, S) + γ P(w | R) + β P(w)].    (6.3)

In this estimation, P(w | R, S) indicates the probability of observing a term w in all emails exchanged between S and R. To prevent zero probabilities, we smooth this probability with the term probability in all emails sent and received by R (i.e., P(w | R) in equation 6.3) and the term probability over the whole collection (i.e., P(w) in equation 6.3). In other words, we apply unigram language modeling for all the emails between S and R, which we smooth by the unigram LM of R (i.e., both incoming and outgoing emails), and the unigram LM of the entire background corpus (i.e., all emails in the collection up to that point in time). See also Figure 6.1.

Figure 6.1: The three unigram LMs which we model from email communication. Here, the red node represents the email sender (S), the dark grey node the (candidate) recipient node(s) (R), and the light grey nodes the remaining nodes in graph G. The figure shows which emails are used to estimate P(E | R, S) (a.), P(w | R) (b.), and P(w) (c.).

We introduce three parameters, λ, γ, and β, that represent the relative importance of each of the three components of equation 6.3, with λ + γ + β = 1, to combine the three term probabilities.

We calculate each of the three term probabilities using the maximum likelihood estimate, i.e., P(w | ·) = n(w, ·) / |·|, the frequency of term w in the set of documents divided by the length of this set in number of terms.

We move to the estimation of P(S | R), the likelihood of observing the sender for a given recipient. Here, we use the communication graph constructed from email exchanges. The closer a recipient is in this graph and the stronger her connection to the sender S, the more likely it is that the two "belong together." We estimate this connection strength in two ways: by considering (i) the frequency (freq), or the number of emails S sent to R at that point in time, and (ii) co-occurrence (co), or the number of times S and R co-occur as addressees in an email. More specifically, the frequency-based probability is defined as:

    P_freq(S | R) = n(e, S → R) / Σ_{S' ∈ S} n(e, S' → R),    (6.4)

where n(e, x → R) indicates the number of emails sent from x to R, and S is the set of all senders in the graph at the current point in time. The co-occurrence-based probability is defined as:

    P_co(S | R) = n(e, → R, S) / [n(e, → R) + n(e, → S)],    (6.5)

where n(e, → R, S) corresponds to the number of emails sent to both R and S (i.e., the number of emails of which S and R are both recipients), and n(e, → X) the number of emails sent to X.
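A minimal sketch of the interpolated estimate in equation 6.3, where each language model is reduced to a (term counts, total tokens) pair; the mixture weights shown are arbitrary illustrative values, not the tuned ones:

```python
from collections import Counter

def lm_probability(word, counts, total):
    # Maximum likelihood estimate: n(w, .) / |.|
    return counts[word] / total if total else 0.0

def email_likelihood(email_tokens, pair_lm, recipient_lm, background_lm,
                     lam=0.5, gamma=0.3, beta=0.2):
    """Equation 6.3: P(E|R,S) as a product over the email's terms of a
    linear interpolation of three unigram LMs (sender-recipient emails,
    all of R's emails, and the background collection). The weights here
    are illustrative; lam + gamma + beta must sum to 1."""
    assert abs(lam + gamma + beta - 1.0) < 1e-9
    likelihood = 1.0
    for w in email_tokens:
        likelihood *= (lam * lm_probability(w, *pair_lm)
                       + gamma * lm_probability(w, *recipient_lm)
                       + beta * lm_probability(w, *background_lm))
    return likelihood

# Each LM is (term counts, total tokens), built from the corresponding
# set of emails up to the current point in time.
pair = (Counter("budget meeting notes".split()), 3)
recp = (Counter("budget meeting notes lunch friday".split()), 5)
bg   = (Counter("budget meeting notes lunch friday hello world".split()), 7)
print(email_likelihood("budget meeting".split(), pair, recp, bg))
```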
Finally, we introduce a recipient prior, that is, the email-independent probability that a recipient will be observed. This probability is unrelated to both the email at hand (E) and the sender of that email (S), and can be estimated without knowing these two variables. Again, we can choose from a variety of ways to estimate this prior probability, but we stick to two obvious choices. First, we use the number of emails received by R, normalized by the total number of emails sent at that point in time (rec). This estimation indicates how likely it is that any given email would be sent to this recipient. Second, we calculate recipient R's PageRank score (pr) as an indication of how important R is in the communication graph.

6.4 Experimental Setup

To study the contribution of the context in which digital traces are created and of their content, we evaluate our recipient recommendation model using a realistic experimental setup. Our experiments aim at demonstrating the prediction effectiveness of the individual components of our model, i.e., the communication graph (CG) component and the email content (EC) component, and their combination (CG+EC) as in equation 6.3. We use real enterprise email databases to evaluate our model. We optimize our models on the Enron email collection [141], and test them on the Avocado collection, which consists of the email boxes of employees of an IT firm that developed products for the mobile Internet market [196]. The two collections are described in Table 6.1.

Table 6.1: Summary of the Enron and Avocado email collections. We list the time span in months (Period), the total number of emails (Emails), the total number of employee addresses (Addr.), and the average number of emails sent (S/p) and received (R/p) per address.

           Period   Emails    Addr.   S/p   R/p
Enron      45       252,424   6,145   34    294
Avocado    58       607,011   2,068   174   321

For both collections we follow the same method to select the set of users for evaluation. We first split users into three groups based on their email activity: high activity, medium activity, and low activity. This way we can study the correlation between a user's level of activity and the model's performance. As email networks typically show a long-tailed distribution [171], with a small number of users responsible for a large volume of the sent mails, and a large number of users responsible for a small volume, we define a user's activity by taking the logarithm of the number of sent emails. We prune users that have fewer than 100 sent mails, compute the resulting distribution's mean (µ) and standard deviation (σ), and split the distribution into three bins: (i) users with low activity (LOW) are those that fall below µ − σ, (ii) medium active users (MED) are those between µ − σ and µ + σ, and (iii) highly active users (HIGH) are the ones over µ + σ. From each bin we randomly sample 50 users, which results in our final evaluation set of 150 users.
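A sketch of this binning procedure, assuming a hypothetical `sent_counts` mapping from user to number of sent emails:

```python
import numpy as np

def activity_bins(sent_counts, min_sent=100):
    """Split users into LOW/MED/HIGH activity bins using the logarithm of
    their sent-mail counts: LOW below mu - sigma, HIGH above mu + sigma,
    and MED in between, as described above."""
    log_activity = {u: np.log(n) for u, n in sent_counts.items()
                    if n >= min_sent}
    values = np.array(list(log_activity.values()))
    mu, sigma = values.mean(), values.std()
    bins = {"LOW": [], "MED": [], "HIGH": []}
    for user, v in log_activity.items():
        if v < mu - sigma:
            bins["LOW"].append(user)
        elif v > mu + sigma:
            bins["HIGH"].append(user)
        else:
            bins["MED"].append(user)
    return bins  # sample 50 users per bin for the evaluation set
```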
Figure 6.2: Our evaluation methodology: an initial construction period is used to generate the language models and build the communication graph. During the subsequent testing period we select emails for which we predict recipients (represented by the dotted lines). We continue building the communication graph and updating the language models during the testing period.
Before we start predicting email recipients, we allow our model to gather evidence from all email communication up to that point. More specifically, we use an initial construction period to generate the users' language models and the communication graph. We start to predict recipients in the subsequent testing period. During both periods our model is updated for each sent email. We split each user's period of activity (starting from the user's first sent email, up to the last sent mail) into the construction period, covering the first part of the emails, and the testing period, covering the remainder (see also Figure 6.2).

For each sender in our user evaluation set, we select 10 emails, evenly distributed over the testing period, as evaluation points. For each of these emails we rank the top 10 recipients and compare them to the actual recipients of the email. We report on mean average precision (MAP), as it allows us to identify improvements in the ranking of recipients. We indicate the best performing model in boldface and test statistical significance using a two-tailed paired t-test. Significant differences are marked ▲/▼ for α = 0.01 and △/▽ for α = 0.05.

6.5 Results and Analysis

We use the Enron collection to tune the parameters λ, γ, and β in equation 6.3, and to determine which methods for estimating the sender (P(S | R)) and recipient (P(R)) likelihoods work best. The results of the parameter tuning are displayed in Tables 6.2 and 6.3. The final settings we use for testing our model on the Avocado collection are the best-performing values of λ, γ, and β from the sweep; we estimate P(S | R) using the number of emails S sent to R, P_freq(S | R), and we estimate P(R) using the number of emails received by R, P_rec(R).

Table 6.2: System performance (MAP) on the Enron dataset over different methods for estimating P(R) and P(S | R) (Section 6.3).

        P_pr(R), P_co(S|R)   P_pr(R), P_freq(S|R)   P_rec(R), P_co(S|R)   P_rec(R), P_freq(S|R)
LOW
MED
HIGH
ALL

Table 6.3: System performance (MAP) on the Enron dataset over a parameter sweep for the parameters λ, γ, and β = 1 − (λ + γ) in equation 6.3, with a step size of 0.2.
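For illustration, the grid behind such a sweep can be enumerated as follows (the evaluation itself is left as a placeholder):

```python
import itertools

# The grid of Table 6.3: lambda and gamma in steps of 0.2, with
# beta = 1 - (lambda + gamma); only non-negative combinations are valid.
steps = [i / 5 for i in range(6)]  # 0.0, 0.2, ..., 1.0
grid = [(lam, gam, round(1 - lam - gam, 1))
        for lam, gam in itertools.product(steps, repeat=2)
        if lam + gam <= 1.0 + 1e-9]

for lam, gam, beta in grid:
    # Placeholder: evaluate MAP on the Enron development data with these
    # interpolation weights and keep the best-performing triple.
    pass
```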
We compare the found optimal settings for the CG and EC components to their combination, using our training collection (Enron). Table 6.4 shows that our hybrid model significantly outperforms either of the single models across all groups of users in the Enron collection, even if the performance increase is modest. The highest performance is achieved in the LOW and MED user groups: lower user activity correlates positively with performance.

We present the results of our final experiments on the Avocado set in Table 6.4. Compared to the results on the Enron set, which we used for tuning our parameters, our model's performance is higher throughout on the Avocado set, both across the different models and within each subgroup of users. An indication for this difference in absolute performance scores comes from the collection statistics in Table 6.1. Here we see that the Avocado set contains fewer unique addressees, spans a longer period of time, and contains a larger number of emails per person. As a possible factor contributing to our models' higher performance on the Avocado collection, we point to the fact that our model has more data available to leverage for ranking a smaller number of candidates. While different datasets may need different models, the consistently high scores show that the components work in isolation and in combination over different datasets.
Table 6.4: System performance (MAP) on Enron and Avocado. We test for statistically significant differences of the individual models (CG and EC) against the combined model (CG+EC).

         Enron                        Avocado
         CG         EC      CG+EC    CG      EC      CG+EC
LOW      0.4365 ▼
MED      0.3857 ▼
HIGH     0.2060 ▼

Similar to what we saw in the experiment on the Enron collection, higher user activity generally seems to result in lower performance on the Avocado dataset (Table 6.4). The CG model is an exception and outperforms our combined model for highly active users. The combined model achieves significant performance improvements over the content model in each subset of users.

To better understand these patterns, we turn to the characteristics of the different user subgroups. We plot the users' numbers of emails (both sent and received), indicating their activity, and juxtapose it to the size of their egonet, which corresponds to the set of directly connected neighbors in the communication graph [5]. This egonet represents the users they interact(ed) with, and is indicative of a user's reach or embedding inside the communication graph.
The resulting plot is shown in Figure 6.3. There is a clear clustering: highly active users (orange squares), who send and receive a large number of emails, also have a larger number of people they interact with. While more textual content allows the generative model to create richer recipient profiles, in turn enabling more informed recipient ranking, there is a catch to a larger egonet too. The sender-recipient communication smoothing in our generative model results in a larger number of high-scoring candidates for highly active users. This makes it more difficult for the ranker to discern the true recipient(s) from the larger pool.

Figure 6.3: The three user groups in Avocado, showing each user's activity (y-axis) against the size of their egonetwork (x-axis).

Motivated by the fact that our model is updated at each sent email, we study its performance over time. To get an indication of whether our model's performance improves or deteriorates over time, we apply linear regression on the data points of each model, and fit a trend line. See Table 6.5 for the trend line slopes for each individual component and the combined model.

Table 6.5: Trend line slopes after applying linear regression to the individual and combined components' data.

Model                       Slope
Email Content (EC)          (positive)
Communication Graph (CG)    (negative)
Combined (EC+CG)            (positive)

Both the EC and combined model's trend lines have positive slopes, whilst the CG model has a negative slope. This indicates that our language modeling approach benefits from the larger amount of textual content it receives for each recipient over time, allowing the generation of richer recipient profiles for better email likelihood estimations. The CG approach, on the other hand, deteriorates over time, suffering from the increased size and complexity of the communication graph. We note that, as time progresses, the communication graph model "settles in," and becomes less likely to pick up on changing balances or shifting communication patterns in the communication graph. For future work we argue for a time-aware model that is able to adapt to shifts in the communication graph over time, by taking recency into account.
To analyze the performance of our combined model in comparison to the isolated ones, we look at their rankings and compute the Kendall tau rank correlation coefficient (τ) between pairs of models. The top plot in Figure 6.4 shows that the agreement between CG and EC is relatively low, centering around the 0 mark with an average of 0.0471. This pattern largely coincides with that of the second plot, which depicts the agreement between CG and our combined model. The average correlation coefficient is only slightly higher, at 0.0562. Finally, the agreement is highest between the EC model and our combined model, at on average 0.7425. The high agreement offers an explanation for the comparatively low performance of our combined model in the HIGH subset of the Avocado set. The EC model's comparatively low performance (as seen in Table 6.4) indicates that the combined model is negatively affected by following EC's incorrect rankings.

Figure 6.4: Kendall's τ over time, between CG and EC (orange), CG and CG+EC (green), and EC and CG+EC (blue).
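A minimal sketch of this agreement computation, using SciPy's implementation of Kendall's τ on two models' scores for the same candidate list:

```python
from scipy.stats import kendalltau

def ranking_agreement(scores_a, scores_b):
    """Kendall's tau between two models' rankings of the same candidate
    recipients, given each model's score for every shared candidate."""
    tau, _ = kendalltau(scores_a, scores_b)
    return tau

# Example: per-candidate scores from CG and EC for one email's candidates.
print(ranking_agreement([0.9, 0.4, 0.2, 0.1], [0.3, 0.8, 0.5, 0.1]))
```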
6.6 Conclusion

In this chapter, we presented a novel hybrid model for email recipient prediction that leverages both the information from an email network's communication graph and the textual content of the messages. Our model starts from scratch, in that it does not assume or need seed recipients, and it is updated for each email sent. The proposed model achieves high performance on two email collections. We have answered RQ4:

RQ4 Can we predict email communication through modeling email content and communication graph properties?

We show that the communication graph properties and the email content signals each provide a strong baseline for predicting recipients, but are complementary, i.e., combining both signals achieves the highest accuracy for predicting the recipients of email.

We have shown that the number of received emails is an effective method for estimating the prior probability of observing a recipient, and that the number of emails sent between two users is an effective way of estimating the "connectedness" between these two users, which proves to be a helpful signal in ranking recipients.

An implication of the findings in this chapter is that both the context in which digital traces are created and the content of the digital traces are valuable signals in predicting real-world behavior, in the case of email communication. This has implications both for designing systems to predict recipients and for analyzing and understanding communication flows in an enterprise, e.g., in a discovery scenario where one may want to discover "unexpected" communication.

The work presented in this chapter has several limitations. First, from a modeling perspective, we identified characteristic weaknesses in our individual models' robustness to specific circumstances. As witnessed by the decrease in performance for highly active users, our email content model seems unfit to deal with users that exchange a large number of emails with a large number of people. The deteriorating performance over time shows that our communication graph model's performance suffers as graphs expand and develop.

To address these issues, one possible direction for future work is to incorporate time into our model. We can, for example, use decay functions to weigh the edges between users and promote more recent communication. Similarly, we can use time-dependent language models that favor recent documents. Whereas in this work we chose to keep a simple, and hence interpretable, model, a machine-learned method may improve prediction accuracy, e.g., by using the signals we employed as features. Finally, the presented model can be applied to novel tasks in future work, e.g., role or anomaly detection in email corpora, or leak prevention.

In the next chapter, we take a similar approach to the work presented in this chapter. More specifically, we analyze a novel dataset of interaction logs, and address a prediction task. Much like the recipient recommendation task in this chapter, we study the contribution of the context and the content of the digital traces, with the goal of predicting real-world activity of our entities of interest.
Analyzing and Predicting Task Reminders

"The advantage of a bad memory is that one enjoys several times the same good things for the first time."
—Friedrich Nietzsche
Our second case study into the link between digital traces and real-world behavior or activities of entities of interest revolves around mobile devices. Mobile devices are an especially interesting source of digital traces, as the closeness, portability, and popularity of our smartphones means they are part of most aspects of our daily lives. Mobile devices hence take a central role as potential "containers for evidence" in the E-Discovery scenario [40]. Smartphones may carry signals of our whereabouts, context, and activities in many different forms, e.g., from emails, (voice) messages, call and SMS logs, to browser data and other data logs [16].

A relatively new source of digital traces that can provide rich insights into users' activities and behavior are automated personal assistants. These intelligent assistants have been introduced in major online service offerings and mobile devices over the last several years. Systems such as Siri, Google Now, Echo, M, and Cortana support a range of reactive and proactive scenarios, ranging from question answering to alerting about plane flights and traffic. The embedding of these personal assistants in our day-to-day life makes them a rich potential resource of "digital evidence," i.e., digital traces that can be employed for finding evidence of activity in the real world [197].

Several of these personal assistants provide reminder services aimed at helping people remember plans for future tasks they may otherwise forget. Potentially representing the user's (future) location, activity, or planned tasks, interaction logs of these types of reminder services could serve as a valuable signal in an E-Discovery scenario. In this chapter, we study a large-scale log of user-created reminders from Microsoft Cortana, to better understand the context under which digital traces are produced, and to study whether we can predict the (planned) behavior of the entities of interest of Part II: the producers of digital traces.

Motivated by the findings in the previous chapter, in this chapter too we study the impact of the contexts in which the traces are created. Here, we consider the creation time of a reminder as the context. In addition, we study the predictive power of the textual content of the digital traces, by incorporating the reminder text for predicting the reminder task's execution time.
Table 7.1: Example interaction sequence for setting a reminder.

Turn   Who   Text

In this chapter, we aim to answer the question: "Can we identify patterns in the times at which people create reminders, and, via notification times, when the associated tasks are to be executed?"
7.1 Reminder Types

We first investigate user behavior around reminder creation by studying the common tasks linked to setting reminders. In this section, we focus on the question: "Is there a body of common tasks that underlie the reminder creation process?" Understanding the common tasks allows us to better see common usage patterns, which will help in building predictive models and in finding unexpected behavior, e.g., through outlier detection. To answer this question, we perform a data-driven qualitative analysis of Cortana reminder logs. Specifically, we extract reminders that are observed frequently and across multiple users. Then, we employ a manual labeling strategy, and categorize the reminders in a task taxonomy to better understand the task types that drive users to create reminders.
Before we analyze the different tasks at the root of reminder setting, we describe the composition of a typical reminder. In the left column of Table 7.2, we present three examples of common reminders. The examples show a structure that is frequently observed in the logged reminders. Reminders are typically composed as predicate sentences: they contain a phrase related to an action that the user would like to perform (typically a verb phrase) and a referenced object that is the target of the action to be performed.
Table 7.2: Example reminders as predicates.

Reminder                                    Predicate
"Remind me to take out the trash"           Take out (me, the trash)
"Remind me to put my clothes in dryer"      Put (me, clothes in dryer)
"Remind me to get cash from the bank"       Get (me, cash from the bank)
A session for setting a reminder consists of a dialog, where the user and the intelligent assistant interact in multiple turns. Typically, the user starts by issuing the command for setting a reminder, and dictates the reminder. Optionally, the user specifies the reminder's notification time. Next, the assistant requests to specify the time (if the user has not yet specified it), or provides a summary of the reminder, i.e., the task description and notification time, asking the user to confirm or change the proposed reminder (see Table 7.1).

We analyze a sample of two months of Cortana reminder logs, spanning all of January and February 2015. We ensure the data used in this chapter preserves user privacy: we filter the data to patterns commonly observed across multiple users, and study behavior in aggregate, refraining from studying individual patterns. In summary, we are not looking at any user's behavior at the individual level, but across a large population, to uncover broad and more general patterns.

We pre-process this set of reminders by including only reminders from the US market (the only market which had Cortana enabled on mobile devices at that time). To narrow the scope of our analysis, we focus on time-based reminders and remove location-based (e.g., "remind me to do X when I am at Y") and person-based reminders (e.g., "remind me to give X when I see Y"), which are less common and more challenging to study across users due to their personal nature. Finally, we retain only reminders that are confirmed by the user (turn 6 in Table 7.1). The resulting sample contains 576,080 reminders from 92,264 users. For each reminder, we extract the reminder task description and notification time from Cortana's summary (turn 4 in Table 7.1). We also extract the creation time based on the local time of the user's device. Each reminder is represented by:

r_task  The textual task description of the reminder, i.e., the phrase which encodes the future task or action to be taken, as dictated by the user. We extract the text from Cortana's final summary response ("do the laundry" from turn 4 in Table 7.1).

r_CT  The creation time of the reminder. This represents the time at which the user encodes the reminder. We represent r_CT as a discretized time value; Section 7.3.1 defines the discretization process. We extract this timestamp from the client's device.

r_NT  The notification time set for the reminder to fire an alert. This represents the time at which the user wishes to be reminded about a future task or action. We represent r_NT in the same discretized manner as r_CT. We extract the notification time from Cortana's summary response (turn 4 in Table 7.1).

r_ΔT  Subtracting the creation time from the notification time yields the time delta, the delay between the reminder's creation and notification time. Intuitively, reminders with smaller time deltas represent short-term or immediate tasks ("remind me to take the pizza out of the oven"), whereas reminders with larger time deltas represent tasks planned further ahead in time ("remind me to make a doctor's appointment").

To understand the common needs that underlie the reminder creation process, and hence common user behavior, we first identify common reminders, i.e., reminders that are frequently observed across multiple users. Studying common reminders can aid analysts in understanding common usage patterns, and hence can help in identifying patterns or reminders that diverge.
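For illustration, a single logged reminder could be represented as follows; the field names mirror the notation above but are otherwise hypothetical, as the actual log schema is not public:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Reminder:
    """One logged reminder, following the representation above."""
    task: str                     # r_task: dictated task description
    creation_time: datetime       # r_CT: when the reminder was encoded
    notification_time: datetime   # r_NT: when the alert is set to fire

    @property
    def time_delta(self):
        """r_delta_T: delay between creation and notification."""
        return self.notification_time - self.creation_time

r = Reminder("do the laundry",
             datetime(2015, 1, 12, 18, 30),
             datetime(2015, 1, 12, 20, 0))
print(r.time_delta)  # 1:30:00, i.e., a short-term task
```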
In addition, common usage patterns can aid system designers in creating predictive models. We employ a mixed-methods approach, comprising data-driven and qualitative methodologies, to extract and identify common task types.
Frequent Task Description Extraction
First, we extract common task descriptions, by leveraging the predicate (verb+object) structure described at the start of this section. To ensure the underlying task descriptions represent broad tasks, we filter to retain only descriptions that start with a verb (or a multi-word phrasal verb) that occurs at least 500 times, across at least ten users, with at least five objects. This yields a set of 52 frequent verbs, which covers 60.9% of the reminders in our sample. The relatively small number of verbs that cover the majority of reminders in our log indicates there likely exists a large 'head' of common tasks that give rise to reminder creation. To analyze the underlying tasks, we include the most common objects, by pruning objects observed fewer than five times with a verb. This yields a set of 2,484 unique task descriptions (i.e., verb+object pairs), covering 17.9% of our sample log.
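A sketch of this filtering step, assuming reminders have already been parsed into (user, verb, object) triples:

```python
from collections import Counter, defaultdict

def frequent_task_descriptions(parsed, min_verb=500, min_users=10,
                               min_objects=5, min_object_freq=5):
    """Filter (user, verb, object) triples down to frequent verb+object
    task descriptions, mirroring the thresholds described above."""
    verb_count = Counter()
    verb_users = defaultdict(set)
    verb_objects = defaultdict(Counter)
    for user, verb, obj in parsed:
        verb_count[verb] += 1
        verb_users[verb].add(user)
        verb_objects[verb][obj] += 1
    # Verbs with >= 500 occurrences, >= 10 users, and >= 5 distinct objects.
    frequent_verbs = {v for v, n in verb_count.items()
                      if n >= min_verb
                      and len(verb_users[v]) >= min_users
                      and len(verb_objects[v]) >= min_objects}
    # Keep only objects seen at least 5 times with a frequent verb.
    return {(v, o): n for v in frequent_verbs
            for o, n in verb_objects[v].items() if n >= min_object_freq}
```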
Manual Labeling
Next, we aim to identify the tasks which underlie the frequent task descriptions, and categorize them into a task type taxonomy. Specifically, by manual inspection, we identified several key dimensions that separate tasks: (i) the cognitive load the task incurs (i.e., how much the task represents an interruption of the user's activity), (ii) the context in which the task is to be executed (i.e., at home, at work), and (iii) the (expected) duration of the task. This enabled us to categorize the extracted frequent task descriptions into one of six broader task types, with several subclasses, which we present in the following section with example reminders.
Figure 7.1: Bar plot showing the proposed reminder task-type taxonomy.
7.2 Task Type Taxonomy

In this section, we describe each of the six task types in turn, and provide examples of the associated verb+object patterns. For an overview of the task types and their relative occurrence in our data, see Figure 7.1. The example objects are shown in decreasing order of frequency, starting with the most common. Note that verbs are not uniquely associated with a single task type; rather, the verb+object pair may determine the task type (compare, e.g., "set alarm" to "set doctor appointment").
1. Go Somewhere (33.0%)
One third of the frequent tasks refer to the user moving from one place to another; see Table 7.3. We distinguish between two subtypes. The first subtype is running an errand (83.2%), where the reminder refers to executing a task at some location (e.g., "pick up milk"). Running an errand represents an interruption of the user's activity, but a task of a relatively small scale, i.e., it briefly takes up the user's availability. The second subtype is more comprehensive, and represents tasks that are characterized by a switch of context (16.8%), e.g., moving from one context or activity to another ("go to work", "leave for office"), which has a larger impact on the user's availability.

Table 7.3: Go Somewhere.

Run errand
  grab [something]               laundry, lunch, headphones
  get [something]                batteries
  pick up [something/someone]    laundry, person, pizza
  buy [something]                milk, flowers, coffee, pizza
  bring [something]              laptop, lunch, phone charger
  drop off [something]           car, dry cleaning, prescription
  return [something]             library books

Switch context
  leave (for) [some place]       house, work, airport
  come [somewhere]               home, back to work, in
  be [somewhere]                 at work, at home
  go (to) [somewhere]            gym, work, home, appointment
  stop by [some place]           the bank, at Walmart
  have (to) [something]          work, appointment
2. Chores (23.8%)
The second most common type of reminders represents daily chores; see Table 7.4. We distinguish two subtypes: recurring (66.5%) and standalone chores (33.5%). Both types represent smaller-scale tasks that briefly interrupt the user's activity.

Table 7.4: Chores.

Recurring
  take out [something]     trash, bins
  feed [something]         dogs, meter, cats, baby
  clean [something]        room, house, bathroom
  wash [something]         clothes, hair, dishes, car
  charge [something]       phone, fitbit, batteries
  do [something]           laundry, homework, taxes, yoga
  pay [something]          rent, bills, phone bill
  set [something]          alarm, reminder

Standalone
  write [something]        a check, letter, thank you note
  change [something]       laundry, oil, air filter
  cancel [something]       amazon prime, netflix
  order [something]        pizza, flowers
  renew [something]        books, driver's license, passport
  book [something]         hotel, flight
  mail [something]         letter, package, check
  submit [something]       timesheet, timecard, expenses
  fill out [something]     application, timesheet, form
  print [something]        tickets, paper, boarding pass
  pack [something]         lunch, gym clothes, clothes
3. Communicate (21.1%)
Next, a common task is a reminder to contact (“call,” “phone,” “text”) another individual: a person (e.g., “mom,” “jack,” “jane”), an organization/company (“AT&T”), or other (“hair dresser,” “doctor’s office”); see Table 7.5. We identify two subtypes: the majority reflects general, unspecified communication (94.7%) (e.g., “call mom”), and a smaller part (5.3%) represents coordination or planning tasks (e.g., “make doctor’s appointment”). Both subtypes represent tasks that briefly interrupt the user’s activity.

Table 7.5: Communicate.

General:
  send [something]             email, text, report
  email [someone]              dad, mom
  text [someone]               mom, dad
  call [someone]               mom, dad
  tell [someone] [something]   my wife I love her, happy birthday mom

Coordinate:
  set [an appointment]         doctors appointment
  make [an appointment]        doctors appointment, reservation
  schedule [an appointment]    haircut, doctors appointment
4. Manage Ongoing External Process (12.9%)
These reminders represent manipulation of an ongoing, external process, i.e., tasks where the user monitors or interacts with something, e.g., the laundry or oven; see Table 7.6. These tasks briefly interrupt a user’s activity.

Table 7.6: Manage Ongoing External Process.

  turn [on/off] [something]        water, oven, stove, heater
  check [something]                email, oven, laundry, food
  start [something]                dishwasher, laundry
  put [something] in [something]   pizza in oven, clothes in dryer
  take [something] out             pizza, chicken, laundry
5. Manage Ongoing User Activity (6.3%)
This class of reminders is similar to the previous class; however, as opposed to the user interacting with an external process, these reflect a change in the user’s own activity. They incur a higher cost on the user’s availability and cognitive load; see Table 7.7. We distinguish three subtypes: preparing (31.4%), starting (61.4%), and stopping an activity (7.2%).

Table 7.7: Manage Ongoing User Activity.

Activity/Prepare:
  get ready [to/for]      work, home

Activity/Start:
  start [some activity]   dinner, cooking, studying
  make [something]        food, breakfast, grocery list
  take [something]        a shower, a break
  play [something]        game, xbox, basketball
  watch [something]       tv, the walking dead, seahawks game

Activity/Stop:
  stop [some activity]    reading, playing
  finish [something]      homework, laundry, taxes
6. Eat/Consume (2.8%)
Another frequent reminder type refers to consuming something, most typically food (“have lunch”) or medicine (“take medicine”). These tasks are also small in scale, and range from brief interruptions (“take pills”) to longer interruptions (“have dinner”); see Table 7.8.

Table 7.8: Eat/Consume.

  take [something]   medicine
  eat [something]    lunch, dinner, breakfast, pizza
  have [something]   lunch, a snack, breakfast
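For reference, the taxonomy and the relative shares reported in this section can be captured in a small data structure. This is a sketch of one possible encoding; task-type shares are fractions of the frequent-task subset, and subtype shares are within each type.

```python
# A compact encoding of the reminder task-type taxonomy described above.
TASK_TAXONOMY = {
    "Go Somewhere":                    {"share": 0.330, "subtypes": {"Run errand": 0.832, "Switch context": 0.168}},
    "Chores":                          {"share": 0.238, "subtypes": {"Recurring": 0.665, "Standalone": 0.335}},
    "Communicate":                     {"share": 0.211, "subtypes": {"General": 0.947, "Coordinate": 0.053}},
    "Manage Ongoing External Process": {"share": 0.129, "subtypes": {}},
    "Manage Ongoing User Activity":    {"share": 0.063, "subtypes": {"Prepare": 0.314, "Start": 0.614, "Stop": 0.072}},
    "Eat/Consume":                     {"share": 0.028, "subtypes": {}},
}
```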
Next, we study the temporal patterns of reminders. We seek to understand when people create reminders, when reminders are set to notify users, and the average delay between creation and notification time for different reminders. Such knowledge could prove useful in providing likelihoods about when certain tasks tend to happen, or in predicting (follow-up) tasks. In this section, we focus on the research question: “Can we identify patterns in the times at which people create reminders, and, via notification times, when the associated tasks are to be executed?”
We study reminders on several levels of granularity. In Section 7.3.2 we look at global patterns and trends across all reminders. Next, we study temporal patterns per task type in Section 7.3.3. In Section 7.3.4, we perform a temporal analysis of task description terms. Finally, we study the relation between reminder creation and notification times in Section 7.3.5. First, we explain how we represent a reminder’s creation and notification time to enable our analyses.
To study common temporal patterns of reminders, we discretize time by dividing each day into six four-hour buckets: (i) late night [00:00–04:00), (ii) early morning [04:00–08:00), (iii) morning [08:00–12:00), (iv) afternoon [12:00–16:00), (v) evening [16:00–20:00), and (vi) night [20:00–00:00). Combining this time-of-day division with the days of the week yields a 7-by-6 matrix $M$, whose columns represent days and whose rows represent times. Each $r_{CT}$ and $r_{NT}$ can be represented as a cell in matrix $M$, i.e., $M_{i,j}$, where $i$ corresponds to the day of week and $j$ to the time of day. Furthermore, we distinguish between $M^{CT}$ and $M^{NT}$, the matrices whose cells contain the reminders that are created, or respectively set to notify, at a particular day and time. We represent each reminder as an object with the attributes described in Section 7.1.2: the reminder’s task description ($r_{task}$), creation time ($r_{CT}$), notification time ($r_{NT}$), and time-delta ($r_{\Delta T}$).

To study the temporal patterns, we look at the number of reminders that are created, or whose notifications are set, per cell. We compute conditional probabilities over the cells in $M^{CT}$ and $M^{NT}$, where the reminder’s creation or notification time is conditioned on the task type, time, or the terms in the task description:

$$P(r_{CT} = X \mid w) = \frac{|\{\, r \in R : w \in r_{task} \wedge r_{CT} = X \,\}|}{|\{\, r \in R : w \in r_{task} \,\}|} \qquad (7.1)$$

$$P(r_{NT} = X \mid w) = \frac{|\{\, r \in R : w \in r_{task} \wedge r_{NT} = X \,\}|}{|\{\, r \in R : w \in r_{task} \,\}|} \qquad (7.2)$$

That is, to estimate the conditional probability of a creation or notification time given a term from the task description, we take the number of reminders containing term $w$ that are created (or whose notification is set) at time $X$, over the total number of reminders that contain the term (see Equations 7.1 and 7.2). By computing this probability for each cell in $M^{CT}$ or $M^{NT}$ (such that $\sum_{(i,j) \in M} P(r_{NT} = (i,j) \mid w) = 1$), we generate a probability distribution over matrix $M$.

$$P(r_{NT} = X \mid r_{CT} = (i,j)) = \frac{|\{\, r \in M^{CT}_{i,j} : r_{NT} = X \,\}|}{|M^{CT}_{i,j}|} \qquad (7.3)$$

To study the common patterns in the periods of time between the creation of reminders and their notifications, we estimate a probability distribution over a reminder’s notification time given its creation time (see Equation 7.3). We compute this probability by taking the reminders in each cell of $M^{CT}$ that have their notifications set to fire at time $X$, over all the reminders in that cell.

Finally, we study the delays between setting and executing reminders, by collecting counts and plotting histograms of $r_{\Delta T}$ for a given subset of reminders, e.g., $\{r \in R : w \in r_{task}\}$ or $\{r \in R : r_{CT} = X\}$.
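As an illustration, the estimators of Equations 7.1–7.3 can be sketched as follows. The reminder objects are assumed to be dicts with ‘task’, ‘created’, and ‘notify’ fields; these names are ours, for illustration only.

```python
# A minimal sketch of the cell discretization and the conditional probability
# estimates of Eqs. 7.1-7.3. The 'created'/'notify' values are assumed to be
# standard library datetime objects.
from collections import Counter

def cell(dt):
    """Map a datetime to a (day-of-week, four-hour-bucket) cell of the 7x6 matrix M."""
    return (dt.weekday(), dt.hour // 4)

def p_time_given_term(reminders, w, field="notify"):
    """Eqs. 7.1/7.2: distribution over cells of M^NT (or M^CT with field='created')
    for reminders whose task description contains term w."""
    matching = [r for r in reminders if w in r["task"].split()]
    counts = Counter(cell(r[field]) for r in matching)
    return {c: n / len(matching) for c, n in counts.items()} if matching else {}

def p_notify_given_creation(reminders, creation_cell):
    """Eq. 7.3: distribution over notification cells, for reminders created
    in a given cell of M^CT."""
    in_cell = [r for r in reminders if cell(r["created"]) == creation_cell]
    counts = Counter(cell(r["notify"]) for r in in_cell)
    return {c: n / len(in_cell) for c, n in counts.items()} if in_cell else {}
```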
We now describe broad patterns of usage, and answer the following questions: “At which times during the day do people plan (i.e., create reminders), and at which times do they execute tasks (i.e., do reminder notifications trigger)?” and “How far in advance do people plan tasks?”

To answer these questions, we examine the temporal patterns in our log data over the aggregate of all reminders in the two-month sample (576,080 reminders). Figure 7.2 shows the prior probability of a reminder’s creation time, $P(r_{CT})$, and notification time, $P(r_{NT})$, in each cell of $M^{CT}$ and $M^{NT}$.

Figure 7.2: Distribution of reminder creation times (left plot) and reminder notification times (right plot) for all reminders in the two-month sample (n = 576,080).

Looking at Figure 7.2, we see that in our sample, planning (reminder creation) most frequently happens later in the day, more so than during office hours (morning and midday). This observation could be explained by the user’s availability; users may have more time to interact with their mobile devices in the evenings. Additionally, the end of the day is a natural moment for “wrapping up the day,” i.e., looking backward at completed tasks and forward to future ones. Turning our attention to notification times, the right plot of Figure 7.2 shows a slightly different pattern: people execute tasks (i.e., notifications trigger) throughout the day, from morning to evening. This shows that users want to be reminded of tasks throughout the day, in different contexts (e.g., both at home and at work). This is reflected in our task-type taxonomy, where tasks are related to both contexts. We also note that slightly more notifications trigger on weekdays than on weekends, and that more notifications trigger at the start and end of the workweek than midweek or on weekends. This observation may be attributed to the same phenomenon as for reminder creation; users may tend to employ reminders for activities that switch between week and weekend contexts. Finally, comparing the two plots shows that notification times are slightly less uniformly distributed than creation times; e.g., users create reminders late at night, when it is relatively unlikely for notifications to fire.

Next, to determine how far in advance users typically plan, we look at the delays between reminder creation and notification in Figure 7.3. The top plot shows distinct spikes around five-minute intervals, which are due to reminders with a relative time indication (e.g., “remind me to take out the pizza in 5 minutes”). These intervals are more likely to come to mind than more obscure time horizons (e.g., “remind me to take out the pizza at 6.34pm”). The second and third plots clearly illustrate that the majority of reminders have a short delay: around 25% of the reminders are set to notify within the same hour (second plot), and around 80% of the reminders are set to notify within 24 hours (third plot). Interestingly, there is a small hump around 8–9 hours in the second plot, which may be explained by reminders that span a night (e.g., created at the end of the day, to notify early the next day) or a working day (created in the morning, to notify at the end of the day).

In summary, we have shown that on average people tend to set plans in the evening, and execute them throughout the day. Furthermore, most tasks that drive reminder setting are short-term tasks to be executed within the next 24 hours.

In this section, we explore whether different task types are characterized by distinct temporal patterns that differ from the global patterns seen in the previous section. To do so, we take the subset of the 2,484 unique frequent task descriptions in our two-month sample of reminders.
We extract all reminders that match these task descriptions exactly from our full dataset, yielding a subset of 125,376 task type-labeled reminders, which we use for analysis. We aim to answer the same questions raised in the previous section, but at the level of task type, as opposed to a characterization of the global aggregate.
Creation and notification times.
First, we look at the probability distribution of reminder creation times per task type, i.e., $P(r_{CT} \mid \text{task type})$. Looking at the distributions for each task type, we discover two broader groups: per task type, reminders are either created mostly in the morning and midday blocks (roughly corresponding to office hours), or outside these blocks. Figure 7.4 shows examples of both: “Activity” and “Go somewhere” reminders are mostly created during office hours, while, e.g., “Communicate” and “Chore” reminders are more prone to be created in the evenings. Another interesting observation is that activity-related reminders are comparatively frequent on weekends.

Figure 7.3: Histograms of delays (in minutes, hours, and days, from top to bottom) between reminder creation and notification times.

Figure 7.4: Reminder creation time probability distributions over time, for different task types.

Next, we study reminder notification times per task type, i.e., $P(r_{NT} \mid \text{task type})$. Here, a similar pattern emerges. Broadly speaking, there are two types of tasks: those set to notify during office hours, and those that trigger outside these hours; see Figure 7.5 for examples. “Communicate” and “Go” fall under the former type, whereas “Chore” and “Manage ongoing process” fall under the latter. The nature of the tasks explains this distinction: the former relate to work-related tasks (communication, work-related errands), whilst the majority of the latter represent activities that are more common in a home setting (cooking, cleaning).

Taking a closer look at the Communicate task subclasses in Figure 7.6, we show how “Communicate/General” and “Communicate/Coordinate” differ: the former is more uniformly distributed, whilst the latter is denser around office hours. The general subtask also has comparatively more reminders that trigger on weekends, whereas coordinate is more strongly centered on weekdays. These distinct patterns suggest the subclasses indeed represent different types of tasks.
Reminder creation and notification delay.
To better understand differences in the lead times between reminder creation and notification, we present an overview of the distribution of reminder delays per task type in Figure 7.7. In general, the lower a boxplot lies on the y-axis, the lower the lead time, i.e., the shorter the delay between creating the reminder and executing the task. It is worth comparing, e.g., the plot of “Manage ongoing process” to those of the “Go” or “Communicate” task types: the execution of tasks that manage ongoing processes appears to be planned with a much shorter lead time than the other types of task. Considering the nature of these tasks, where ongoing processes often represent the close monitoring or checking of a process (e.g., cooking or cleaning tasks), it is understandable that the delays are on the order of a few minutes, rather than hours. “Communicate/Coordinate” has the largest delay on average, i.e., it is the task type people plan furthest in advance.

Figure 7.5: Reminder notification time probability distributions over time, per task type.

A more detailed examination of the differences between the “Communicate” subtasks, illustrated in Figure 7.8, reveals that “Communicate/General” subtasks are more likely to be executed with a lower lead time, as shown by the peak at hour 0 in the top plot. The “Communicate/Coordinate” subtask is about as likely to be executed the next day, as seen by the high peak around the 12-hour mark in the bottom plot. Much like the observations made in the previous section, the difference in the patterns between the two “Communicate” subtasks suggests that the distinction between the subtypes is meaningful: differences are found not only on a semantic level through our qualitative analysis, but also in temporal patterns.

In summary, we have shown how task type-specific temporal patterns differ from the aggregate patterns in Section 7.3.2.
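A per-task-type delay summary of the kind visualized in Figure 7.7 could be computed along these lines. This is a sketch; the ‘task_type’ label is assumed to come from the taxonomy above, and the datetime fields are the illustrative names used earlier.

```python
# Sketch: summarize the creation-to-notification delay (r_dT, in hours) per
# task type, mirroring the boxplot comparison of Figure 7.7.
import statistics

def delay_summary(reminders):
    by_type = {}
    for r in reminders:
        hours = (r["notify"] - r["created"]).total_seconds() / 3600.0
        by_type.setdefault(r["task_type"], []).append(hours)
    # (first quartile, median, third quartile) of the delay, per task type
    return {
        t: tuple(statistics.quantiles(d, n=4))
        for t, d in by_type.items() if len(d) >= 2
    }
```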
One can hypothesize that the terms in task descriptions show distinct temporal patterns, e.g., reminders that contain the term “pizza” are likely to trigger around dinner time.
Figure 7.6: “Communicate” subclass notification time probability distributions over time.

The presence of such temporal patterns may be leveraged for predicting reminder creation or notification times. To study this, we manually inspected the temporal distributions of the 500 most frequent task description terms. More specifically, we compute conditional probabilities for a cell in $M^{CT}$ or $M^{NT}$ given a term $w$ (see Equations 7.1 and 7.2). We found several intuitive patterns, which we illustrate below with examples. These are presented simply to motivate the intuition behind the term features used in our predictive model in Section 7.4.

Figure 7.9 shows creation and notification times of task descriptions that contain the terms “church” or “appointment.” The “appointment” plot shows a strong pattern around the morning and midday blocks, representing office hours. Reminders that contain “church” show a clear and intuitive pattern too; they are largely created from Saturday night through Sunday morning, and are set to notify on Sunday early mornings and mornings. When we examine the delays between reminder creation and notification, clear patterns emerge. In Figure 7.10, we compare the average delays of reminders containing the term “appointment” to those containing “laundry.” Clearly, on average, “appointment” reminders have longer delays, reflecting the nature of the task (which may involve other individuals and hence require more planning), whereas “laundry” reminders are more likely to reflect short-term tasks (which may be performed individually).

In summary, we see distinct temporal patterns in task description terms. In Section 7.4, we study the generalizability of these patterns.

Finally, we look at correlations between reminders’ creation and notification times. Motivated by the observation that most reminders are set to notify shortly after they are created, we study the probability of a reminder’s notification time given its creation time, $P(r_{NT} \mid r_{CT})$; see Figure 7.11 for examples.

Figure 7.7: Boxplot showing the delay in hours between reminder creation and notification times (n = 125,376), for all reminders and per task type.

Looking at the plots in detail, we see that reminders across different creation times appear similar: they are most likely to have their notification fire within the same cell or the next, confirming earlier observations that the majority of reminders are short-term (i.e., same cell). However, upon closer inspection, we see that as a reminder’s creation time moves towards later in the day, the reminder is more likely to be set to notify the next day. Furthermore, in the third plot from the left, we see how reminders created on Friday evenings have a small but notable probability of having their notification fire on Monday morning (i.e., the reminder spans the weekend). These patterns show that the delay between reminder creation and notification time is low on average, but that the length of the delay is not independent of the creation time.
In summary, we have shown distinct temporal patterns for reminders of different task types, and for the terms in task descriptions. Finally, we have shown that a reminder’s notification time is most likely shortly after its creation time, but that the later in the day a reminder is created, the more likely its notification time is further in the future.

In the previous section, we have shown temporal patterns in reminder creation and notification times of four types: aggregate patterns, task type-related, term-based, and time-based. To study whether these patterns can be effectively harnessed, we address a prediction task. Specifically, we turn to the task of predicting the day of the week on which a task is most likely to happen (i.e., predicting $r_{NT}$). Motivated by our observation that the majority of reminders are set to trigger soon after being set (Sections 7.3.2 and 7.3.5), and by the patterns we observed in the task descriptions’ terms (Section 7.3.4), we aim to answer the following research questions: “Is the reminder’s creation time indicative of its notification time?” and “Do term-based features yield an increase in predictive power?”

Figure 7.8: Delay (lead time) between reminder creation and notification for the “Communicate” subtasks, as histograms over minutes, hours, and days: “Communicate/General” in the top plots, “Communicate/Coordinate” in the bottom.
The aim of our experiments is not to develop complex or novel predictive models, but instead to study whether the patterns discussed earlier generalize so as to contribute to overall predictive performance.

We cast the task of predicting the day of week a reminder is set to notify as a multiclass classification task, where each day of the week corresponds to a class. The input to our predictive model is the reminder’s task description ($r_{task}$) and creation time ($r_{CT}$), and the target class is the day of week of the notification time ($r_{NT}$). We measure the predictive power of the patterns identified in the previous sections via term-based and (creation) time-based features. Specifically, as term-based features we extract bag-of-words features (unigrams), and our time-based features correspond to $r_{CT}$’s time of day (row), day of week (column), and the number of minutes since the start of the week; see Table 7.9 for an overview.

Table 7.9: Features used for prediction.

  Term features   unigram bag-of-words features
  Time features   $r_{CT}$ time of day; $r_{CT}$ day of week; $r_{CT}$ minutes since start of week

We use Gradient Boosted Decision Trees for classification. This method has proven to be robust and efficient in large-scale learning problems [90], and its ability to deal with non-linearity in the feature space and with heterogeneous features makes it a natural choice. To address the multiclass nature of our problem, we employ a one-vs-all classification strategy, where we train seven binary classifiers and output the prediction with the highest confidence as the final prediction. We compare the accuracy to randomly picking a day of the week (with an accuracy of 1/7 ≈ 0.1429) and to a more competitive baseline that predicts the notification will fire on the same day the reminder was created (BL-SameDay).

For the experiments, we sample six months of data (January through June 2015). All data were filtered according to the process described in Section 7.1.2, resulting in a total of 1,509,340 reminders. We split this data sequentially: the first 70% (approx. January 1 to May 7) forms the training set, and the last 30% (approx. May 8 to June 30) forms the test set. We used the first two months of the training set for the analysis described in Sections 7.1 and 7.3, as well as for parameter tuning, before retraining on the entire training set. In the next section, we report predictive performance on the held-out test set; specifically, we report macro- and micro-averaged accuracy over the classes (Macro and Micro, respectively). We compare three approaches: one that leverages time features based on the reminder’s creation time (Time only), one with term features (Terms only), and finally a model that leverages both types of features (Full model). We test for statistical significance using t-tests, comparing our predictive models against BL-SameDay; the symbols ▲ and ▼ denote statistically significant differences (greater than the baseline and worse than the baseline, respectively) at α = 0.01.

Figure 7.9: Creation and notification times for reminders with the terms “church” (top row) and “appointment” (bottom row).

Figure 7.10: Delays (in minutes, hours, and days) between reminder creation and notification for reminder task descriptions containing the terms “appointment” (left column) and “laundry” (right column).

Figure 7.11: Reminder notification time probability distributions over time (i.e., $P(r_{NT} \mid r_{CT})$), for three different $r_{CT}$.

Table 7.10 shows the results of our prediction task. First, we note that the baseline of predicting the notification time to be on the same day as the creation time, at 0.5147 micro-averaged accuracy, performs much better than random (at 0.1429). This indicates users mostly set reminders to plan for events in the short term. Next, we see that the Time only model, with a micro-averaged accuracy of 0.6279, significantly improves over the baseline, indicating that the reminder creation time helps further improve prediction accuracy. As noted earlier, tasks planned late at night are more likely to be executed on a different day, and the use of creation time helps leverage this and more general patterns. Finally, the model that uses only features based on the task description (Terms only) performs better than random, but does not outperform the baseline. However, when combined with the time model (Full model), we see an increase of 8.2% relative to the Time only model. We conclude that the creation time provides the most information for predicting when a task will be performed, but that the task description provides significant additional information. Framed differently, we show that both the context in which digital traces are produced (i.e., the time of day) and the content of the digital traces (i.e., the textual description of the task) provide signals for better predicting the real-world behavior of the producers of digital traces.
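As an illustration of this setup, a minimal pipeline could be sketched as follows. The library choice (scikit-learn) is an assumption on our part, as the chapter does not name an implementation, and the reminder field names (‘task’, ‘created’, ‘notify’) are illustrative.

```python
# A sketch of the day-of-week prediction pipeline: unigram bag-of-words term
# features plus creation-time features, fed to one-vs-all gradient boosted trees.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

def time_features(records):
    # r_CT time of day (four-hour bucket), day of week, minutes since start of week.
    return np.array([
        [r["created"].hour // 4,
         r["created"].weekday(),
         r["created"].weekday() * 1440 + r["created"].hour * 60 + r["created"].minute]
        for r in records
    ])

features = make_union(
    make_pipeline(
        FunctionTransformer(lambda rs: [r["task"] for r in rs]),
        CountVectorizer(),  # unigram bag-of-words term features
    ),
    FunctionTransformer(time_features),
)

model = make_pipeline(
    features,
    OneVsRestClassifier(GradientBoostingClassifier()),  # seven binary classifiers
)

# Usage: fit on the sequential training split, predict the notification weekday.
# model.fit(train, [r["notify"].weekday() for r in train])
# predictions = model.predict(test)

def bl_same_day(records):
    """BL-SameDay baseline: predict that the notification fires on the creation day."""
    return [r["created"].weekday() for r in records]
```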
In this chapter, we have performed a large-scale analysis of reminder data from user activity in a natural setting, to answer the following research question:
RQ5
Can we identify patterns in the times at which people create reminders, and, via notification times, when the associated tasks are to be executed?
Table 7.10: Average accuracy of the day-of-week prediction task. Statistical significance is tested against BL-SameDay.

  Run          Micro   Macro   Error reduction
  Full model   ▲       ▲       +0.3381
  Time only    ▲       ▲       +0.2333
  Terms only   ▼       ▼       −0.6944
  BL-SameDay
Task type taxonomy.
Through our log analyses, we have shown common reminders and made an attempt at identifying and categorizing the types of tasks that underlie them. We identified that the majority of reminders in our sample refer to daily household chores, running errands, or switching contexts.
Temporal patterns.
Next, we showed how reminders display different temporal patterns depending on the task type they represent, the reminder’s creation time, and the terms in the task description. Most notably, we have shown that people mostly plan for tasks in the evening hours, and that most tasks are set to execute in the mornings and throughout the day. On average, people plan in advance for the short term, i.e., 80% of the reminders in our sample log are set to trigger on the same day as reminder creation.
Prediction task.
Finally, we demonstrated that we can leverage these patterns to predict the day of the week on which a reminder is most likely to trigger, i.e., the day the task is most likely to be executed. Specifically, we confirm that the reminder’s creation time is a strong indicator of the notification time, but that including the task description further improves accuracy over the strongest baseline, with a 33% reduction in error.

In line with our findings in the previous chapter, here we confirm that both the context in which digital traces are produced (the time of day) and the (textual) content of the digital traces (the reminder description) provide valuable signals in predicting the real-world behavior and activity of our entities of interest: the producers of digital traces. Furthermore, the findings in this chapter have implications for designing systems to help with task completion, and more generally for developing technology to reduce prospective memory failures. The analysis and prediction task show that we can leverage large-scale interaction logs to predict a user’s (planned) activities.

There are several limitations to our log analyses. First, we performed this analysis on a specific subset of reminders: reminders from one geographic locale, and of a single type: time-based. There are opportunities to understand cultural and linguistic factors in reminder creation by considering reminders from multiple regions. We additionally seek to investigate other types of reminders, such as those involving people, places, and events. Second, it is difficult to quantify the comprehensiveness of the task type taxonomy, which covers common reminders. The taxonomy may therefore not cover more intricate, personal, specific, or complex reminders, the nature of which needs to be better understood. Finally, our approach and analysis are entirely log-based. The taxonomy’s categories were manually labeled, and we make inferences and assumptions about the tasks that people are engaged in. User studies are needed to better understand the reminder process, including the generation and value of reminders, and how people behave when they are notified.

Future work includes developing more sophisticated models (e.g., considering personalized signals) to improve prediction performance. Besides the value of the insights gained in this chapter from the discovery perspective, i.e., from understanding the link between people’s interaction logs with intelligent assistants and their (planned) activities and tasks, the insights and predictions about tasks and the use of reminders can also prove valuable for the development of systems endowed with the ability to proactively reserve time, manage conflicts, remind people about tasks they might forget, and, more generally, help people achieve their goals.
Conclusions

“Si finis bonus est, totum bonum erit.”
—Gesta Romanorum, Tale LXVII
In this thesis, we have studied discovery in digital traces. We have done so in two parts: in Part I we have studied the content of (textual) digital traces, and in Part II we have turned to the contexts in which digital traces are created, with the goal of leveraging digital traces for predicting real-world activity.

Part I revolves around semantic search. Semantic search is a search paradigm that uses structured knowledge to increase retrieval effectiveness [250], and allows us to better support the exploratory search process that is inherent to the discovery process [268]. More specifically, in this part our entities of interest are the real-world entities that occur (i.e., are mentioned) in digital traces. We leverage publicly available knowledge bases such as Wikipedia and Freebase, and study emerging entities: entities that are not (yet) described in the knowledge base. This is motivated by the fact that in typical discovery scenarios (e.g., investigative journalism or digital forensics), the entities of interest may not be known a priori. In three chapters, we analyze the temporal patterns of emerging entities in online text streams (Chapter 3), predict them in social media streams (Chapter 4), and enrich their representations by leveraging “collective intelligence” (i.e., how people refer to and search for the entities on the web) for increased retrieval effectiveness (Chapter 5).

The second part of this thesis revolves around the contexts in which digital traces are created. Here, we shift our focus, and our entities of interest are the producers of digital traces, i.e., the people who leave behind digital traces. Our goal is to predict real-world activity from digital traces, a core task in E-Discovery [197]. We present two case studies. In the first, we study enterprise email collections: we analyze aspects that govern communication between employees, and aim to predict email communication by leveraging the communication graph and email content (Chapter 6). Next, we study user interaction logs with a personal intelligent assistant (Microsoft Cortana), which represent users’ planned activities and tasks (Chapter 7). In this chapter, we show common usage patterns of the reminder service, and show how we can leverage the uncovered patterns for predicting the most likely day at which a task will be executed.

In this final chapter of the thesis, we re-iterate the research questions we answered, and summarize our methods and findings in Section 8.1. In Section 8.2 we reflect on future work.
Here, we summarize the methods, findings, and their implications. We do so by answering the research questions raised in Chapter 1.
Part I. Analyzing, Predicting, and Retrieving Emerging Entities
In the first content chapter of this thesis (Chapter 3), we start our exploration of real-world entities in digital traces with a large-scale analysis of how entities emerge. In Chapter 3 we answer our first research question:
RQ1
Are there common temporal patterns in how entities of interest emerge in online text streams?

To answer this question, we analyze a large collection of entity time series, i.e., time series of entity mentions in the lead time between an entity’s first mention in online text streams and its subsequent incorporation into the knowledge base. We apply an unsupervised hierarchical clustering method to uncover groups of entities that exhibit different emergence patterns. We discover two main emergence patterns. First, roughly half of the entities we study emerge in a “bursty” fashion, i.e., they surface in online text streams without a precedent, and blast into activity before they are incorporated in the KB. Other entities display a “delayed” pattern, where they appear in public discourse, experience a period of inactivity, and then resurface before being incorporated into the KB. In addition to these emergence patterns, we found distinct differences in the emergence patterns of different types of entities, suggesting that different entities of interest exhibit different emergence patterns.

The work presented in this chapter has several implications. Most notably, the analyses have implications for designing systems to detect emerging entities before they are part of the KB. We have shown how entities tend to resurface multiple times before being added to Wikipedia. This suggests that burst detection may be an effective way of “catching” emerging entities before they are part of the KB. Furthermore, we have shown that the different streams in which entities emerge are indicative of how fast an entity may be added to the KB.

The work presented in this chapter also has several limitations. First, evaluating unsupervised clustering without labels is not trivial: Von Luxburg et al. [254] aptly question whether clustering is an “art or science.” However, we do have indications to believe the clusters we generated are meaningful. First, the cluster signatures show distinct and statistically significantly different patterns, whereas different groupings of the time series (e.g., by stream or by entity type) did not yield any discernible patterns. Furthermore, the structure of the dendrogram (i.e., the hierarchical cluster tree) suggests there are clear subgroups in the data.

Another limitation of our study relates to the dataset used. First, we rely on automatically generated annotations (the FAKBA1 dataset), which means we cannot assume a 100% accurate set of annotations. What is more, the long-tail entities that likely populate our subset of “emerging entities” are known to be comparatively more difficult to link, as most entity linking methods rely strongly on “popularity”-based features, such as commonness [181]. Next to the quality of the annotations, the coverage or selection of sources in the data may be a concern for how well our findings generalize to other domains or datasets. Sampling bias, e.g., the absence of popular social media platforms (such as Facebook, Twitter, or Tumblr), means the findings may differ with other datasets. Finally, there is the cultural bias inherent to the dataset selection: by studying only English sources, we effectively studied how entities emerge in the English-speaking world.

The analysis of emerging entities in Chapter 3 shows that knowledge bases are never complete: new entities may emerge as events unfold, but at the same time, long-tail, relatively unknown entities may also be added to a KB.
As a follow-up to this chapter, we turn our attention to predicting newly emerging entities in Chapter 4. More specifically, we focus on predicting mentions of emerging KB entities in social media streams, motivated by the high pace and noisy nature that characterize social media streams. Our method leverages an entity linking system to automatically generate training data for a named-entity recognizer to learn to recognize KB entities. We answer:
RQ2
Can we leverage prior knowledge of entities of interest to bootstrap the discovery of new entities of interest?

To answer this question, we propose a novel unsupervised method for generating pseudo-training data to train a named-entity recognizer and classifier (NERC) for predicting mentions of emerging entities that are likely to be added to a KB. The method includes two sampling methods for selecting high-quality training samples: the first selects samples based on the textual quality of the sample, and the second leverages the confidence score of the entity linking system to select samples of which the entity linker is more confident, in an attempt to reduce noisy samples. We measure the effectiveness of these sampling methods by their impact on the accuracy of predicting newly emerging entities. Furthermore, we perform an additional analysis, where we study the impact of the amount of prior knowledge. We approximate the quantity of prior knowledge by randomly sampling reference knowledge bases of different sizes, i.e., we leave out a part of the knowledge base, and study prediction effectiveness.

We find that sampling by textual quality improves the performance of our NERC method, and consequently our method’s performance in predicting emerging entities. Furthermore, we show that setting a higher threshold on the entity linking system’s confidence score for generating the pseudo-ground truth results in fewer labels but better performance. We show that the NERC is better able to separate noise from entities that are worth including in a knowledge base. Finally, we show that in the case of a small amount of prior knowledge, i.e., a limited size of the available initial knowledge, our method is able to cope with missing labels and incomplete data, as observed through its consistent and stable precision. This finding justifies our proposed method, which assumes incomplete data by design.

The implication of our findings is that our approach can effectively support the exploratory search process, by retrieving entities similar to those in a reference KB that are not yet part of the KB. A limitation of our work is the retrospective scenario we employ in our evaluation. For this reason, we cannot measure the impact of the novelty of entities of interest as they emerge.
Finally, in Chapter 5 we address the follow-up task of enriching entity representations of emerging entities with descriptions coming from a variety of sources, in a real-time, streaming manner. Collecting entity descriptions from different sources, and combining them into a single representation for improved retrieval effectiveness, is beneficial in scenarios where information about the same entities is spread over multiple sources, e.g., in a discovery scenario, where multiple sources that represent the activity of people in a company (e.g., email collections, collaboration platforms, social media posts) can be used to paint the full picture. We answer:
RQ3
Can we leverage collective intelligence to construct entity representations for increased retrieval effectiveness of entities of interest?

We answer this question by proposing an effective way of leveraging collective intelligence (i.e., entity descriptions from multiple sources) to construct entity representations for optimal retrieval effectiveness of entities of interest. More specifically, we collect descriptions from knowledge bases, social media, and the web, to increase the retrieval effectiveness of entities. We do this in a dynamic scenario, i.e., we distinguish between static description sources, where descriptions are aggregated and added to entity representations, and dynamic description sources, where descriptions come in a real-time, streaming manner.

The main challenge with leveraging collective intelligence is the heterogeneity between entities: some entities may receive many descriptions (i.e., head entities), while others may receive few (tail entities). At the same time, different description sources represent very different content; compare, e.g., the structured and clean entity description from a knowledge base to the contexts in which people refer to entities on social media.

First, we demonstrate that incorporating dynamic description sources into dynamic collective entity representations enables a better matching of users’ queries to entities, resulting in an increase in entity ranking effectiveness. In addition, we show that informing the ranker of the expansion state of the entities further increases ranking effectiveness. Finally, we show how retraining the ranker leads to improved ranking effectiveness for dynamic collective entity representations; however, even a static (not periodically retrained) ranker’s performance improves over time, suggesting that even the static ranker benefits from newly incoming descriptions.

The findings in this chapter imply that information from different types of sources can be effectively combined to improve the retrieval effectiveness of entities of interest. The scenario of different (heterogeneous) data sources that may be related is not uncommon in E-Discovery, e.g., when an email box and associated attachments or files can all be traced to a single entity.

One limitation of the work presented in this chapter relates to the experimental design, which was constrained by the lack of temporally aligned and sizeable datasets. In particular, the temporal misalignment between different corpora prevents the analysis of temporal patterns that affect entities in unforeseen ways.
Part II. Analyzing and Predicting Activity from Digital Traces
In the first chapter of this part, the digital traces under study are enterprise email, and the entities of interest are emailers.
Our aim in Chapter 6 is to provide insights into the aspects that guide communication between people. Specifically, we study the impact of aspects of the communication graph (e.g., the strength of ties between emailers, or the proximity of emailers in the enterprise network), and the impact of email content (e.g., the similarity of email content between two emailers). We apply these signals to predicting enterprise email communication, which may find application in detecting unexpected communication. We answer:
RQ4
Can we predict email communication through modeling email content and communication graph properties?

We present a hybrid model for email recipient prediction that leverages both the information from the email network’s communication graph and the language models of the emailers, estimated from the content of the emails that each user sends. Our model starts from scratch, in that it does not assume or need seed recipients, and it is updated for each email sent.

We show that the communication graph properties and the email content signals each provide a strong baseline for predicting recipients, but are complementary, i.e., combining both signals achieves the highest accuracy for predicting the recipients of email. Furthermore, we show that the number of received emails is an effective method for estimating the prior probability of observing a recipient, and that the number of emails sent between two users is an effective way of estimating the “connectedness” between the two users.

The implication of our findings is that both the context in which digital traces are created (i.e., the position of an emailer in the communication graph, her ties with surrounding emailers) and the content (i.e., the email itself) are important signals in predicting real-world behavior (i.e., email communication).

Next, we study interaction logs with an intelligent personal assistant in Chapter 7. Intelligent assistants are proliferating across (both mobile and desktop) devices, and they have a close proximity to and embedding in our day-to-day life (in part thanks to their conversational and “personal” nature). For these reasons, interaction logs with personal assistants are a rich resource for digital evidence. In this chapter we take a similar approach to the previous one: we analyze a novel dataset of interaction logs, and address a prediction task to study how digital traces can be employed to predict the real-world activity of our entities of interest.

User interaction logs of intelligent personal assistants may contain many rich contextual clues, and represent the user’s (planned) activities or geographic location at any given time. We are particularly interested in the reminder service of one such personal assistant (Microsoft Cortana), as this service is heavily used, and reminders represent users’ (planned) tasks, and can hence be used to gain insights into their (planned) activities. In combination with other data, these logs could be a rich signal for inferring real-world activity, a core task in discovery. We answer:
RQ5
Can we identify patterns in the times at which people create reminders, and, via notification times, when the associated tasks are to be executed?

We analyze a large-scale user log, comprising over 500,000 time-based reminders of over 90,000 users. We identify a body of common task types that give rise to reminders across a large number of users, and we subsequently arrange these tasks into a taxonomy. Finally, we study their temporal patterns, and we address a prediction task, where we aim to predict when a task reminder is set to trigger (i.e., when the user aims to execute the task), given the reminder’s task description and creation time.

We show how reminders display different temporal patterns depending on the task type they represent, the reminder’s creation time, and the terms in the task description. We show that the time at which a user creates a reminder is a strong indication of when the task is scheduled to be executed. Furthermore, we show that including the text of the task reminder further improves prediction accuracy. Much like in Chapter 6, we confirm that combining the content of the digital traces (i.e., the textual description of the reminder) and the context in which they are created (i.e., the time of day) achieves the highest prediction accuracy, suggesting that both signals are complementary.

As in the previous chapter, the findings in this chapter imply that both the context (the time of day) and the content (the reminder description) of digital traces provide signals that help in predicting the real-world behavior of our entities of interest: the producers of digital traces.

This chapter is not without limitations either. First, we make assumptions about the behavior and activities of people, based on interaction logs. Without conducting user studies or using other ways to infer the actual activities of people (beyond what they say to Cortana), we cannot make any statements about people’s actual activities.
In this section, we summarize some of the limitations of the work presented in this thesis, and identify some areas for potential follow-up work.
Part I. Analyzing, Predicting, and Retrieving Emerging Entities
Chapter 3 is a starting point for studying how entities emerge in public discourse, and what happens in the lead time between an entity first surfacing and subsequently being deemed “important enough” to be added to the KB. As a next step, we should take a closer look at the detailed circumstances under which entities emerge, by not only considering the number of documents they appear in over time, but also the contexts in which they appear, e.g., by looking at the content of the articles themselves. Another interesting aspect of emerging entities, which falls outside the scope of the present work, is the notion of when entities are “deemed important enough,” i.e., when the collective reaches consensus. For example, one could study the emergence patterns of entities that are removed from the KB. Finally, the observations made in this chapter could be explored in a prediction task, where, e.g., given a partial entity time series, the task would be to predict the point at which the entity will be incorporated into the KB.

The method we present in Chapter 4 addresses entity mention detection for emerging entities, i.e., we leverage a knowledge base to label entity mentions, so a named-entity recognizer can identify similar but unlinked mentions in social media posts. A natural follow-up would be to “close the loop,” i.e., feed the emerging entities back to populate the knowledge base. Closing the loop would open the door to a fully automated knowledge base population process, the scenario where the knowledge base is automatically populated with entity mentions and representations, which in turn means the pseudo-training data generated by the entity linking system remains up-to-date at all times. To achieve this, an additional step of constructing seed entity representations is necessary. One approach is to mine keyphrases [120] from the tweets in which the emerging entities appear. To effectively construct these representations, an entity clustering or disambiguation step would be beneficial, to collect enough content to extract keyphrases from [55, 215].

Another challenge in this direction is that of supervision; letting the system roam free may introduce noise (i.e., false positives) into the knowledge base, which may over time corrupt the training data and hence derail the system. Studying ways of incorporating a “human in the loop” when evaluating and judging predictions, using, e.g., crowdsourcing, finds applications in any task in which a self-learning, automated system runs freely [60, 84]. The human-in-the-loop paradigm for evaluating and tracking algorithmic predictions is of particular interest in the E-Discovery domain, where machine-generated predictions can have severe impact on the outcomes of, e.g., legal cases [197].

A natural extension to the work presented in Chapter 5 is to study our method in a real-life scenario, with real-time, streaming data, to provide a more fine-grained analysis of the impact of the importance features, and to study the link between the temporal patterns of entities (as studied in Chapter 3) and their retrieval effectiveness. Larger and temporally aligned data collections would also increase the potential challenge of “swamping” [223] or “document vector saturation” [138], where description sources may completely overtake an entity representation. This swamping phenomenon could be addressed by incorporating a temporal decay on the terms that make up the representation, or by imposing a size limit on the fields. Furthermore, additional feature engineering, e.g., informing the ranker of the diversity of terms in a field, or of the novelty of the terms compared to the original entity representation, may also prove beneficial when the number and volume of descriptions increase.

In addition, we focused on a concrete end-to-end task in this chapter: we evaluate the quality of the representations by measuring retrieval effectiveness, given the added entity descriptions. Another direction would be to study the shifting entity representations from a more analytic perspective. One could study entity drift, i.e., whether entity representations change over time in meaningful ways, e.g., by studying which words are more strongly associated with entities at different points in time. Similar to studying how word use may shift over time [139], studying entity representation drift may provide insights into the types of entities or events that are prone to change more or less intensely.
Part II. Analyzing and Predicting Activity from Digital Traces
Our study in Chapter 6 focused on the (relative) contribution of two different aspects: communication graph and email content properties. Engineering-wise, however, prediction accuracy could be improved by a number of approaches. First, employing the scores of our generative model as features in a machine learning model, instead of using them directly for ranking, means multiple signals could be easily combined, and prediction accuracy may increase. Next, taking a more fine-grained or local approach to the communication graph could further boost prediction accuracy, by incorporating (implicit) organizational structures of the enterprise. Finally, the analysis of our experimental results highlighted that dealing with growing networks, and handling time in general, proved a challenge. Incorporating time-awareness into both components of the model may prove beneficial for performance over time, and studying the effect of communication patterns in evolving networks [3, 248] may alleviate these weaknesses, and further improve our model’s prediction accuracy.

Another opportunity for the work presented in that chapter is to adapt the presented method to novel tasks. One could, e.g., look at role detection, or at inferring a company’s hierarchical structure, by leveraging both social network aspects and email content (e.g., common topics amongst employees). Furthermore, a predictive model with accurate recipient predictions could be employed for rating the “likelihood” of historic interactions, and thus be applied to outlier or anomaly detection, to discover communication in an enterprise that is “unexpected,” and therefore may be of interest to, e.g., a digital forensic analyst.

The prediction task studied in Chapter 7 serves as a starting point for further research into predicting (future) tasks and activities given historic signals. Future work includes developing more sophisticated models. Whereas the current work focused on global patterns, seen frequently and across many users, considering more personalized signals may increase prediction accuracy; it is likely that users exhibit distinct patterns and usage of the reminder service. Further work could also incorporate more signals into the task, e.g., leveraging geographical signals and location services to better infer the current context of the user, to be able to distinguish between being at work, at home, or driving/in transit, or even a more fine-grained context, by looking up the type of location the user currently is at (e.g., a restaurant or grocery store).

Furthermore, more sophisticated approaches to understanding the language of the reminder descriptions may prove beneficial in clustering similar task reminders, to battle sparsity and yield stronger (temporal) signals for different task types. Topic modeling, distributional, or dimension reduction techniques (through, e.g., word or reminder embeddings) could prove beneficial in clustering similar task descriptions.

Finally, the insights and predictions about tasks and the use of a personal assistant’s reminder service could also prove valuable in a more practical or commercial setting, e.g., for developing systems endowed with the ability to proactively reserve time for users, automatically manage their scheduling and planning conflicts, remind people about tasks they might forget, and, more generally, help people achieve their goals.

Bibliography
Bibliography

[3] C. Aggarwal and K. Subbian. Evolutionary network analysis: A survey. ACM Computing Surveys, 47:10:1–10:36, 2014. (Cited on page 146.)
[4] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. SIGIR, pages 19–26. ACM, 2006. (Cited on pages 7 and 24.)
[5] L. Akoglu, M. McGlohon, and C. Faloutsos. Oddball: Spotting anomalies in weighted graphs. PAKDD, pages 410–421. Springer, 2010. (Cited on page 109.)
[6] E. Amitay, A. Darlow, D. Konopnicki, and U. Weiss. Queries as anchors: selection by association. HYPERTEXT, pages 193–201. ACM, 2005. (Cited on page 23.)
[7] M. G. Armentano and A. A. Amandi. Recognition of user intentions for interface agents with variable order markov models. UMAP, pages 173–184. Springer-Verlag, 2009. (Cited on page 25.)
[8] R. Baeza-Yates, C. Hurtado, M. Mendoza, and G. Dupret. Modeling user search behavior. Latin American Web Congress, page 242. IEEE, 2005. (Cited on page 7.)
[9] N. Balasubramanian and S. Cucerzan. Topic pages: An alternative to the ten blue links. ICSC, pages 353–360. IEEE, 2010. (Cited on page 17.)
[10] K. Balog and M. de Rijke. Finding experts and their details in e-mail corpora. WWW, pages 1035–1036. ACM, 2006. (Cited on page 25.)
[11] K. Balog and K. Nørvåg. On the use of semantic knowledge bases for temporally-aware entity retrieval. ESAIR, pages 1–2. ACM, 2012. (Cited on page 23.)
[12] K. Balog, M. Bron, and M. de Rijke. Category-based query modeling for entity search. ECIR, pages 319–331. Springer-Verlag, 2010. (Cited on pages 23 and 81.)
[13] K. Balog, A. de Vries, P. Serdyukov, P. Thomas, and T. Westerveld. Overview of the TREC 2009 entity track. TREC. NIST, 2010. (Cited on pages 22, 77, and 89.)
[14] K. Balog, P. Serdyukov, and A. de Vries. Overview of the TREC 2010 entity track. TREC. NIST, 2011. (Cited on pages 22, 77, and 89.)
[15] S. Bao, G. Xue, X. Wu, Y. Yu, B. Fei, and Z. Su. Optimizing web search using social annotations. WWW, pages 501–510. ACM, 2007. (Cited on page 23.)
[16] K. Barmpatsalou, D. Damopoulos, G. Kambourakis, and V. Katos. A critical review of 7 years of mobile device forensics. Digital Investigation, 10:323–349, 2013. (Cited on page 115.)
[17] D. Barreau and B. A. Nardi. Finding and reminding: File organization from the desktop. SIGCHI Bull., 27:39–43, 1995. (Cited on page 26.)
[18] D. Barua, J. Kay, B. Kummerfeld, and C. Paris. Modelling long term goals. UMAP, pages 1–12. Springer, 2014. (Cited on page 25.)
[19] A. E. C. Basave, A. Varga, M. Rowe, M. Stankovic, and A.-S. Dadzie. Making sense of microposts (#MSM2013) concept extraction challenge, 2013.
[20] H. Bast, B. Buchhold, and E. Haussmann. Semantic search on text and knowledge bases. Foundations and Trends in Information Retrieval, 10:119–271, 2016. (Cited on page 4.)
[21] M. Becker, B. Hachey, B. Alex, and C. Grover. Optimising selective sampling for bootstrapping named entity recognition. ICML Workshop on Learning with Multiple Views, pages 5–11, 2005. (Cited on page 22.)
[22] R. Bekkerman, A. McCallum, and G. Huang. Automatic categorization of email into folders: Benchmark experiments on Enron and SRI corpora. Center for Intelligent Information Retrieval, Technical Report IR-418, 2004. (Cited on pages 25 and 90.)
[23] D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. AAAIWS, pages 359–370. AAAI Press, 1994. (Cited on page 39.)
[24] B. Billerbeck, G. Demartini, C. Firan, T. Iofciu, and R. Krestel. Ranking entities using web search query logs. ECDL, pages 273–281. Springer, 2010. (Cited on page 23.)
[25] C. Bizer, T. Heath, and T. Berners-Lee. Linked data – the story so far. International Journal on Semantic Web and Information Systems, 5:1–22, 2009. (Cited on page 17.)
[26] K. Bontcheva and D. Rout. Making sense of social media streams through semantics: a survey. Semantic Web Journal, 5:373–403, 2014. (Cited on pages 4 and 17.)
[27] K. Bontcheva, L. Derczynski, A. Funk, M. A. Greenwood, D. Maynard, and N. Aswani. TwitIE: An open-source information extraction pipeline for microblog text. RANLP, pages 83–90. Association for Computational Linguistics, 2013. (Cited on page 20.)
[28] A. Bordes, J. Weston, R. Collobert, and Y. Bengio. Learning structured embeddings of knowledge bases. AAAI, pages 301–306. AAAI Press, 2011. (Cited on page 24.)
[29] E. Boschee, R. Weischedel, and A. Zamanian. Automatic information extraction. In International Conference on Intelligence Analysis, 2005. (Cited on page 35.)
[30] M. Brandimonte, G. O. Einstein, and M. A. McDaniel, editors. Prospective Memory: Theory and Applications. Psychology Press (Taylor & Francis Group), 1996. (Cited on page 26.)
[31] A. Broder. A taxonomy of web search. SIGIR Forum, 36(2):3–10, 2002. (Cited on page 13.)
[32] A. Broder, E. Gabrilovich, V. Josifovski, G. Mavromatis, D. Metzler, and J. Wang. Exploiting site-level information to improve web search. CIKM, pages 1393–1396. ACM, 2010. (Cited on page 24.)
[33] M. Bron, B. Huurnink, and M. de Rijke. Linking archives using document enrichment and term selection. TPDL, pages 360–371. Springer, 2011. (Cited on page 19.)
[34] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. NLDB, pages 171–176. Springer-Verlag, 2012. (Cited on page 66.)
[35] R. Bunescu. Using encyclopedic knowledge for named entity disambiguation. EACL, pages 9–16. Association for Computational Linguistics, 2006. (Cited on page 21.)
[36] D. Carmel, M.-W. Chang, E. Gabrilovich, B.-J. P. Hsu, and K. Wang. ERD 2014: Entity recognition and disambiguation challenge. SIGIR Forum, 2014. (Cited on page 22.)
[37] D. Carmel, G. Halawi, L. Lewin-Eytan, Y. Maarek, and A. Raviv. Rank by time or by relevance?: Revisiting email search. CIKM, pages 283–292. ACM, 2015. (Cited on page 14.)
[38] V. R. Carvalho and W. W. Cohen. On the collective classification of email "speech acts". SIGIR, pages 345–352. ACM, 2005. (Cited on page 25.)
[39] V. R. Carvalho and W. W. Cohen. Ranking users for intelligent message addressing. ECIR, pages 321–333. Springer, 2008. (Cited on page 25.)
[40] E. Casey. Digital Evidence and Computer Crime: Forensic Science, Computers, and the Internet with Cdrom. Academic Press, Inc., 2011. (Cited on page 115.)
[41] T. Cassidy, Z. Chen, J. Artiles, H. Ji, H. Deng, L. Ratinov, J. Zheng, J. Han, and D. Roth. CUNY-UIUC-SRI TAC-KBP2011 entity linking system description. TAC. NIST, 2011. (Cited on page 19.)
[42] T. Cassidy, H. Ji, L.-A. Ratinov, A. Zubiaga, and H. Huang. Analysis and enhancement of wikification for microblogs with context expansion. COLING, pages 441–456. Association for Computational Linguistics, 2012. (Cited on pages 19 and 20.)
[43] P. Chaurasia, S. McClean, C. D. Nugent, and B. Scotney. A duration-based online reminder system. International Journal of Pervasive Computing and Communications, 10(3):337–366, 2014. (Cited on page 26.)
[44] W. W. Cohen, V. R. Carvalho, and T. M. Mitchell. Learning to classify email into "speech acts". EMNLP, pages 309–316. Association for Computational Linguistics, 2004. (Cited on page 25.)
[45] M. Collins. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. EMNLP, pages 1–8. Association for Computational Linguistics, 2002. (Cited on page 66.)
[46] G. V. Cormack, M. R. Grossman, B. Hedin, and D. W. Oard. Overview of the TREC 2010 legal track. TREC. NIST, 2011. (Cited on pages 25 and 104.)
[47] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. EMNLP, pages 708–716. Association for Computational Linguistics, 2007. (Cited on pages 19 and 20.)
[48] S. Cucerzan. TAC entity linking by performing full-document entity extraction and disambiguation. TAC. NIST, 2011. (Cited on page 19.)
[49] A. Culotta, R. Bekkerman, and A. McCallum. Extracting social networks and contact information from email and the web. Conference on Email and Anti-Spam (CEAS), 2004. (Cited on page 25.)
[50] M. Czerwinski, E. Horvitz, and S. Wilhite. A diary study of task switching and interruptions. CHI, pages 175–182. ACM, 2004. (Cited on page 26.)
[51] L. A. Dabbish, R. E. Kraut, S. Fussell, and S. Kiesler. Understanding email use: predicting action on a message. CHI, pages 691–700. ACM, 2005. (Cited on page 25.)
[52] J. Dalton, L. Dietz, and J. Allan. Entity query feature expansion using knowledge base links. SIGIR, pages 365–374. ACM, 2014. (Cited on page 17.)
[53] J. Dalton, J. R. Frank, E. Gabrilovich, M. Ringgaard, and A. Subramanya. FAKBA1: Freebase annotation of TREC KBA stream corpus, version 1 (release date 2015-01-26, format version 1, correction level 0), 2015. (Cited on pages 35 and 36.)
[54] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. WWW, pages 271–280. ACM, 2007. (Cited on page 2.)
[55] A. Davis, A. Veloso, A. S. da Silva, W. Meira, Jr., and A. H. F. Laender. Named entity disambiguation in streaming data. ACL, pages 815–824. Association for Computational Linguistics, 2012. (Cited on page 145.)
[56] M. De Choudhury, M. Gamon, S. Counts, and E. Horvitz. Predicting depression via social media. ICWSM. AAAI, 2013. (Cited on page 2.)
[57] A. de Vries, A.-M. Vercoustre, J. A. Thom, N. Craswell, and M. Lalmas, editors. Overview of the INEX 2007 entity ranking track, INEX, 2008. Springer. (Cited on pages 22, 77, and 89.)
[58] G. Demartini, A. de Vries, T. Iofciu, and J. Zhu, editors. Overview of the INEX 2008 entity ranking track, INEX, 2009. Springer-Verlag.
[59] G. Demartini, T. Iofciu, and A. de Vries, editors. Overview of the INEX 2009 entity ranking track, INEX, 2010. Springer-Verlag. (Cited on pages 22, 77, and 89.)
[60] G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. ZenCrowd: Leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. WWW, pages 469–478. ACM, 2012. (Cited on page 145.)
[61] L. Derczynski, D. Maynard, N. Aswani, and K. Bontcheva. Microblog-genre noise and impact on semantic annotation accuracy. HYPERTEXT, pages 21–30. Association for Computational Linguistics, 2013. (Cited on page 20.)
[62] L. Derczynski, D. Maynard, G. Rizzo, M. van Erp, G. Gorrell, R. Troncy, J. Petrak, and K. Bontcheva. Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51:32–49, 2015. (Cited on page 20.)
[63] R. W. DeVaul, B. Clarkson, and A. S. Pentland. The memory glasses: Towards a wearable, context aware, situation-appropriate reminder system. In Proceedings of the SIGCHI Workshop on Situated Interaction in Ubiquitous Computing, 2000. (Cited on pages 25 and 26.)
[64] A. K. Dey and G. D. Abowd. CybreMinder: A context-aware system for supporting reminders. HUC, pages 172–186, London, UK, 2000. Springer-Verlag. (Cited on page 25.)
[65] D. Di Castro, Z. Karnin, L. Lewin-Eytan, and Y. Maarek. You've got mail, and here is what you could do with it!: Analyzing and predicting actions on email messages. WSDM, pages 307–316. ACM, 2016. (Cited on page 25.)
[66] C. P. Diehl, L. Getoor, and G. Namata. Name reference resolution in organizational email archives. SDM, pages 20–22. SIAM, 2006. (Cited on page 25.)
[67] J. Diesner, T. L. Frantz, and K. M. Carley. Communication networks from the Enron email corpus "it's always about the people. Enron is no different". Computational & Mathematical Organization Theory, 11:201–228, 2005. (Cited on page 25.)
[68] R. K. Dismukes. Prospective memory in workplace and everyday situations. Current Directions in Psychological Science, 21(4):215–220, 2012. (Cited on page 26.)
[69] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. KDD, pages 601–610. ACM, 2014. (Cited on page 20.)
[70] S. Dumais, E. Cutrell, J. Cadiz, G. Jancke, R. Sarin, and D. C. Robbins. Stuff I've seen: A system for personal information retrieval and re-use. SIGIR, pages 72–79. ACM, 2003. (Cited on page 14.)
[71] M. Efron, P. Organisciak, and K. Fenlon. Improving retrieval of short texts through document expansion. SIGIR, pages 911–920, 2012. (Cited on page 23.)
[72] G. O. Einstein and M. A. McDaniel. Normal aging and prospective memory. Journal of Experimental Psychology: Learning, Memory and Cognition, 16(4):717–726, 1990. (Cited on page 26.)
[73] N. Eiron and K. S. McCurley. Analysis of anchor text for web search. SIGIR, pages 459–460. ACM, 2003. (Cited on pages 23 and 79.)
[74] J. Ellis and L. Kvavilashvili. Prospective memory in 2000: Past, present, and future directions. Applied Cognitive Psychology, 14(7):S1–S9, 2000. (Cited on page 26.)
[75] T. Elsayed and D. W. Oard. Modeling identity in archival collections of email: A preliminary study. Conference on Email and Anti-Spam (CEAS), pages 95–103, 2006. (Cited on page 25.)
[76] T. Elsayed, D. W. Oard, and G. Namata. Resolving personal names in email using context expansion. ACL, pages 265–268. Association for Computational Linguistics, 2008. (Cited on page 25.)
[77] M. Eslami, A. Rickman, K. Vaccaro, A. Aleyasen, A. Vuong, K. Karahalios, K. Hamilton, and C. Sandvig. "I always assumed that I wasn't really that close to [her]": Reasoning about invisible algorithms in news feeds. CHI, pages 153–162. ACM, 2015. (Cited on page 2.)
[78] Y. Fang and M.-W. Chang. Entity linking on microblogs with spatial and temporal signals. Transactions of the Association for Computational Linguistics, 2:259–272, 2014. (Cited on page 20.)
[79] M. Färber, A. Rettinger, and B. El Asmar. On emerging entity detection. EKAW, pages 223–238. Springer, 2016.
[80] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183–1210, 1969. (Cited on page 18.)
[81] N. Fernandez, J. Fisteus, L. Sanchez, and E. Martin. WebTLab: A co-occurrence-based approach to KBP 2010 entity-linking task. TAC. NIST, 2010. (Cited on page 19.)
[82] S. Fertig, E. Freeman, and D. Gelernter. Lifestreams: An alternative to the desktop metaphor. CHI, pages 410–411. ACM, 1996. (Cited on page 26.)
[83] S. Fertig, E. Freeman, and D. Gelernter. "Finding and reminding" reconsidered. SIGCHI Bull., 28(1):66–69, 1996. (Cited on page 26.)
[84] T. Finin, W. Murnane, A. Karandikar, N. Keller, J. Martineau, and M. Dredze. Annotating named entities in twitter data with crowdsourcing. CSLDAMT, pages 80–88. Association for Computational Linguistics, 2010. (Cited on page 145.)
[85] J. R. Finkel, C. D. Manning, and A. Y. Ng. Solving the problem of cascading errors: Approximate bayesian inference for linguistic annotation pipelines. EMNLP, pages 618–626. Association for Computational Linguistics, 2006. (Cited on pages 19, 20, and 60.)
[86] J. R. Firth. A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis, pages 1–32, 1957. (Cited on page 65.)
[87] J. R. Firth. Papers in Linguistics, 1934-1951. Oxford University Press, 1957. (Cited on page 2.)
[88] J. R. Frank, M. Kleiman-Weiner, D. A. Roberts, F. Niu, C. Zhang, C. Re, and I. Soboroff. Building an entity-centric stream filtering test collection for TREC 2012. TREC. NIST, 2012. (Cited on page 36.)
[89] J. R. Frank, M. Kleiman-Weiner, D. A. Roberts, E. M. Voorhees, and I. Soboroff. Evaluating stream filtering for entity profile updates in TREC 2012, 2013, and 2014. TREC. NIST, 2014. (Cited on pages 35 and 36.)
[90] J. H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002. (Cited on page 132.)
[91] E. Gabrilovich and S. Markovitch. Wikipedia-based semantic interpretation for natural language processing. Journal of Artificial Intelligence Research, 34:443–498, 2009. (Cited on page 17.)
[92] J. Gao, W. Yuan, X. Li, K. Deng, and J.-Y. Nie. Smoothing clickthrough data for web search ranking. SIGIR, pages 355–362. ACM, 2009. (Cited on page 23.)
[93] C. Gârbacea, D. Odijk, D. Graus, I. Sijaranamual, and M. de Rijke. Combining multiple signals for semanticizing tweets: University of Amsterdam at #Microposts2015, 2015.
[94] F. Girardin, F. Calabrese, F. Dal Fiore, C. Ratti, and J. Blat. Digital footprinting: Uncovering tourists with user-generated content. IEEE Pervasive Computing, 7:36–43, 2008. (Cited on page 1.)
[95] Google. Building for the next moment. Webpage, May 2015. (Cited on page 13.)
[96] L. A. Granka. Measuring agenda setting with online search traffic: Influences of online and traditional media, 2010. (Cited on page 58.)
[97] D. Graus, T. Kenter, M. Bron, E. Meij, and M. de Rijke. Context-based entity linking – University of Amsterdam at TAC 2012. TAC, page 13. NIST, 2012. (Cited on pages 11 and 12.)
[98] D. Graus, M.-H. Peetz, D. Odijk, O. de Rooij, and M. de Rijke. yourHistory – semantic linking for a personalized timeline of historic events. Volume 4 of LinkedUp Veni Competition on Linked and Open Data. CEUR-WS, 2013. (Cited on page 12.)
[99] D. Graus, Z. Ren, M. de Rijke, D. van Dijk, H. Henseler, and N. van der Knaap. Semantic search in e-discovery: An interdisciplinary approach. Volume 6 of DESI V, 2013. (Cited on page 12.)
[100] D. Graus, D. Odijk, M. Tsagkias, W. Weerkamp, and M. de Rijke. Semanticizing search engine queries: The University of Amsterdam at the ERD 2014 challenge. ERD, pages 69–74. ACM, 2014. (Cited on pages 11 and 12.)
[101] D. Graus, M. Tsagkias, L. Buitinck, and M. de Rijke. Generating pseudo-ground truth for predicting new concepts in social streams. ECIR, pages 268–298. Springer, 2014. (Cited on page 11.)
[102] D. Graus, D. van Dijk, M. Tsagkias, W. Weerkamp, and M. de Rijke. Recipient recommendation in enterprises using communication graphs and email content. SIGIR, pages 1079–1082. ACM, 2014.
[103] D. Graus, P. N. Bennett, R. W. White, and E. Horvitz. Analyzing and predicting task reminders. UMAP, pages 7–15. ACM, 2016.
[104] D. Graus, M. Tsagkias, W. Weerkamp, E. Meij, and M. de Rijke. Dynamic collective entity representations for entity ranking. WSDM, pages 595–604. ACM, 2016.
[105] D. Graus, D. Odijk, and M. de Rijke. The birth of collective memories: Analyzing emerging entities in text streams. arXiv:1701.04039, 2017. (Cited on page 11.)
[106] M. Grbovic, G. Halawi, Z. Karnin, and Y. Maarek. How many folders do you really need?: Classifying mail into a handful of categories. CIKM, pages 869–878. ACM, 2014. (Cited on page 25.)
[107] R. Grishman and B. Sundheim. Message understanding conference-6: A brief history. COLING, pages 466–471. Association for Computational Linguistics, 1996. (Cited on page 18.)
[108] S. Guo, M.-W. Chang, and E. Kiciman. To link or not to link? A study on end-to-end tweet entity linking. NAACL, pages 1020–1030. Association for Computational Linguistics, 2013. (Cited on page 22.)
[109] Y. Guo, W. Che, T. Liu, and S. Li. A graph-based method for entity linking. IJCNLP, pages 1010–1018, 2011. (Cited on page 19.)
[110] I. Guy. Searching by talking: Analysis of voice queries on mobile web search. SIGIR, pages 35–44. ACM, 2016. (Cited on page 13.)
[111] B. Hachey, W. Radford, J. Nothman, M. Honnibal, and J. R. Curran. Evaluating entity linking with Wikipedia.
Artificial Intelligence, 194:130–150, 2013. (Cited on page 21.)
[112] M. Halbwachs. La mémoire collective. Albin Michel, 1950. (Cited on page 31.)
[113] D. A. Hanauer, K. Wentzell, N. Laffel, and L. M. Laffel. Computerized Automated Reminder Diabetes System (CARDS): e-mail and SMS cell phone text messaging reminders to support diabetes management. Diabetes Technology & Therapeutics, 11(2):99–106, 2009. (Cited on page 26.)
[114] F. Hasibi, K. Balog, and S. E. Bratsberg. Exploiting entity linking in queries for entity retrieval. ICTIR, pages 209–218. ACM, 2016. (Cited on pages 17 and 23.)
[115] A. Hassan Awadallah, R. W. White, P. Pantel, S. T. Dumais, and Y.-M. Wang. Supporting complex search tasks. CIKM, pages 829–838. ACM, 2014. (Cited on page 14.)
[116] J. He, M. de Rijke, M. Sevenster, R. van Ommering, and Y. Qian. Generating links to background knowledge: a case study using narrative radiology reports. CIKM, pages 1867–1876. ACM, 2011. (Cited on page 17.)
[117] B. Hedin, S. Tomlinson, J. R. Baron, and D. W. Oard. Overview of the TREC 2009 legal track. TREC. NIST, 2010. (Cited on pages 25 and 104.)
[118] H. Henseler. Network-based filtering for large email collections in e-discovery. Artif. Intell. Law, 2010. (Cited on page 104.)
[119] J. L. Hicks, R. L. Marsh, and G. I. Cook. Task interference in time-based, event-based, and dual intention prospective memory conditions. Journal of Memory and Language, 53(3):430–444, 2005. (Cited on page 26.)
[120] J. Hoffart, Y. Altun, and G. Weikum. Discovering emerging entities with ambiguous names. WWW, pages 385–396. ACM, 2014. (Cited on pages 21, 75, and 145.)
[121] K. Hofmann, S. Whiteson, and M. de Rijke. A probabilistic method for inferring preferences from clicks. CIKM, pages 249–258, 2011. (Cited on page 79.)
[122] K. Hofmann, A. Schuth, S. Whiteson, and M. de Rijke. Reusing historical interaction data for faster online learning to rank for IR. WSDM, pages 183–192. ACM, 2013. (Cited on pages 2 and 7.)
[123] K. Hong, P. Pei, Y.-Y. Wang, and D. Hakkani-Tur. Entity ranking for descriptive queries. In Spoken Language Technology Workshop, pages 200–205. IEEE, 2014. (Cited on page 23.)
[124] G.-J. Houben, G. McCalla, F. Pianesi, and M. Zancanaro, editors. User Modeling, Adaptation, and Personalization: 17th International Conference, UMAP 2009, formerly UM and AH, 2009. Springer Berlin Heidelberg. (Cited on page 7.)
[125] M. Hu, A. Sun, and E.-P. Lim. Comments-oriented document summarization: understanding documents with readers' feedback. SIGIR, pages 291–298. ACM, 2008. (Cited on page 67.)
[126] M. Hu, S. Liu, F. Wei, Y. Wu, J. Stasko, and K.-L. Ma. Breaking news on twitter. CHI, pages 2751–2754. ACM, 2012. (Cited on page 59.)
[127] Q. Hu, S. Bao, J. Xu, W. Zhou, M. Li, and H. Huang. Towards building effective email recipient recommendation service. SOLI, pages 398–403. IEEE, 2012. (Cited on page 25.)
[128] R. S. Ieong. FORZA – digital forensics investigation framework that incorporate legal issues. Digital Investigation, 3, Supplement:29–36, 2006. (Cited on page 2.)
[129] M. J. Intons-Peterson and J. Fournier. External and internal memory aids: When and how often do we use them? Journal of Experimental Psychology: General, 115(3):267–280, 1986. (Cited on page 26.)
[130] H. Ji, R. Grishman, H. T. Dang, X. Li, K. Griffit, and J. Ellis. Overview of the TAC 2011 Knowledge Base Population Track. TAC. NIST, 2011. (Cited on pages 19 and 21.)
[131] T. Joachims. Optimizing search engines using clickthrough data. KDD, pages 133–142. ACM, 2002. (Cited on page 24.)
[132] H. Joho, L. A. Azzopardi, and W. Vanderbauwhede. A survey of patent users: An analysis of tasks, behavior, search functionality and system requirements. IIiX, pages 13–24. ACM, 2010. (Cited on page 14.)
[133] E. Kamar and E. Horvitz. Jogger: Models for context-sensitive reminding. AAMAS, pages 1089–1090.
International Foundation for Autonomous Agents and Multiagent Systems, 2011. (Cited on pages 25 and 26.)
[134] M. Kampf, E. Tessenow, D. Y. Kenett, and J. W. Kantelhardt. The detection of emerging trends using wikipedia traffic data and context networks. PLoS ONE, 10(12):1–19, 2016. (Cited on page 21.)
[135] R. Kaptein, P. Serdyukov, A. de Vries, and J. Kamps. Entity ranking using wikipedia as a pivot. CIKM, pages 69–78. ACM, 2010. (Cited on page 23.)
[136] P. N. Kapur, E. L. Glisky, and B. A. Wilson. Technological memory aids for people with memory deficits. Neuropsychological Rehabilitation, 14(1-2):41–60, 2004. (Cited on page 26.)
[137] B. Keegan, D. Gergle, and N. Contractor. Hot off the wiki: Structures and dynamics of wikipedia's coverage of breaking news events. American Behavioral Scientist, 57(5):595–622, 2013. (Cited on page 21.)
[138] C. Kemp and K. Ramamohanarao. Long-term learning for web search engines. PKDD, pages 263–274. Springer, 2002. (Cited on pages 23, 99, and 145.)
[139] T. Kenter, M. Wevers, P. Huijnen, and M. de Rijke. Ad hoc monitoring of vocabulary shifts over time. CIKM, pages 1191–1200. ACM, 2015. (Cited on page 145.)
[140] M. A. Khalid, V. Jijkoun, and M. de Rijke. The impact of named entity normalization on information retrieval for question answering. ECIR, pages 705–710. Springer-Verlag, 2008. (Cited on pages 4 and 17.)
[141] B. Klimt and Y. Yang. The enron corpus: A new dataset for email classification research. ECML, pages 217–226. Springer, 2004. (Cited on pages 24 and 107.)
[142] Y. Koren, E. Liberty, Y. Maarek, and R. Sandler. Automatically tagging email by leveraging other users' folders. KDD, pages 913–921. ACM, 2011. (Cited on page 24.)
[143] Z. Kozareva. Bootstrapping named entity recognition with automatically generated gazetteer lists. EACL-SRW, pages 15–21. Association for Computational Linguistics, 2006. (Cited on page 22.)
[144] S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective annotation of wikipedia entities in web text. KDD, pages 457–466. ACM, 2009. (Cited on page 21.)
[145] R. Kumar and A. Tomkins. A characterization of online browsing behavior. WWW, pages 561–570. ACM, 2010. (Cited on page 77.)
[146] M. Lamming and M. Flynn. "Forget-me-not" intimate computing in support of human memory. Cognitive Studies, 2(1), 1995. (Cited on page 25.)
[147] C.-J. Lee and W. B. Croft. Incorporating social anchors for ad hoc retrieval. OAIR, pages 181–188. Le Centre de Hautes Études Internationales d'Informatique Documentaire, 2013. (Cited on pages 23 and 79.)
[148] J. Leskovec and E. Horvitz. Planetary-scale views on a large instant-messaging network. WWW, pages 915–924. ACM, 2008. (Cited on page 24.)
[149] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. KDD, pages 497–506. ACM, 2009. (Cited on pages 58 and 59.)
[150] A. Leuski. Email is a stage: Discovering people roles from email archives. SIGIR, pages 502–503. ACM, 2004. (Cited on page 25.)
[151] S. C. Lewis and O. Westlund. Big data and journalism. Digital Journalism, 3:447–466, 2015. (Cited on page 2.)
[152] F. Li, M. L. Lee, and W. Hsu. Entity profiling with varying source reliabilities. KDD, pages 1146–1155. ACM, 2014. (Cited on page 84.)
[153] H. Li and J. Xu. Semantic matching in search. Foundations and Trends in Information Retrieval, 7:343–469, 2014. (Cited on pages 14 and 17.)
[154] T. W. Liao. Clustering of time series data – a survey. Pattern Recognition, 38:1857–1874, 2005. (Cited on page 39.)
[155] T. Lin, Mausam, and O. Etzioni. No noun phrase left behind: Detecting and typing unlinkable entities. EMNLP-CoNLL, pages 893–903. Association for Computational Linguistics, 2012. (Cited on page 21.)
[156] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. Learning entity and relation embeddings for knowledge graph completion. AAAI, pages 2181–2187. AAAI Press, 2015. (Cited on page 24.)
[157] J. Liu, P. Dolan, and E. R. Pedersen. Personalized news recommendation based on click behavior. IUI, pages 31–40. ACM, 2010. (Cited on page 7.)
[158] J. Liu, E. Pedersen, and P. Dolan. Personalized news recommendation based on click behavior. IUI, pages 31–40. ACM, 2010. (Cited on page 7.)
[159] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009. (Cited on page 90.)
[160] P. J. Ludford, D. Frankowski, K. Reily, K. Wilms, and L. Terveen. Because I carry my cell phone anyway: functional location-based reminder applications. CHI, pages 889–898. ACM, 2006. (Cited on page 26.)
[161] P. Lyman and H. R. Varian. How much information, 2003. [Online; accessed 16-August-2016]. (Cited on page 2.)
[162] C. Macdonald, R. L. Santos, I. Ounis, and B. He. About learning models with multiple query-dependent features.
ACM TOIS, 31(3):11:1–11:39, 2013. (Cited on pages 24, 79, and 85.)
[163] T. W. Malone. How do people organize their desks?: Implications for the design of office information systems. ACM TOIS, 1(1):99–112, 1983. (Cited on page 26.)
[164] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. (Cited on page 15.)
[165] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval, chapter 8, pages 151–175. Cambridge University Press, 2008. (Cited on page 16.)
[166] L. Manovich. Trending: the promises and the challenges of big social data. Debates in the Digital Humanities, pages 460–475, 2011. (Cited on page 2.)
[167] G. Marchionini. Exploratory search: From finding to understanding. Communications of the ACM, 49:41–46, 2006. (Cited on page 2.)
[168] J. Markoff. Armies of expensive lawyers, replaced by cheaper software. New York Times, March 2011. (Cited on page 24.)
[169] N. Marmasse and C. Schmandt. Location-aware information delivery with commotion. Handheld and Ubiquitous Computing, pages 157–171. Springer-Verlag, 2000. (Cited on page 26.)
[170] K. Massoudi, M. Tsagkias, M. de Rijke, and W. Weerkamp. Incorporating query expansion and quality indicators in searching microblog posts. ECIR, pages 362–367. Springer, 2011. (Cited on page 67.)
[171] A. McCallum, X. Wang, and A. Corrada-Emmanuel. Topic and role discovery in social networks with experiments on enron and academic email. Journal of Artificial Intelligence Research, 30:249–272, 2007. (Cited on page 107.)
[172] M. McGee-Lennon, M. Wolters, R. McLachlan, S. Brewster, and C. Hall. Name that tune: musicons as reminders in the home. CHI, pages 2803–2806. ACM, 2011. (Cited on page 25.)
[173] M. R. McGee-Lennon, M. K. Wolters, and S. Brewster. User-centred multimodal reminders for assistive living. CHI, pages 2105–2114. ACM, 2011. (Cited on page 26.)
[174] P. McNamee and H. Dang. Overview of the TAC 2009 knowledge base population track. TAC. NIST, 2009. (Cited on page 18.)
[175] O. Medelyan, I. Witten, and D. Milne. Topic indexing with wikipedia. In Proceedings of AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy, pages 19–24. AAAI Press, 2008. (Cited on page 19.)
[176] E. Meij, W. Weerkamp, and M. de Rijke. Adding semantics to microblog posts. WSDM, pages 563–572. ACM, 2012. (Cited on pages 5, 19, 20, 65, and 99.)
[177] P. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. I-Semantics, pages 1–8. ACM, 2011. (Cited on page 19.)
[178] S. Meraz. Is there an elite hold? Traditional media to social media agenda setting influence in blog networks. Journal of Computer-Mediated Communication, 14:682–707, 2009. (Cited on page 59.)
[179] S. Meraz. Using time series analysis to measure intermedia agenda-setting influence in traditional media and political blog networks. Journalism & Mass Communication Quarterly, 88:176–194, 2011. (Cited on page 59.)
[180] D. Metzler, J. Novak, H. Cui, and S. Reddy. Building enriched document representations using aggregated anchor text. SIGIR, pages 219–226. ACM, 2009. (Cited on pages 23 and 79.)
[181] R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. CIKM, pages 233–242. ACM, 2007. (Cited on pages 18, 19, and 140.)
[182] D. Milne and I. Witten. Learning to link with Wikipedia. CIKM, pages 509–518, New York, 2008. ACM. (Cited on pages 19, 20, and 67.)
[183] E. Minkov, R. C. Wang, and W. W. Cohen. Extracting personal names from emails: Applying named entity recognition to informal text. HLT, pages 443–450, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics. (Cited on page 25.)
[184] E. Minkov, W. W. Cohen, and A. Y. Ng. Contextual search and name disambiguation in email using graphs. SIGIR, pages 27–34. ACM, 2006. (Cited on page 25.)
[185] G. Mishne and J. Lin. Twanchor text: A preliminary study of the value of tweets as anchor text. SIGIR, pages 1159–1160. ACM, 2012. (Cited on pages 79 and 83.)
[186] S. Mizzaro. Relevance: The whole history. Journal of the Association for Information Science and Technology, 48:810–832, 1997. (Cited on page 1.)
[187] A. Mohan, Z. Chen, and K. Q. Weinberger. Web-search ranking with initialized gradient boosted regression trees. YLRC, pages 77–89. JMLR.org, 2011. (Cited on page 90.)
[188] M. Morita and Y. Shinoda. Information filtering based on user behavior analysis and best match text retrieval. SIGIR, pages 272–281. Springer-Verlag New York, Inc., 1994. (Cited on page 7.)
[189] E. Morozov. The rise of data and the death of politics.
The Guardian, 2014. (Cited on page 1.)
[190] D. Mottin, T. Palpanas, and Y. Velegrakis. Entity ranking using click-log information. Intelligent Data Analysis, 17(5):837–856, 2013. (Cited on page 23.)
[191] D. Müllner. fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, 53(1):1–18, 2013. (Cited on page 40.)
[192] N. Nakashole, T. Tylenda, and G. Weikum. Fine-grained semantic typing of emerging entities. ACL, pages 1488–1497. The Association for Computer Linguistics, 2013. (Cited on page 21.)
[193] F. Niu, C. Zhang, C. Ré, and J. Shavlik. Elementary: Large-scale knowledge-base construction via machine learning and statistical inference. International Journal on Semantic Web and Information Systems, 8:42–73, 2012. (Cited on page 20.)
[194] M. Noll and C. Meinel. The metadata triumvirate: Social annotations, anchor texts and search queries. WI-IAT, pages 640–647. IEEE, 2008. (Cited on pages 23 and 83.)
[195] J. Nothman, N. Ringland, W. Radford, T. Murphy, and J. R. Curran. Learning multilingual named entity recognition from wikipedia. Artificial Intelligence, 194:151–175, 2013. (Cited on page 22.)
[196] D. Oard, W. Webber, D. Kirsch, and S. Golitsynskiy. Avocado research email collection LDC2015T03. DVD, February 2015. (Cited on page 107.)
[197] D. W. Oard and W. Webber. Information retrieval for e-discovery. Foundations and Trends in Information Retrieval, 7(2-3):99–237, 2013. (Cited on pages 2, 3, 4, 7, 99, 115, 139, and 145.)
[198] B. O'Conaill and D. Frohlich. Timespace in the workplace: Dealing with interruptions. CHI, pages 262–263. ACM, 1995. (Cited on page 26.)
[199] D. Odijk, E. Meij, and M. de Rijke. Feeding the second screen: Semantic linking based on subtitles. OAIR, pages 9–16. Le Centre de Hautes Études Internationales d'Informatique Documentaire, 2013. (Cited on pages 5, 19, 65, and 66.)
[200] J. O'Neill, C. Privault, J.-M. Renders, V. Ciriza, and G. Bauduin. Disco: Intelligent help for document review. DESI III, 2009. (Cited on pages 7 and 103.)
[201] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999. (Cited on page 104.)
[202] C. Pal and A. McCallum. Cc prediction with graphical models. CEAS, 2006. (Cited on page 25.)
[203] P. Pantel, E. Crestan, A. Borkovsky, A.-M. Popescu, and V. Vyas. Web-scale distributional similarity and entity set expansion. EMNLP, pages 938–947. Association for Computational Linguistics, 2009. (Cited on page 22.)
[204] K. Partridge and B. Price. Enhancing Mobile Recommender Systems with Activity Inference, pages 307–318. UMAP. Springer Berlin Heidelberg, 2009. (Cited on page 25.)
[205] C. Pentzold. Fixing the floating gap: The online encyclopaedia Wikipedia as a global memory place. Memory Studies, 2:255–272, 2009. (Cited on pages 4 and 31.)
[206] J. R. Pérez-Agüera, J. Arroyo, J. Greenberg, J. P. Iglesias, and V. Fresno. Using BM25F for semantic search. SEMSEARCH, article 2. ACM, 2010. (Cited on page 24.)
[207] S. Petrovic, M. Osborne, R. McCreadie, C. Macdonald, and I. Ounis. Can twitter replace newswire for breaking news? ICWSM. AAAI Press, 2013. (Cited on pages 58 and 59.)
[208] S. Pichai. Keynote. Google I/O, May 2016. (Cited on page 13.)
[209] D. Ploch, L. Hennig, E. De Luca, and S. Albayrak. DAI approaches to the TAC-KBP 2011 entity linking task. TAC. NIST, 2011. (Cited on page 20.)
[210] A. Qadir, M. Gamon, P. Pantel, and A. H. Awadallah. Activity modeling in email. NAACL-HLT, pages 1452–1462. Association for Computational Linguistics, 2016. (Cited on page 25.)
[211] W. Radford, B. Hachey, M. Honnibal, J. Nothman, and J. Curran. Naïve but effective NIL clustering baselines – CMCRC at TAC 2011. TAC. NIST, 2011. (Cited on page 19.)
[212] S. Radicati. Email statistics report, 2016-2020 – executive summary, June 2016. (Cited on page 103.)
[213] L. Ramshaw and M. Marcus. Text Chunking using Transformation-Based Learning, pages 157–176. Third Workshop on Very Large Corpora. Springer, 1999. (Cited on page 66.)
[214] S. Rani and G. Sikka. Recent techniques of clustering of time series data: A survey. International Journal of Computer Applications, 52:1–9, 2012. (Cited on page 38.)
[215] D. Rao, P. McNamee, and M. Dredze. Streaming cross document entity coreference resolution. COLING, pages 1050–1058. Association for Computational Linguistics, 2010. (Cited on page 145.)
[216] D. Rao, P. McNamee, and M. Dredze. Entity Linking: Finding Extracted Entities in a Knowledge Base, pages 93–115. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013. (Cited on pages 4 and 17.)
[219] B. J. Rhodes and T. Starner. Remembrance agent: A continuously running automated information retrieval system. Proceedings of Practical Application of Intelligent Agents and Multi-Agent Technology, pages 487–495, 1996. (Cited on page 26.)
[220] M. Richardson. Learning about the world through long-term query logs. ACM Transactions on the Web, 2(4):21:1–21:27, 2008. (Cited on page 24.)
[221] A. Ritter, Mausam, O. Etzioni, and S. Clark. Open domain event extraction from twitter. KDD, pages 1104–1112. ACM, 2012. (Cited on page 20.)
[222] S. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. TREC, pages 109–126. NIST, 1995. (Cited on page 15.)
[223] S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. CIKM, pages 42–49. ACM, 2004. (Cited on pages 24, 85, 89, 99, and 145.)
[224] R. Rosenfeld. Two decades of statistical language modeling: Where do we go from here. Proceedings of the IEEE, 88(8):1270–1278, 2000. (Cited on page 15.)
[225] M. Roth, A. Ben-David, D. Deutscher, G. Flysher, I. Horn, A. Leichtberg, N. Leiser, Y. Matias, and R. Merom. Suggesting friends using the implicit social graph. KDD, pages 233–242. ACM, 2010. (Cited on page 25.)
[226] F. Scholer, H. E. Williams, and A. Turpin. Query association surrogates for web search: Research articles. JASIST, 55(7):637–650, 2004. (Cited on pages 23 and 81.)
[227] A. Schuth. Search Engines that Learn from Their Users. PhD thesis, Informatics Institute, University of Amsterdam, 2016. (Cited on page 2.)
[228] M. F. Schwartz and D. C. M. Wood. Discovering shared interests among people using graph analysis of global electronic mail traffic. Communications of the ACM, 36:78–89, 1993. (Cited on page 25.)
[229] A. J. Sellen, G. Louie, J. E. Harris, and A. J. Wilkins. What brings intentions to mind? An in situ study of prospective memory. Memory, 5(4):483–507, 1997. (Cited on page 26.)
[230] W. Shen, J. Wang, and J. Han. Entity linking with a knowledge base: Issues, techniques, and solutions. TKDE, 27(2):443–460, 2014. (Cited on pages 18, 23, and 81.)
[231] J. Shetty and J. Adibi. Discovering important nodes through graph entropy the case of enron email database. LinkKDD, pages 74–81. ACM, 2005. (Cited on page 25.)
[232] A. Singhal and F. Pereira. Document expansion for speech retrieval. SIGIR, pages 34–41. ACM, 1999. (Cited on pages 23 and 79.)
[233] I. M. Soboroff, I. Ounis, C. Macdonald, and J. Lin. Overview of the TREC 2011 microblog track. TREC. NIST, 2012. (Cited on page 69.)
[234] R. Socher, D. Chen, C. D. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge base completion. NIPS, pages 926–934. Curran Associates, Inc., 2013. (Cited on page 24.)
[235] W. M. Soon, H. T. Ng, and D. C. Y. Lim. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27:521–544, 2001. (Cited on page 18.)
[236] K. Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21, 1972. (Cited on page 15.)
[237] J. Stasko, C. Görg, and Z. Liu. Jigsaw: Supporting investigative analysis through interactive visualization. Information Visualization, 7:118–132, 2008. (Cited on page 4.)
[238] B. Suh, G. Convertino, E. H. Chi, and P. Pirolli. The singularity is not near: Slowing growth of wikipedia. WikiSym, pages 8:1–8:10. ACM, 2009. (Cited on page 31.)
[239] K. M. Svore and C. J. Burges. A machine learning approach for improved BM25 retrieval. CIKM, pages 1811–1814. ACM, 2009. (Cited on page 24.)
[240] G. Szabo and B. A. Huberman. Predicting the popularity of online content. Communications of the ACM, 53(8):80–88, 2010. (Cited on page 89.)
[241] J. Tague, M. Nelson, and H. Wu. Problems in the simulation of bibliographic retrieval systems. SIGIR, pages 236–255. Butterworth & Co., 1980. (Cited on page 22.)
[242] O. Tene and J. Polonetsky. Big data for all: Privacy and user control in the age of analytics. Social Science Research Network Working Paper Series, 11:239–273, 2013. (Cited on page 1.)
[243] E. F. Tjong Kim Sang. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. COLING, pages 1–4. Association for Computational Linguistics, 2002. (Cited on page 18.)
[244] M. Tsagkias, W. Weerkamp, and M. de Rijke. News comments: Exploring, modeling, and online prediction. ECIR, pages 191–203. Springer, 2010. (Cited on page 89.)
[245] Y. Tu, L. Chen, M. Lv, Y. Ye, W. Huang, and G. Chen. iReminder: An intuitive location-based reminder that knows where you are going.
International Journal of Human–Computer Interaction, 29(12):838–850, 2013. (Cited on page 26.)
[246] G. Tuchman. Objectivity as strategic ritual: An examination of newsmen's notions of objectivity. American Journal of Sociology, 77:660–679, 1972. (Cited on page 58.)
[247] Z. Tufekci. Algorithmic harms beyond facebook and google: Emergent challenges of computational agency symposium essays. Colorado Technology Law Journal, 13:203, 2015. (Cited on page 1.)
[248] T. Tylenda, R. Angelova, and S. Bedathur. Towards time-aware link prediction in evolving social networks. SNA-KDD, article 9. ACM, 2009. (Cited on page 146.)
[249] P. Vakkari. A theory of the task-based information retrieval process: a summary and generalisation of a longitudinal study. Journal of Documentation, 57:44–60, 2001. (Cited on page 14.)
[250] D. van Dijk, H. Henseler, and M. de Rijke. Semantic search in e-discovery. DESI IV, 2011. (Cited on pages 4 and 139.)
[251] D. van Dijk, D. Graus, Z. Ren, H. Henseler, and M. de Rijke. Who is involved? Semantic search for e-discovery. DESI VI, 2015. (Cited on page 12.)
[252] A.-M. Vercoustre, J. A. Thom, and J. Pehcevski. Entity ranking in wikipedia. SAC, pages 1101–1106. ACM, 2008. (Cited on page 23.)
[253] M. Vlachos, C. Meek, Z. Vagena, and D. Gunopulos. Identifying similarities, periodicities and bursts for online search queries. SIGMOD, pages 131–142. ACM, 2004. (Cited on pages 38, 39, and 43.)
[254] U. Von Luxburg, R. C. Williamson, and I. Guyon. Clustering: Science or art? UTLW, pages 65–79. JMLR.org, 2011. (Cited on pages 40, 60, and 140.)
[255] N. Voskarides, D. Odijk, M. Tsagkias, W. Weerkamp, and M. de Rijke. Query-dependent contextualization of streaming data. ECIR, pages 706–712. Springer, 2014. (Cited on page 21.)
[256] D. P. Wallace and C. Van Fleet. From the editors: The democratization of information? Wikipedia as a reference resource. Reference & User Services Quarterly, 45:100–103, 2005. (Cited on page 4.)
[257] J. H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963. (Cited on page 40.)
[258] W. Weerkamp and M. de Rijke. Credibility-inspired ranking for blog post retrieval. Information Retrieval, 15(3-4):243–277, 2012. (Cited on page 67.)
[259] T. Westerveld, W. Kraaij, and D. Hiemstra. Retrieving web pages using content, links, urls and anchors. TREC. NIST, 2001. (Cited on pages 23, 79, and 82.)
[260] R. W. White and S. M. Drucker. Investigating behavioral variability in web search. WWW, pages 21–30. ACM, 2007. (Cited on page 24.)
[261] R. W. White, N. P. Tatonetti, N. H. Shah, R. B. Altman, and E. Horvitz. Web-scale pharmacovigilance: listening to signals from the crowd. Journal of the American Medical Informatics Association, 20(3):404–408, 2013. (Cited on page 2.)
[262] R. W. White, M. Richardson, and W.-t. Yih. Questions vs. queries in informational search tasks. WWW, pages 135–136. ACM, 2015. (Cited on page 14.)
[263] Wikipedia. Hillary Clinton email controversy — Wikipedia, the free encyclopedia, 2016. [Online; accessed 19-September-2016]. (Cited on page 2.)
[264] Wikipedia. Enron scandal — Wikipedia, the free encyclopedia, 2016. [Online; accessed 19-September-2016]. (Cited on page 2.)
[265] Wikipedia. Panama Papers — Wikipedia, the free encyclopedia, 2016. [Online; accessed 19-September-2016]. (Cited on page 2.)
[266] Wikipedia. Information published by WikiLeaks — Wikipedia, the free encyclopedia, 2016. [Online; accessed 19-September-2016]. (Cited on page 2.)
[267] Wikipedia. Wikipedia:Size of Wikipedia — Wikipedia, the free encyclopedia, 2016. [Online; accessed 12-September-2016]. (Cited on page 4.)
[268] J.-w. Ahn, P. Brusilovsky, J. Grady, D. He, and R. Florian. Semantic annotation based exploratory search for information analysts. Information Processing & Management, 46(4):383–402, 2010. (Cited on pages 3, 4, 17, 77, and 139.)
[269] D. Wu, W. S. Lee, N. Ye, and H. L. Chieu. Domain adaptive bootstrapping for named entity recognition. EMNLP, pages 1523–1532. Association for Computational Linguistics, 2009. (Cited on page 22.)
[270] M. Wu, D. Hawking, A. Turpin, and F. Scholer. Using anchor text for homepage and topic distillation search tasks. JASIST, 63(6):1235–1255, 2012. (Cited on pages 23 and 79.)
[271] G.-R. Xue, H.-J. Zeng, Z. Chen, Y. Yu, W.-Y. Ma, W. Xi, and W. Fan. Optimizing web search using web click-through data. CIKM, pages 118–126. ACM, 2004. (Cited on page 23.)
[272] T. Yasseri, R. Sumi, and J. Kertész. Circadian patterns of Wikipedia editorial activity: A demographic analysis. PLoS ONE, 7(1):1–8, 2012. (Cited on page 21.)
[273] T. Yasseri, R. Sumi, A. Rung, A. Kornai, and J. Kertész. Dynamics of conflicts in wikipedia. PLoS ONE, 7(6):1–12, 2012. (Cited on pages 21 and 64.)
[274] K. Yelupula and S. Ramaswamy. Social network analysis for email classification. ACM-SE 46, pages 469–474. ACM, 2008. (Cited on page 25.)
[275] E. Yom-Tov and E. Gabrilovich. Postmarket drug surveillance without trial costs: Discovery of adverse drug reactions through large-scale analysis of web search queries. Journal of Medical Internet Research, 15:e124, 2013. (Cited on page 2.)
[276] W. Youyou, M. Kosinski, and D. Stillwell. Computer-based personality judgments are more accurate than those made by humans. PNAS, 112:1036–1040, 2015. (Cited on page 1.)
[277] H. Zaragoza, N. Craswell, M. Taylor, S. Saria, and S. Robertson. Microsoft Cambridge at TREC-13: Web and hard tracks. TREC. NIST, 2004. (Cited on page 24.)
[278] W. Zhang, J. Su, C. L. Tan, and W. T. Wang. Entity linking leveraging: Automatically generated annotation. COLING, pages 1290–1298. Association for Computational Linguistics, 2010. (Cited on page 22.)
[279] Y. Zhou, L. Nie, O. Rouhani-Kalleh, F. Vasile, and S. Gaffney. Resolving surface forms to wikipedia topics. COLING, pages 1335–1343. Association for Computational Linguistics, 2010. (Cited on page 22.)
Armada, An EvolvingDatabase System
19 Valentin Robu (CWI)
Modeling Preferences,Strategic Reasoning and Collaboration in Agent-Mediated Electronic Markets
20 Bob van der Vecht (UU)
Adjustable Autonomy:Controling Influences on Decision Making
21 Stijn Vanderlooy (UM)
Ranking and Reliable Clas-sification
22 Pavel Serdyukov (UT)
Search For Expertise: Go-ing beyond direct evidence
23 Peter Hofgesang (VUA)
Modelling Web Usage ina Changing Environment
24 Annerieke Heuvelink (VUA)
Cognitive Models forTraining Simulations
25 Alex van Ballegooij (CWI)
RAM: Array DatabaseManagement through Relational Mapping
26 Fernando Koch (UU)
An Agent-Based Model forthe Development of Intelligent Mobile Services
27 Christian Glahn (OU)
Contextual Support of socialEngagement and Reflection on the Web
28 Sander Evers (UT)
Sensor Data Management withProbabilistic Models
29 Stanislav Pokraev (UT)
Model-Driven SemanticIntegration of Service-Oriented Applications
30 Marcin Zukowski (CWI)
Balancing vectorizedquery execution with bandwidth-optimized storage
31 Sofiya Katrenko (UvA)
A Closer Look at LearningRelations from Text
32 Rik Farenhorst (VUA)
Architectural KnowledgeManagement: Supporting Architects and Auditors
33 Khiet Truong (UT)
How Does Real Affect AffectAffect Recognition In Speech?
34 Inge van de Weerd (UU)
Advancing in SoftwareProduct Management: An Incremental Method En-gineering Approach
35 Wouter Koelewijn (UL)
Privacy en Poli-tiegegevens: Over geautomatiseerde normatieveinformatie-uitwisseling
36 Marco Kalz (OUN)
Placement Support for Learn-ers in Learning Networks
37 Hendrik Drachsler (OUN)
Navigation Support forLearners in Informal Learning Networks
38 Riina Vuorikari (OU)
Tags and self-organisation:a metadata ecology for learning resources in amultilingual context
39 Christian Stahl (TUE, Humboldt-Universitaet zuBerlin)
Service Substitution: A Behavioral Ap-proach Based on Petri Nets
40 Stephan Raaijmakers (UvT)
Multinomial Lan-guage Learning: Investigations into the Geometryof Language
41 Igor Berezhnyy (UvT)
Digital Analysis of Paint-ings
42 Toine Bogers (UvT)
Recommender Systems forSocial Bookmarking
43 Virginia Nunes Leal Franqueira (UT)
FindingMulti-step Attacks in Computer Networks usingHeuristic Search and Mobile Ambients
44 Roberto Santana Tapia (UT)
Assessing Business-ITAlignment in Networked Organizations
45 Jilles Vreeken (UU)
Making Pattern Mining Use-ful
46 Loredana Afanasiev (UvA)
Querying XML: Bench-marks and Recursion
Patterns that Matter
Work flows in Life Science
A Document EngineeringModel and Processing Framework for Multime-dia documents
Do You Know What I Know?Situational Awareness of Co-located Teams in Mul-tidisplay Environments
Predicting the Effectivenessof Queries and Retrieval Systems
Rapid Adaptation of VideoGame AI
Gesture interaction at a Dis-tance
Towards an Improved Reg-ulatory Framework of Free Software. Protectinguser freedoms in a world of software communitiesand eGovernments
A Politiele gegevensverwerk-ing en Privacy, Naar een effectieve waarborging
10 Rebecca Ong (UL)
Mobile Communication andProtection of Children
11 Adriaan Ter Mors (TUD)
The world according toMARP: Multi-Agent Route Planning
12 Susan van den Braak (UU)
Sensemaking softwarefor crime analysis
13 Gianluigi Folino (RUN)
High Performance DataMining using Bio-inspired techniques
14 Sander van Splunter (VUA)
Automated Web Ser-vice Reconfiguration
15 Lianne Bodenstaff (UT)
Managing DependencyRelations in Inter-Organizational Models
16 Sicco Verwer (TUD)
Efficient Identification ofTimed Automata, theory and practice
17 Spyros Kotoulas (VUA)
Scalable Discovery of Net-worked Resources: Algorithms, Infrastructure, Ap-plications
18 Charlotte Gerritsen (VUA)
Caught in the Act: In-vestigating Crime by Agent-Based Simulation
19 Henriette Cramer (UvA)
People’s Responses toAutonomous and Adaptive Systems
20 Ivo Swartjes (UT)
Whose Story Is It Anyway? HowImprov Informs Agency and Authorship of Emer-gent Narrative
21 Harold van Heerde (UT)
Privacy-aware data man-agement by means of data degradation
22 Michiel Hildebrand (CWI)
End-user Support forAccess toHeterogeneous Linked Data
23 Bas Steunebrink (UU)
The Logical Structure ofEmotions
24 Zulfiqar Ali Memon (VUA)
Modelling Human-Awareness for Ambient Agents: A Human Min-dreading Perspective
25 Ying Zhang (CWI)
XRPC: Efficient DistributedQuery Processing on Heterogeneous XQuery En-gines
26 Marten Voulon (UL)
Automatisch contracteren
27 Arne Koopman (UU)
Characteristic RelationalPatterns
28 Stratos Idreos (CWI)
Database Cracking: TowardsAuto-tuning Database Kernels
29 Marieke van Erp (UvT)
Accessing Natural His-tory: Discoveries in data cleaning, structuring,and retrieval
30 Victor de Boer (UvA)
Ontology Enrichment fromHeterogeneous Sources on the Web
31 Marcel Hiel (UvT)
An Adaptive Service OrientedArchitecture: Automatically solving Interoperabil-ity Problems
32 Robin Aly (UT)
Modeling Representation Uncer-tainty in Concept-Based Multimedia Retrieval
33 Teduh Dirgahayu (UT)
Interaction Design in Ser-vice Compositions
34 Dolf Trieschnigg (UT)
Proof of Concept: Concept-based Biomedical Information Retrieval
35 Jose Janssen (OU)
Paving the Way for LifelongLearning: Facilitating competence developmentthrough a learning path specification
36 Niels Lohmann (TUe)
Correctness of services andtheir composition
37 Dirk Fahland (TUe)
From Scenarios to compo-nents
38 Ghazanfar Farooq Siddiqui (VUA)
Integrativemodeling of emotions in virtual agents
39 Mark van Assem (VUA)
Converting and Integrat-ing Vocabularies for the Semantic Web
40 Guillaume Chaslot (UM)
Monte-Carlo TreeSearch
41 Sybren de Kinderen (VUA)
Needs-driven servicebundling in a multi-supplier setting: the computa-tional e3-service approach
42 Peter van Kranenburg (UU)
A Computational Ap-proach to Content-Based Retrieval of Folk SongMelodies
43 Pieter Bellekens (TUe)
An Approach towardsContext-sensitive and User-adapted Access to Het-erogeneous Data Sources, Illustrated in the Televi-sion Domain
44 Vasilios Andrikopoulos (UvT)
A theory and modelfor the evolution of software services
45 Vincent Pijpers (VUA) e3alignment: ExploringInter-Organizational Business-ICT Alignment
46 Chen Li (UT)
Mining Process Model Variants:Challenges, Techniques, Examples
47 Jahn-Takeshi Saito (UM)
Solving difficult gamepositions
48 Bouke Huurnink (UvA)
Search in AudiovisualBroadcast Archives
49 Alia Khairia Amin (CWI)
Understanding andsupporting information seeking tasks in multiplesources
50 Peter-Paul van Maanen (VUA)
Adaptive Supportfor Human-Computer Teams: Exploring the Useof Cognitive Models of Trust and Attention
51 Edgar Meij (UvA)
Combining Concepts and Lan-guage Models for Information Access
Variational Algorithms forBayesian Inference in Latent Gaussian Models
Organizing Agent Organi-zations. Syntax and Operational Semantics of anOrganization-Oriented Programming Language
CompositionalDesign and Verification of Component-Based In-formation Systems
Insights in ReinforcementLearning: Formal analysis and empirical evalua-tion of temporal-difference
Enterprise ArchitectureComing of Age: Increasing the Performance of anEmerging Discipline
Semantically-Enhanced Rec-ommendations in Cultural Heritage
Multimodal Information Presenta-tion for High Load Human Computer Interaction
BDI-based Generation ofRobust Task-Oriented Dialogues
Contextualised Mobile Mediafor Learning
10 Bart Bogaert (UvT)
Cloud Content Contention
11 Dhaval Vyas (UT)
Designing for Awareness: AnExperience-focused HCI Perspective
12 Carmen Bratosin (TUe)
Grid Architecture for Dis-tributed Process Mining
13 Xiaoyu Mao (UvT)
Airport under Control. Multi-agent Scheduling for Airport Ground Handling
14 Milan Lovric (EUR)
Behavioral Finance andAgent-Based Artificial Markets
15 Marijn Koolen (UvA)
The Meaning of Structure:the Value of Link Evidence for Information Re-trieval
16 Maarten Schadd (UM)
Selective Search in Gamesof Different Complexity
17 Jiyin He (UvA)
Exploring Topic Structure: Coher-ence, Diversity and Relatedness
18 Mark Ponsen (UM)
Strategic Decision-Making incomplex games
19 Ellen Rusman (OU)
The Mind’s Eye on PersonalProfiles
20 Qing Gu (VUA)
Guiding service-oriented softwareengineering: A view-based approach
21 Linda Terlouw (TUD)
Modularization and Specifi-cation of Service-Oriented Systems
22 Junte Zhang (UvA)
System Evaluation of ArchivalDescription and Access
23 Wouter Weerkamp (UvA)
Finding People andtheir Utterances in Social Media
24 Herwin van Welbergen (UT)
Behavior Genera-tion for Interpersonal Coordination with VirtualHumans On Specifying, Scheduling and RealizingMultimodal Virtual Human Behavior
25 Syed Waqar ul Qounain Jaffry (VUA)
Analysisand Validation of Models for Trust Dynamics
26 Matthijs Aart Pontier (VUA)
Virtual Agents forHuman Communication: Emotion Regulation andInvolvement-Distance Trade-Offs in EmbodiedConversational Agents and Robots
27 Aniel Bhulai (VUA)
Dynamic website optimiza-tion through autonomous management of designpatterns
28 Rianne Kaptein (UvA)
Effective Focused Retrievalby Exploiting Query Context and Document Struc-ture
29 Faisal Kamiran (TUe)
Discrimination-aware Clas-sification
30 Egon van den Broek (UT)
Affective Signal Process-ing (ASP): Unraveling the mystery of emotions
31 Ludo Waltman (EUR)
Computational and Game-Theoretic Approaches for Modeling Bounded Ra-tionality
32 Nees-Jan van Eck (EUR)
Methodological Ad-vances in Bibliometric Mapping of Science
33 Tom van der Weide (UU)
Arguing to Motivate De-cisions
34 Paolo Turrini (UU)
Strategic Reasoning in Interde-pendence: Logical and Game-theoretical Investi-gations
35 Maaike Harbers (UU)
Explaining Agent Behaviorin Virtual Training
36 Erik van der Spek (UU)
Experiments in seriousgame design: a cognitive approach
37 Adriana Burlutiu (RUN)
Machine Learning forPairwise Data, Applications for Preference Learn-ing and Supervised Network Inference
38 Nyree Lemmens (UM)
Bee-inspired DistributedOptimization
39 Joost Westra (UU)
Organizing Adaptation usingAgents in Serious Games
40 Viktor Clerc (VUA)
Architectural KnowledgeManagement in Global Software Development
41 Luan Ibraimi (UT)
Cryptographically EnforcedDistributed Data Access Control
42 Michal Sindlar (UU)
Explaining Behavior throughMental State Attribution
43 Henk van der Schuur (UU)
Process Improvementthrough Software Operation Knowledge
44 Boris Reuderink (UT)
Robust Brain-Computer In-terfaces
45 Herman Stehouwer (UvT)
Statistical LanguageModels for Alternative Sequence Selection
46 Beibei Hu (TUD)
Towards Contextualized Infor-mation Delivery: A Rule-based Architecture forthe Domain of Mobile Police Work
47 Azizi Bin Ab Aziz (VUA)
Exploring Computa-tional Models for Intelligent Support of Personswith Depression
48 Mark Ter Maat (UT)
Response Selection and Turn-taking for a Sensitive Artificial Listening Agent
49 Andreea Niculescu (UT)
Conversational inter-faces for task-oriented spoken dialogues: designaspects influencing interaction quality
Relationship Marketing forSMEs in Uganda
Adaptivity, emotion,and Rationality in Human and Ambient Agent Mod-els
Supporting Architecture Evo-lution by Mining Software Repositories
Development of Content Man-agement System-based Web Applications
Maturing InterorganisationalInformation Systems
Awareness Support forKnowledge Workers in Research Networks
When the GoingGets Tough: Exploring Agent-based Models of Hu-man Performance under Demanding Conditions
Kernel Methods for VesselTrajectories
Trust and Privacy Manage-ment Support for Context-Aware Service Platforms
10 David Smits (TUe)
Towards a Generic DistributedAdaptive Hypermedia Environment
11 J. C. B. Rantham Prabhakara (TUe)
Process Min-ing in the Large: Preprocessing, Discovery, andDiagnostics
12 Kees van der Sluijs (TUe)
Model Driven Designand Data Integration in Semantic Web InformationSystems
13 Suleman Shahid (UvT)
Fun and Face: Exploringnon-verbal expressions of emotion during playfulinteractions
14 Evgeny Knutov (TUe)
Generic Adaptation Frame-work for Unifying Adaptive Web-based Systems
15 Natalie van der Wal (VUA)
Social Agents. Agent-Based Modelling of Integrated Internal and SocialDynamics of Cognitive and Affective Processes
16 Fiemke Both (VUA)
Helping people by under-standing them: Ambient Agents supporting taskexecution and depression treatment
17 Amal Elgammal (UvT)
Towards a ComprehensiveFramework for Business Process Compliance
18 Eltjo Poort (VUA)
Improving Solution Architect-ing Practices
19 Helen Schonenberg (TUe)
What’s Next? Opera-tional Support for Business Process Execution
20 Ali Bahramisharif (RUN)
Covert Visual SpatialAttention, a Robust Paradigm for Brain-ComputerInterfacing
21 Roberto Cornacchia (TUD)
Querying Sparse Ma-trices for Information Retrieval
22 Thijs Vis (UvT)
Intelligence, politie en veilighei-dsdienst: verenigbare grootheden?
23 Christian Muehl (UT)
Toward Affective Brain-Computer Interfaces: Exploring the Neurophys-iology of Affect during Human Media Interaction
24 Laurens van der Werff (UT)
Evaluation of NoisyTranscripts for Spoken Document Retrieval
25 Silja Eckartz (UT)
Managing the Business CaseDevelopment in Inter-Organizational IT Projects:A Methodology and its Application
26 Emile de Maat (UvA)
Making Sense of Legal Text
27 Hayrettin Gurkok (UT)
Mind the Sheep! User Ex-perience Evaluation & Brain-Computer InterfaceGames
28 Nancy Pascall (UvT)
Engendering Technology Em-powering Women
29 Almer Tigelaar (UT)
Peer-to-Peer Information Re-trieval
30 Alina Pommeranz (TUD)
Designing Human-Centered Systems for Reflective Decision Making
31 Emily Bagarukayo (RUN)
A Learning by Con-struction Approach for Higher Order CognitiveSkills Improvement, Building Capacity and Infras-tructure
32 Wietske Visser (TUD)
Qualitative multi-criteriapreference representation and reasoning
33 Rory Sie (OUN)
Coalitions in Cooperation Net-works (COCOON)
34 Pavol Jancura (RUN)
Evolutionary analysis in PPInetworks and applications
35 Evert Haasdijk (VUA)
Never Too Old To Learn:On-line Evolution of Controllers in Swarm- andModular Robotics
36 Denis Ssebugwawo (RUN)
Analysis and Evalua-tion of Collaborative Modeling Processes
37 Agnes Nakakawa (RUN)
A Collaboration Processfor Enterprise Architecture Creation
38 Selmar Smit (VUA)
Parameter Tuning and Scien-tific Testing in Evolutionary Algorithms
39 Hassan Fatemi (UT)
Risk-aware design of valueand coordination networks
40 Agus Gunawan (UvT)
Information Access forSMEs in Indonesia
41 Sebastian Kelle (OU)
Game Design Patterns forLearning
42 Dominique Verpoorten (OU)
Reflection Amplifiersin self-regulated Learning
43 Anna Tordai (VUA)
On Combining AlignmentTechniques
44 Benedikt Kratz (UvT)
A Model and Language forBusiness-aware Transactions
45 Simon Carter (UvA)
Exploration and Exploitationof Multilingual Data for Statistical Machine Trans-lation
46 Manos Tsagkias (UvA)
Mining Social Media:Tracking Content and Predicting Behavior
47 Jorn Bakker (TUe)
Handling Abrupt Changes inEvolving Time-series Data
48 Michael Kaisers (UM)
Learning against Learning:Evolutionary dynamics of reinforcement learningalgorithms in strategic interactions
49 Steven van Kervel (TUD)
Ontologogy driven En-terprise Information Systems Engineering
50 Jeroen de Jong (TUD)
Heuristics in DynamicSceduling: a practical framework with a casestudy in elevator dispatching
News Analytics for FinancialDecision Support
MonetDB/DataCell: Lever-aging the Column-store Database Technology forEfficient and Scalable Stream Processing
Reasoning with Contextsin Description Logics
Coordinating autonomousplanning and scheduling
Groupware RequirementsEvolutions Patterns
The Data Cyclotron:Juggling Data and Queries for a Data WarehouseAudience
Quantifying IndividualPlayer Differences
Making enemies: cogni-tive modeling for opponent agents in fighter pilotsimulators
Metagenomic Data Analysis:Computational Methods and Applications
10 Jeewanie Jayasinghe Arachchige (UvT)
A UnifiedModeling Framework for Service Design
11 Evangelos Pournaras (TUD)
Multi-level Reconfig-urable Self-organization in Overlay Services
12 Marian Razavian (VUA)
Knowledge-driven Mi-gration to Services
13 Mohammad Safiri (UT)
Service Tailoring: User-centric creation of integrated IT-based homecareservices to support independent living of elderly
14 Jafar Tanha (UvA)
Ensemble Approaches to Semi-Supervised Learning Learning
15 Daniel Hennes (UM)
Multiagent Learning: Dy-namic Games and Applications
16 Eric Kok (UU)
Exploring the practical benefits ofargumentation in multi-agent deliberation
17 Koen Kok (VUA)
The PowerMatcher: Smart Co-ordination for the Smart Electricity Grid
18 Jeroen Janssens (UvT)
Outlier Selection and One-Class Classification
19 Renze Steenhuizen (TUD)
Coordinated Multi-Agent Planning and Scheduling
20 Katja Hofmann (UvA)
Fast and Reliable OnlineLearning to Rank for Information Retrieval
21 Sander Wubben (UvT)
Text-to-text generation bymonolingual machine translation
22 Tom Claassen (RUN)
Causal Discovery and Logic
23 Patricio de Alencar Silva (UvT)
Value ActivityMonitoring
24 Haitham Bou Ammar (UM)
Automated Transferin Reinforcement Learning
25 Agnieszka Anna Latoszek-Berendsen (UM)
Intention-based Decision Support. A new way ofrepresenting and implementing clinical guidelinesin a Decision Support System
26 Alireza Zarghami (UT)
Architectural Support forDynamic Homecare Service Provisioning
27 Mohammad Huq (UT)
Inference-based Frame-work Managing Data Provenance
28 Frans van der Sluis (UT)
When Complexity be-comes Interesting: An Inquiry into the InformationeXperience
29 Iwan de Kok (UT)
Listening Heads
30 Joyce Nakatumba (TUe)
Resource-Aware BusinessProcess Management: Analysis and Support
31 Dinh Khoa Nguyen (UvT)
Blueprint Model andLanguage for Engineering Cloud Applications
32 Kamakshi Rajagopal (OUN)
Networking ForLearning: The role of Networking in a LifelongLearner’s Professional Development
33 Qi Gao (TUD)
User Modeling and Personalizationin the Microblogging Sphere
34 Kien Tjin-Kam-Jet (UT)
Distributed Deep WebSearch
35 Abdallah El Ali (UvA)
Minimal Mobile HumanComputer Interaction
36 Than Lam Hoang (TUe)
Pattern Mining in DataStreams
37 Dirk B¨orner (OUN)
Ambient Learning Displays
38 Eelco den Heijer (VUA)
Autonomous Evolution-ary Art
39 Joop de Jong (TUD)
A Method for Enterprise On-tology based Design of Enterprise Information Sys-tems
40 Pim Nijssen (UM)
Monte-Carlo Tree Search forMulti-Player Games
41 Jochem Liem (UvA)
Supporting the ConceptualModelling of Dynamic Systems: A Knowledge En-gineering Perspective on Qualitative Reasoning
42 L´eon Planken (TUD)
Algorithms for Simple Tem-poral Reasoning
43 Marc Bron (UvA)
Exploration and Contextualiza-tion through Interaction and Concepts
Studies in Learning MonotoneModels from Data
Combining System Dynam-ics with a Domain Modeling Method
Information Re-trieval for Children: Search Behavior and Solu-tions
Websites for chil-dren: search strategies and interface design -Three studies on children’s search performanceand evaluation
Knowledge Perspectiveson Advancing Dynamic Capability
Supporting NetworkedSoftware Development
Aligning Observed andModeled Behavior
Data Integration over Dis-tributed and Heterogeneous Data Endpoints
Toward Human-Level Artifi-cial Intelligence: Representation and Computationof Meaning in Natural Language
10 Ivan Salvador Razo Zapata (VUA)
Service ValueNetworks
11 Janneke van der Zwaan (TUD)
An Empathic Vir-tual Buddy for Social Support
12 Willem van Willigen (VUA)
Look Ma, No Hands:Aspects of Autonomous Vehicle Control
13 Arlette van Wissen (VUA)
Agent-Based Supportfor Behavior Change: Models and Applications inHealth and Safety Domains
14 Yangyang Shi (TUD)
Language Models With Meta-information
15 Natalya Mogles (VUA)
Agent-Based Analysis andSupport of Human Functioning in Complex Socio-Technical Systems: Applications in Safety andHealthcare
16 Krystyna Milian (VUA)
Supporting trial recruit-ment and design by automatically interpreting eli-gibility criteria
17 Kathrin Dentler (VUA)
Computing healthcarequality indicators automatically: Secondary Useof Patient Data and Semantic Interoperability
18 Mattijs Ghijsen (UvA)
Methods and Models forthe Design and Study of Dynamic Agent Organiza-tions
19 Vinicius Ramos (TUe)
Adaptive HypermediaCourses: Qualitative and Quantitative Evaluationand Tool Support
20 Mena Habib (UT)
Named Entity Extraction andDisambiguation for Informal Text: The MissingLink
21 Kassidy Clark (TUD)
Negotiation and Monitoringin Open Environments
22 Marieke Peeters (UU)
Personalized EducationalGames: Developing agent-supported scenario-based training
23 Eleftherios Sidirourgos (UvA/CWI)
Space Effi-cient Indexes for the Big Data Era
24 Davide Ceolin (VUA)
Trusting Semi-structuredWeb Data
25 Martijn Lappenschaar (RUN)
New network modelsfor the analysis of disease interaction
26 Tim Baarslag (TUD)
What to Bid and When toStop
27 Rui Jorge Almeida (EUR)
Conditional DensityModels Integrating Fuzzy and Probabilistic Repre-sentations of Uncertainty
28 Anna Chmielowiec (VUA)
Decentralized k-CliqueMatching
29 Jaap Kabbedijk (UU)
Variability in Multi-TenantEnterprise Software
30 Peter de Cock (UvT)
Anticipating Criminal Be-haviour
31 Leo van Moergestel (UU)
Agent Technology inAgile Multiparallel Manufacturing and ProductSupport
32 Naser Ayat (UvA)
On Entity Resolution in Proba-bilistic Data
33 Tesfa Tegegne (RUN)
Service Discovery ineHealth
34 Christina Manteli (VUA)
The Effect of Governancein Global Software Development: Analyzing Trans-active Memory Systems
35 Joost van Ooijen (UU)
Cognitive Agents in VirtualWorlds: A Middleware Design Approach
36 Joos Buijs (TUe)
Flexible Evolutionary Algo-rithms for Mining Structured Process Models
37 Maral Dadvar (UT)
Experts and Machines UnitedAgainst Cyberbullying
38 Danny Plass-Oude Bos (UT)
Making brain-computer interfaces better: improving usabilitythrough post-processing
39 Jasmina Maric (UvT)
Web Communities, Immigra-tion, and Social Capital
40 Walter Omona (RUN)
A Framework for Knowl-edge Management Using ICT in Higher Education
41 Frederic Hogenboom (EUR)
Automated Detectionof Financial Events in News Text
42 Carsten Eijckhof (CWI/TUD)
Contextual Multidi-mensional Relevance Models
43 Kevin Vlaanderen (UU)
Supporting Process Im-provement using Method Increments
44 Paulien Meesters (UvT)
Intelligent Blauw:Intelligence-gestuurde politiezorg in gebiedsge-bonden eenheden
45 Birgit Schmitz (OUN)
Mobile Games for Learn-ing: A Pattern-Based Approach
46 Ke Tao (TUD)
Social Web Data Analytics: Rele-vance, Redundancy, Diversity
47 Shangsong Liang (UvA)
Fusion and Diversifica-tion in Information Retrieval
Machine Learning for Rele-vance of Information in Crisis Response
Smart auditing: InnovativeCompliance Checking in Customs Controls
Machine learning fornetwork data
Collaborations in OpenLearning Environments
Cryptographically En-forced Search Pattern Hiding
Business Process Qual-ity Computation: Computing Non-Functional Re-quirements to Improve Business Processes
Time-Aware OnlineReputation Analysis
Organizational Compliance: Anagent-based model for designing and evaluatingorganizational interactions
HCI Perspectives on Behav-ior Change Support Systems
10 Henry Hermans (OUN)
OpenU: design of an inte-grated system to support lifelong learning
11 Yongming Luo (TUe)
Designing algorithms forbig graph datasets: A study of computing bisimu-lation and joins
12 Julie M. Birkholz (VUA)
Modi Operandi of So-cial Network Dynamics: The Effect of Context onScientific Collaboration Networks
13 Giuseppe Procaccianti (VUA)
Energy-EfficientSoftware
14 Bart van Straalen (UT)
A cognitive approach tomodeling bad news conversations
15 Klaas Andries de Graaf (VUA)
Ontology-basedSoftware Architecture Documentation
16 Changyun Wei (UT)
Cognitive Coordination forCooperative Multi-Robot Teamwork
17 Andr´e van Cleeff (UT)
Physical and Digital Secu-rity Mechanisms: Properties, Combinations andTrade-offs
18 Holger Pirk (CWI)
Waste Not, Want Not!: Manag-ing Relational Data in Asymmetric Memories
19 Bernardo Tabuenca (OUN)
Ubiquitous Technologyfor Lifelong Learners
20 Lo¨ıs Vanh´ee (UU)
Using Culture and Values toSupport Flexible Coordination
21 Sibren Fetter (OUN)
Using Peer-Support to Ex-pand and Stabilize Online Learning
22 Zhemin Zhu (UT)
Co-occurrence Rate Networks
23 Luit Gazendam (VUA)
Cataloguer Support in Cul-tural Heritage
24 Richard Berendsen (UvA)
Finding People, Papers,and Posts: Vertical Search Algorithms and Evalu-ation
25 Steven Woudenberg (UU)
Bayesian Tools forEarly Disease Detection
26 Alexander Hogenboom (EUR)
Sentiment Analysisof Text Guided by Semantics and Structure
27 S´andor H´eman (CWI)
Updating compressedcolomn stores
28 Janet Bagorogoza (TiU)
KNOWLEDGE MAN-AGEMENT AND HIGH PERFORMANCE: TheUganda Financial Institutions Model for HPO
29 Hendrik Baier (UM)
Monte-Carlo Tree Search En-hancements for One-Player and Two-Player Do-mains
30 Kiavash Bahreini (OU)
Real-time MultimodalEmotion Recognition in E-Learning
31 Yakup Koc¸ (TUD)
On the robustness of PowerGrids
32 Jerome Gard (UL)
Corporate Venture Manage-ment in SMEs
33 Frederik Schadd (TUD)
Ontology Mapping withAuxiliary Resources
34 Victor de Graaf (UT)
Gesocial Recommender Sys-tems
35 Jungxao Xu (TUD)
Affective Body Language ofHumanoid Robots: Perception and Effects in Hu-man Robot Interaction
Recognition of Shapesby Humans and Machines
Optimizingmedication reviews through decision support: pre-scribing a better pill to swallow
Knowledge Work in Context:User Centered Knowledge Worker Support
Publishing and Consum-ing Linked Data
Expanded AcyclicQueries: Containment and an Application in Ex-plaining Missing Answers
Robust scheduling in anuncertain environment
Measuring and modelingnegative emotions for virtual training
A Link to the Past: Con-structing Historical Social Networks from Unstruc-tured Data
Trusting Crowd-sourced Information on Cultural Artefacts
10 George Karafotias (VUA)
Parameter Control forEvolutionary Algorithms
11 Anne Schuth (UvA)
Search Engines that Learnfrom Their Users
12 Max Knobbout (UU)
Logics for Modelling andVerifying Normative Multi-Agent Systems
13 Nana Baah Gyan (VUA)
The Web, Speech Tech-nologies and Rural Development in West Africa:An ICT4D Approach
14 Ravi Khadka (UU)
Revisiting Legacy Software Sys-tem Modernization
15 Steffen Michels (RUN)
Hybrid Probabilistic Log-ics: Theoretical Aspects, Algorithms and Experi-ments
16 Guangliang Li (UvA)
Socially Intelligent Au-tonomous Agents that Learn from Human Reward
17 Berend Weel (VUA)
Towards Embodied Evolutionof Robot Organisms
18 Albert Mero˜no Pe˜nuela (VUA)
Refining StatisticalData on the Web
19 Julia Efremova (Tu/e)
Mining Social Structuresfrom Genealogical Data
20 Daan Odijk (UvA)
Context & Semantics in News& Web Search
21 Alejandro Moreno C´elleri (UT)
From Traditionalto Interactive Playspaces: Automatic Analysis ofPlayer Behavior in the Interactive Tag Playground
22 Grace Lewis (VUA)
Software Architecture Strate-gies for Cyber-Foraging Systems
23 Fei Cai (UvA)
Query Auto Completion in Informa-tion Retrieval
24 Brend Wanders (UT)
Repurposing and Probabilis-tic Integration of Data: An Iterative and datamodel independent approach
25 Julia Kiseleva (TU/e)
Using Contextual Informa-tion to Understand Searching and Browsing Be-havior
26 Dilhan Thilakarathne (VUA)
In or Out of Control:Exploring Computational Models to Study the Roleof Human Awareness and Control in BehaviouralChoices, with Applications in Aviation and EnergyManagement Domains
27 Wen Li (TUD)
Understanding Geo-spatial Infor-mation on Social Media
28 Mingxin Zhang (TUD)
Large-scale Agent-basedSocial Simulation: A study on epidemic predictionand control
29 Nicolas H¨oning (TUD)
Peak reduction in decen-tralised electricity systems -Markets and prices forflexible planning
30 Ruud Mattheij (UvT)
The Eyes Have It
31 Mohammad Khelghati (UT)
Deep web contentmonitoring
32 Eelco Vriezekolk (UT)
Assessing Telecommunica-tion Service Availability Risks for Crisis Organisa-tions
33 Peter Bloem (UvA)
Single Sample Statistics, exer-cises in learning from just one example
34 Dennis Schunselaar (TUe)
Configurable ProcessTrees: Elicitation, Analysis, and Enactment
35 Zhaochun Ren (UvA)
Monitoring Social Media:Summarization, Classification and Recommenda-tion
36 Daphne Karreman (UT)
Beyond R2D2: The de-sign of nonverbal interaction behavior optimizedfor robot-specific morphologies
37 Giovanni Sileno (UvA)
Aligning Law and Action:a conceptual and computational inquiry
38 Andrea Minuto (UT)
MATERIALS THAT MAT-TER: Smart Materials meet Art & Interaction De-sign
39 Merijn Bruijnes (UT)
Believable Suspect Agents:Response and Interpersonal Style Selection for anArtificial Suspect
40 Christian Detweiler (TUD)
Accounting for Valuesin Design
41 Thomas King (TUD)
Governing Governance: AFormal Framework for Analysing Institutional De-sign and Enactment Governance
42 Spyros Martzoukos (UvA)
Combinatorial andCompositional Aspects of Bilingual Aligned Cor-pora
43 Saskia Koldijk (RUN)
Context-Aware Support forStress Self-Management: From Theory to Practice
44 Thibault Sellam (UvA)
Automatic Assistants forDatabase Exploration
45 Bram van de Laar (UT)
Experiencing Brain-Computer Interface Control
46 Jorge Gallego Perez (UT)
Robots to Make youHappy
47 Christina Weber (UL)
Real-time foresight: Pre-paredness for dynamic innovation networks
48 Tanja Buttler (TUD)
Collecting Lessons Learned
49 Gleb Polevoy (TUD)
Participation and Interactionin Projects. A Game-Theoretic Analysis
50 Yan Wang (UVT)
The Bridge of Dreams: Towardsa Method for Operational Performance Alignmentin IT-enabled Service Supply Chains
Investigating Cyber-crime
Designing and Understand-ing Forensic Bayesian Networks using Argumenta-tion
Grid Manufactur-ing: A Cyber-Physical Approach with AutonomousProducts and Reconfigurable Manufacturing Ma-chines
MULTI-CORE PARAL-LELISM IN A COLUMN-STORE
Collaboration Behavior
Intelligent Information Sys-tems for Web Product Search
Insight in Information: fromAbstract to Anomaly
Detecting Interesting Differ-ences:Data Mining in Health Insurance Data us-ing Outlier Detection and Subgroup Discovery
Text as Social and CulturalData: A Computational Perspective on Variationin Text
10 Robby van Delden (UT) (Steering) InteractivePlay Behavior
11 Florian Kunneman (RUN)
Modelling patterns oftime and emotion in Twitter
12 Sander Leemans (UT)
Robust Process Mining withGuarantees
13 Gijs Huisman (UT)
Social Touch Technology: Ex-tending the reach of social touch through haptictechnology
14 Shoshannah Tekofsky (UvT)
You Are Who YouPlay You Are: Modelling Player Traits from VideoGame Behavior
15 Peter Berck, Radboud University (RUN)
Memory-Based Text Correction
16 Aleksandr Chuklin (UvA)
Understanding andModeling Users of Modern Search Engines
17 Daniel Dimov (VUA)
Crowdsourced Online Dis-pute Resolution
18 Ridho Reinanda (UvA)
Entity Associations forSearch
19 Jeroen Vuurens (TUD)
Proximity of Terms, Textsand Semantic Vectors in Information Retrieval
20 Mohammadbashir Sedighi (TUD)
Fostering En-gagement in Knowledge Sharing: The Role of Per-ceived Benefits, Costs and Visibility
21 Jeroen Linssen (UT)
Meta Matters in Interac-tive Storytelling and Serious Gaming (A Play onWorlds)
22 Sara Magliacane (VUA)
Logics for causal infer-ence under uncertainty
23 David Graus (UvA)
Samenvatting
Today we continuously, and at times unknowingly, leave behind digital traces, by browsing, sharing, liking, searching, watching, or listening to online content. When aggregated, these digital traces can offer valuable insights into our behavior, preferences, activities, and traits. While many worry about the accompanying threat to our privacy, the large-scale collection and use of our digital traces has also brought us much good: consider the access to unprecedented amounts of information and knowledge that search engines, learning from their users, provide us, or the discovery of new adverse drug reactions through the analysis of the queries submitted to a search engine.

Whether in online services, journalism, digital forensics, or science, we increasingly turn to digital traces to discover new information. Take, for example, the Enron scandal, the controversy around Hillary Clinton's email use, or the Panama Papers: cases that revolve around analyzing, searching, investigating, and turning inside out large amounts of digital traces to arrive at new insights, knowledge, and information.

This dissertation is about the discovery process in large-scale collections of digital traces. It sits at the intersection of search engine technology, language technology, and applied machine learning, and presents computational methods that support the exploration and sense-making of large-scale collections of digital traces. We focus on textual digital traces, such as emails and social media. Furthermore, we address two aspects that are central to the discovery process.

First, we focus on the content of digital traces. Our research subjects are the people, places, and organizations that are mentioned in textual digital traces. These so-called real-world entities are central to searching and making sense of the content of digital traces. In this part, we focus on analyzing, recognizing, and improving the retrievability of new entities that emerge in digital traces.

In the second part, we focus on the producers of digital traces. Here, not the content, but the context in which digital traces come into being is central. Our goal is to predict future activity by leveraging digital traces. We show that we can predict the recipient of an email by leveraging the communication networks of email traffic. In addition, we show that we can predict when someone plans to execute a certain task, by leveraging aggregated interaction data from a personal assistant application.

Summary

In the era of big data, we continuously — and at times unknowingly — leave behind digital traces, by browsing, sharing, posting, liking, searching, watching, and listening to online content. When aggregated, these digital traces can provide powerful insights into the behavior, preferences, activities, and traits of people.
While many have raised privacy concerns around the use of aggregated digital traces, it has undisputedly brought us many advances, from the search engines that learn from their users and enable our access to unforeseen amounts of data, knowledge, and information, to, e.g., the discovery of previously unknown adverse drug reactions from search engine logs.

Whether in online services, journalism, digital forensics, law, or research, we increasingly set out to explore large amounts of digital traces to discover new information. Consider, for instance, the Enron scandal, Hillary Clinton's email controversy, or the Panama Papers: cases that revolve around analyzing, searching, investigating, exploring, and turning upside down large amounts of digital traces to gain new insights, knowledge, and information. This discovery task is at its core about "finding evidence of activity in the real world."

This dissertation revolves around discovery in digital traces, and sits at the intersection of Information Retrieval, Natural Language Processing, and applied Machine Learning. We propose computational methods that aim to support the exploration and sense-making process of large collections of digital traces. We focus on textual traces, e.g., emails and social media streams, and address two aspects that are central to discovery in digital traces.

In the first part, we focus on the textual content of digital traces. Here, our entities of interest are the people, places, and organizations that appear in digital traces. These so-called real-world entities are central to enabling exploratory search in digital traces. More specifically, we analyze, predict, and show how to improve retrieval of newly emerging entities, i.e., previously unknown entities that surface in digital traces.

In the second part, our entities of interest are the people who leave behind the digital traces. Here, we focus on the contexts in which digital traces are created, and aim to predict users' future activity by leveraging their historic digital traces. We show we can predict email recipients by leveraging the email communication network, and we show we can predict when users plan to execute activities by leveraging logs of interactions with an intelligent assistant.
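As an illustration of the recipient-prediction task mentioned above, the sketch below ranks candidate recipients by how often a sender has emailed them before. This is a minimal, hypothetical frequency baseline, assuming a toy corpus of (sender, recipients) pairs; the `emails` data and the `predict_recipients` helper are invented for illustration and do not reproduce the models studied in this dissertation.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (sender, recipients) pairs, e.g., extracted
# from email headers; stands in for a real email archive.
emails = [
    ("alice", ["bob", "carol"]),
    ("alice", ["bob"]),
    ("alice", ["dave"]),
    ("bob", ["alice"]),
]

# Build each sender's communication network: counts of past recipients.
network = defaultdict(Counter)
for sender, recipients in emails:
    for recipient in recipients:
        network[sender][recipient] += 1

def predict_recipients(sender, k=3):
    """Rank candidate recipients for a sender by past co-communication."""
    return [recipient for recipient, _ in network[sender].most_common(k)]

print(predict_recipients("alice"))  # e.g., ['bob', 'carol', 'dave']
```

A real model would enrich such a baseline with, e.g., recency of contact or message content, but the underlying idea of ranking candidates from a sender's historic communication network matches the summary's description of the task.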