Rianne Kaptein | Researchain

Publication

Featured research published by Rianne Kaptein.

Artificial Intelligence | 2013

Exploiting the category structure of Wikipedia for entity ranking

Rianne Kaptein; Jaap Kamps

The Web has not only grown in size, but also changed its character, due to collaborative content creation and an increasing amount of structure. Current search engines find Web pages rather than information or knowledge, and leave it to the searchers to locate the sought information within the Web page. A considerable fraction of Web searches contains named entities. We focus on how the Wikipedia structure can help rank relevant entities directly in response to a search request, rather than retrieve an unorganized list of Web pages with relevant but also potentially redundant information about these entities. Our results demonstrate the benefits of using topical and link structure over the use of shallow statistics. Our main findings are the following. First, we examine whether Wikipedia category and link structure can be used to retrieve entities inside Wikipedia, as is the goal of the INEX (Initiative for the Evaluation of XML Retrieval) Entity Ranking task. Category information proves to be a highly effective source of information, leading to large and significant improvements in retrieval performance on all data sets. Secondly, we study how we can use category information to retrieve documents for ad hoc retrieval topics in Wikipedia. We study the differences between entity ranking and ad hoc retrieval in Wikipedia by analyzing the relevance assessments. In terms of retrieval performance, we also achieve significantly better results on ad hoc retrieval topics by exploiting the category information. Finally, we examine whether we can automatically assign target categories to ad hoc and entity ranking queries. Guessed categories lead to performance improvements that are not as large as when the categories are assigned manually, but they are still significant. We conclude that the category information in Wikipedia is a useful source of information that can be used for entity ranking as well as other retrieval tasks.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2012

Effective focused retrieval by exploiting query context and document structure

Rianne Kaptein

The classic IR model of the search process consists of three elements: query, documents and search results. A user looking to fulfil an information need formulates a query usually consisting of a small set of keywords summarising the information need. The goal of an IR system is to retrieve documents containing information which might be useful or relevant to the user. Throughout the search process there is a loss of focus, because keyword queries entered by users often do not suitably summarise their complex information needs, and IR systems do not sufficiently interpret the contents of documents, leading to result lists containing irrelevant and redundant information. The main research objective of this thesis is to exploit query context and document structure to provide more focused retrieval. The short keyword query used as input to the retrieval system can be supplemented with topic categories from structured Web resources such as DMOZ and Wikipedia. Topic categories can be used as query context to retrieve documents that are not only relevant to the query but also belong to a relevant topic category. Category information is especially useful for the task of entity ranking where the user is searching for a certain type of entity such as companies or persons. Category information can help to improve the search results by promoting pages that belong to relevant topic categories, or to categories similar to the relevant categories, in the ranking. By following external links and searching for the retrieved Wikipedia entities in a general Web collection, we can also exploit the structure of Wikipedia to rank entities on the general Web. Wikipedia, in contrast to the general Web, does not contain much redundant information. This absence of redundant information can be exploited by using Wikipedia as a pivot to search the general Web. A typical query returns thousands or millions of documents, but searchers hardly ever look beyond the first result page. Since space on the result page is limited, we can show only a few documents in the result list. Word clouds can be used to summarise groups of documents into a set of keywords which allows users to quickly get a grasp on the underlying data. Instead of using user-assigned tags we generate word clouds from the textual contents of documents themselves as well as the anchor text of Web documents. Improvements over word clouds that are created using simple term frequency counting include using a parsimonious term weighting scheme, including bigrams and biasing the word cloud towards the query. We find that word clouds can, to a certain degree, quickly convey the topic and relevance of a set of search results. Available online at: http://dare.uva.nl/record/39569.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2009

Who said what to whom?: capturing the structure of debates

Rianne Kaptein; Maarten Marx; Jaap Kamps

Transcripts of meetings are a document genre characterized by a complex narrative structure. The essence is not only what is said, but also by whom and to whom. This paper investigates whether we can use semantic annotations like the speaker in order to capture this debate structure, as well as the related content of the debate. The structure is visualized in a graph, while the content is condensed into word clouds that are created using a parsimonious language model. Evaluation shows that both tools adequately capture the structure and content of the debate at an aggregated level.
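
The abstract does not spell out how the parsimonious language model is estimated. As a rough illustration only, the sketch below follows the commonly described EM formulation, in which terms that the background collection already explains well lose probability mass in the document model. The function name `parsimonious_lm`, the mixing weight `lam`, the pruning `threshold` and the toy background probabilities are all assumptions made for this example, not values taken from the paper.

```python
from collections import Counter


def parsimonious_lm(doc_tokens, background, lam=0.1, iterations=50, threshold=1e-4):
    """Estimate a parsimonious document model with a simple EM loop.

    doc_tokens : list of tokens of one document (or one speaker's speeches)
    background : dict mapping term -> P(term | collection); assumed given
    lam        : weight of the document model in the mixture (assumed value)
    """
    tf = Counter(doc_tokens)
    total = sum(tf.values())
    # Start from the maximum-likelihood estimate.
    p_doc = {t: c / total for t, c in tf.items()}

    for _ in range(iterations):
        # E-step: expected count of each term attributed to the document model.
        expected = {}
        for t, p in p_doc.items():
            p_bg = background.get(t, 1e-9)  # tiny floor for unseen terms
            expected[t] = tf[t] * (lam * p) / ((1 - lam) * p_bg + lam * p)
        norm = sum(expected.values())
        # M-step: renormalise, then prune terms that fall below the threshold.
        p_new = {t: e / norm for t, e in expected.items()}
        kept = {t: p for t, p in p_new.items() if p >= threshold} or p_new
        z = sum(kept.values())
        p_doc = {t: p / z for t, p in kept.items()}
    return p_doc


# Toy usage: common words are explained by the background and fade out.
background = {"the": 0.5, "of": 0.3, "debate": 0.001, "minister": 0.001}
tokens = "the the the the debate debate minister of the".split()
model = parsimonious_lm(tokens, background)
cloud_terms = sorted(model, key=model.get, reverse=True)
```

The highest-probability terms in `model` would then be the candidates for display in a word cloud.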

European Conference on Information Retrieval | 2010

How different are language models and word clouds

Rianne Kaptein; Djoerd Hiemstra; Jaap Kamps

Word clouds are a summarised representation of a document’s text, similar to tag clouds which summarise the tags assigned to documents. Word clouds are similar to language models in the sense that they represent a document by its word distribution. In this paper we investigate the differences between word cloud and language modelling approaches, and specifically whether effective language modelling techniques also improve word clouds. We evaluate the quality of the language model using a system evaluation test bed, and evaluate the quality of the resulting word cloud with a user study. Our experiments show that different language modelling techniques can be applied to improve a standard word cloud that uses a TF weighting scheme in combination with stopword removal. Including bigrams in the word clouds and a parsimonious term weighting scheme are the most effective in both the system evaluation and the user study.
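
To make the baseline concrete, here is a minimal sketch of a word cloud term selector that uses raw term frequency with stopword removal, and optionally adds bigrams. It is not the authors' implementation: the tiny `STOPWORDS` set, the regular-expression tokeniser and the function name `word_cloud_terms` are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Deliberately tiny stopword list, assumed for the example only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "which"}


def word_cloud_terms(text, top_k=10, include_bigrams=True):
    """Return the top_k (term, count) pairs for a TF-weighted word cloud."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Unigrams: plain term frequency after stopword removal.
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    if include_bigrams:
        # Bigrams: adjacent token pairs in which neither word is a stopword.
        for first, second in zip(tokens, tokens[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[f"{first} {second}"] += 1
    return counts.most_common(top_k)


terms = dict(word_cloud_terms(
    "word clouds summarise documents and word clouds summarise tags"
))
```

In this toy input the bigram "word clouds" is counted twice, so it competes with the single words for a place in the cloud. A parsimonious weighting scheme would replace the raw counts with model probabilities.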

Lecture Notes in Computer Science | 2009

Finding Entities in Wikipedia Using Links and Categories

Rianne Kaptein; Jaap Kamps

In this paper we describe our participation in the INEX Entity Ranking track. We explored the relations between Wikipedia pages, categories and links. Our approach is to exploit both category and link information. Category information is used by calculating distances between document categories and target categories. Link information is used for relevance propagation and in the form of a document link prior. Both sources of information have value, but using category information leads to the biggest improvements.
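
The abstract does not say which distance is used between document categories and target categories. Purely as an illustrative sketch, the code below assumes that each category is represented by the terms of its pages, that the distance is a smoothed KL divergence between the resulting term distributions, and that each target category is matched to its closest document category. All names (`term_distribution`, `category_distance`), the smoothing constant `mu` and the toy data are hypothetical.

```python
import math
from collections import Counter


def term_distribution(tokens, vocab, mu=0.01):
    """Additively smoothed term distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def category_distance(doc_categories, target_categories, category_tokens):
    """Sum, over target categories, of the distance to the closest document category.

    category_tokens maps a category name to the tokens of its pages
    (an assumed representation). A lower value means a better category match.
    """
    if not doc_categories:
        return math.inf  # a page without categories cannot match any target
    vocab = {w for toks in category_tokens.values() for w in toks}
    models = {c: term_distribution(t, vocab) for c, t in category_tokens.items()}
    return sum(
        min(kl_divergence(models[t], models[d]) for d in doc_categories)
        for t in target_categories
    )


category_tokens = {
    "rivers": ["river", "water", "flow"],
    "lakes": ["lake", "water", "shore"],
    "composers": ["music", "opera", "symphony"],
}
near = category_distance(["lakes"], ["rivers"], category_tokens)
far = category_distance(["composers"], ["rivers"], category_tokens)
```

Such a distance would typically be combined with the text retrieval score, so that pages in categories close to the target categories move up in the ranking.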

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2010

Linking Wikipedia to the web

Rianne Kaptein; Pavel Serdyukov; Jaap Kamps

We investigate the task of finding links from Wikipedia pages to external web pages. Such external links significantly extend the information in Wikipedia with information from the Web at large, while retaining the encyclopedic organization of Wikipedia. We use a language modeling approach to create full-text and anchor text runs, and experiment with different document priors. In addition, we explore whether the social bookmarking site Delicious can be exploited to further improve our performance. We have constructed a test collection of 53 topics, which are Wikipedia pages on different entities. Our findings are that the anchor text index is a very effective method to retrieve home pages. URL class and anchor text length priors, and their combination, lead to the best results. Using Delicious on its own does not lead to very good results, but it does contain valuable information. Combining the best anchor text run and the Delicious run leads to further improvements.

Information Retrieval Facility Conference | 2011

Word clouds of multiple search results

Rianne Kaptein; Jaap Kamps

Search engine result pages (SERPs) are known as the most expensive real estate on the planet. Most queries yield millions of organic search results, yet searchers seldom look beyond the first handful of results. To make things worse, different searchers with different query intents may issue the exact same query. An alternative to showing individual web pages summarized by snippets is to represent whole groups of results. In this paper we investigate if we can use word clouds to summarize groups of documents, e.g. to give a preview of the next SERP, or clusters of topically related documents. We experiment with three word cloud generation methods (full-text, query biased and anchor text based clouds) and evaluate them in a user study. Our findings are: First, biasing the cloud towards the query does not lead to test persons better distinguishing relevance and topic of the search results, but test persons prefer them because differences between the clouds are emphasized. Second, anchor text clouds are to be preferred over full-text clouds. Anchor text contains fewer noisy words than the full text of documents. Third, we obtain moderately positive results on the relation between the selected word clouds and the underlying search results: there is exact correspondence in 70% of the subtopic matching judgments and in 60% of the relevance assessment judgments. Our initial experiments open up new possibilities to have SERPs reflect a far larger number of results by using word clouds to summarize groups of search results.

Lecture Notes in Computer Science | 2009

Focused search in books and Wikipedia: categories, links and relevance feedback

Marijn Koolen; Rianne Kaptein; Jaap Kamps

In this paper we describe our participation in INEX 2009 in the Ad Hoc Track, the Book Track, and the Entity Ranking Track. In the Ad Hoc Track we investigate focused link evidence, using only links from retrieved sections. The new collection is not only annotated with Wikipedia categories, but also with YAGO/WordNet categories. We explore how we can use both types of category information, in the Ad Hoc Track as well as in the Entity Ranking Track. Results in the Ad Hoc Track show Wikipedia categories are more effective than WordNet categories, and Wikipedia categories in combination with relevance feedback lead to the best results. Preliminary results of the Book Track show full-text retrieval is effective for high early precision. Relevance feedback further increases early precision. Our findings for the Entity Ranking Track are in direct opposition to our Ad Hoc findings, namely, that the WordNet categories are more effective than the Wikipedia categories. This marks an interesting difference between ad hoc search and entity ranking.

Exploiting Semantic Annotations in Information Retrieval | 2013

Recall oriented search on the web using semantic annotations

Rianne Kaptein; Egon L. van den Broek; Gijs Koot; Mirjam A.A. Huis in 't Veld

Web search engines are optimized for early precision, which makes it difficult to perform recall oriented tasks with them. In this article, we propose several ways to leverage semantic annotations and thereby increase the efficiency of recall oriented search tasks, with a focus on forensic investigation. Semantic annotations, such as temporal annotations, named entities, and domain context, can be used to rerank and cluster search result sets. In addition, domain context can be used to improve recall.

Journal of the Association for Information Science and Technology | 2011

Explicit extraction of topical context

Rianne Kaptein; Jaap Kamps

This article studies one of the main bottlenecks in providing more effective information access: the poverty on the query end. We explore whether users can classify keyword queries into categories from the DMOZ directory on different levels and whether this topical context can help retrieval performance. We have conducted a user study to let participants classify queries into DMOZ categories, either by freely searching the directory or by selection from a list of suggestions. Results of the study show that DMOZ categories are suitable for topic categorization. Both free search and list selection can be used to elicit topical context. Free search leads to more specific categories than the list selections. Participants in our study show moderate agreement on the categories they select, but broad agreement on the higher levels of chosen categories. The free search categories significantly improve retrieval effectiveness. The more general list selection categories and the top-level categories do not lead to significant improvements. Combining topical context with blind relevance feedback leads to better results than applying either of them separately. We conclude that DMOZ is a suitable resource for interacting with users on topical categories applicable to their query, and can lead to better search results.
