Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Anne Schuth is active.

Publication


Featured research published by Anne Schuth.


Conference on Information and Knowledge Management | 2014

Multileaved Comparisons for Fast Online Evaluation

Anne Schuth; Floor Sietsma; Shimon Whiteson; Damien Lefortier; Maarten de Rijke

Evaluation methods for information retrieval systems come in three types: offline evaluation, using static data sets annotated for relevance by human judges; user studies, usually conducted in a lab-based setting; and online evaluation, using implicit signals such as clicks from actual users. For the latter, preferences between rankers are typically inferred from implicit signals via interleaved comparison methods, which combine a pair of rankings and display the result to the user. We propose a new approach to online evaluation called multileaved comparisons that is useful in the prevalent case where designers are interested in the relative performance of more than two rankers. Rather than combining only a pair of rankings, multileaved comparisons combine an arbitrary number of rankings. The resulting user clicks then give feedback about how all these rankings compare to each other. We propose two specific multileaved comparison methods. The first, called team draft multileave, is an extension of team draft interleave. The second, called optimized multileave, is an extension of optimized interleave and is designed to handle cases where a large number of rankers must be multileaved. We present experimental results that demonstrate that both team draft multileave and optimized multileave can accurately determine all pairwise preferences among a set of rankers using far less data than the interleaving methods that they extend.
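
A minimal sketch of the team draft multileave construction, assuming each ranker is given as an ordered list of document ids; it illustrates the idea described above rather than reproducing the authors' implementation.

```python
# Team draft multileaving sketch: in each round the rankers take turns in a
# fresh random order, each contributing its highest-ranked document not yet
# shown; clicks credit the contributing ranker's team.
import random

def team_draft_multileave(rankings, length):
    """rankings: list of ranked lists of document ids."""
    combined, teams = [], {}
    while len(combined) < length:
        added = False
        order = list(range(len(rankings)))
        random.shuffle(order)                      # new team order each round
        for r in order:
            if len(combined) >= length:
                break
            doc = next((d for d in rankings[r] if d not in teams), None)
            if doc is not None:
                combined.append(doc)
                teams[doc] = r
                added = True
        if not added:                              # all rankings exhausted
            break
    return combined, teams

def credit_clicks(teams, clicked_docs, n_rankers):
    """Count, per ranker, the clicks on documents its team contributed."""
    credit = [0] * n_rankers
    for d in clicked_docs:
        if d in teams:
            credit[teams[d]] += 1
    return credit
```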


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2015

Predicting Search Satisfaction Metrics with Interleaved Comparisons

Anne Schuth; Katja Hofmann; Filip Radlinski

The gold standard for online retrieval evaluation is AB testing. Rooted in the idea of a controlled experiment, AB tests compare the performance of an experimental system (treatment) on one sample of the user population, to that of a baseline system (control) on another sample. Given an online evaluation metric that accurately reflects user satisfaction, these tests enjoy high validity. However, due to the high variance across users, these comparisons often have low sensitivity, requiring millions of queries to detect statistically significant differences between systems. Interleaving is an alternative online evaluation approach, where each user is presented with a combination of results from both the control and treatment systems. Compared to AB tests, interleaving has been shown to be substantially more sensitive. However, interleaving methods have so far focused on user clicks only, and lack support for more sophisticated user satisfaction metrics as used in AB testing. In this paper we present the first method for integrating user satisfaction metrics with interleaving. We show (1) how interleaving can be extended to directly match user signals and parameters of AB metrics, and (2) how parameterized interleaving credit functions can be automatically calibrated to predict AB outcomes. We also develop a new method for estimating the relative sensitivity of interleaving and AB metrics, and show that our interleaving credit functions improve agreement with AB metrics without sacrificing sensitivity. Our results, using 38 large-scale online experiments encompassing over 3 billion clicks in a web search setting, demonstrate up to a 22% improvement in agreement with AB metrics (constituting over a 50% error reduction), while maintaining sensitivity one to two orders of magnitude above that of the AB tests. This paves the way towards more sensitive and accurate online evaluation.
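
As a rough illustration of the calibration idea, the sketch below uses an assumed rank-decay credit function with a single parameter and picks the parameter value that maximizes agreement with known A/B outcomes over historical experiments; it is not the paper's formulation, and the data shapes are assumptions.

```python
# Parameterized click credit calibrated against A/B outcomes (simplified sketch).
import numpy as np

def interleaving_outcome(clicks, gamma):
    """clicks: list of (rank, team) pairs, team is 'T' (treatment) or 'C' (control).
    Click credit decays with rank as gamma ** rank; returns +1, -1, or 0."""
    credit = {'T': 0.0, 'C': 0.0}
    for rank, team in clicks:
        credit[team] += gamma ** rank
    return np.sign(credit['T'] - credit['C'])

def calibrate(experiments, ab_outcomes, grid=np.linspace(0.1, 1.0, 19)):
    """Choose the decay parameter whose interleaving preferences agree most
    often with the signed A/B outcomes of the same historical experiments."""
    def agreement(gamma):
        preds = [interleaving_outcome(clicks, gamma) for clicks in experiments]
        return float(np.mean([p == o for p, o in zip(preds, ab_outcomes)]))
    return max(grid, key=agreement)
```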


Web Search and Data Mining | 2016

Multileave Gradient Descent for Fast Online Learning to Rank

Anne Schuth; Harrie Oosterhuis; Shimon Whiteson; Maarten de Rijke

Modern search systems are based on dozens or even hundreds of ranking features. The dueling bandit gradient descent (DBGD) algorithm has been shown to effectively learn combinations of these features solely from user interactions. DBGD explores the search space by comparing a possibly improved ranker to the current production ranker. To this end, it uses interleaved comparison methods, which can infer with high sensitivity a preference between two rankings based only on interaction data. A limiting factor is that it can compare only to a single exploratory ranker. We propose an online learning to rank algorithm called multileave gradient descent (MGD) that extends DBGD to learn from so-called multileaved comparison methods that can compare a set of rankings instead of merely a pair. We show experimentally that MGD allows for better selection of candidates than DBGD without the need for more comparisons involving users. An important implication of our results is that orders of magnitude less user interaction data is required to find good rankers when multileaved comparisons are used within online learning to rank. Hence, fewer users need to be exposed to possibly inferior rankers and our method allows search engines to adapt more quickly to changes in user preferences.
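
A compact way to picture the MGD update is the sketch below; it assumes a linear ranker and an external multileaved comparison routine (`compare_fn` is a hypothetical name), and follows the mean-of-winners flavour of the update rather than any reference code.

```python
# One multileave gradient descent (MGD) step for a linear ranker.
import numpy as np

def mgd_step(w, compare_fn, n_candidates=9, delta=1.0, alpha=0.01, rng=np.random):
    """compare_fn(current_w, candidate_ws) must return the indices of the
    candidates preferred over the current ranker by a multileaved comparison
    inferred from user clicks."""
    u = rng.normal(size=(n_candidates, w.shape[0]))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # unit perturbation directions
    candidates = w + delta * u
    winners = list(compare_fn(w, candidates))
    if not winners:
        return w                                    # current ranker won: no update
    return w + alpha * u[winners].mean(axis=0)      # move toward mean of winning directions
```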


Cross Language Evaluation Forum | 2015

Overview of the Living Labs for Information Retrieval Evaluation (LL4IR) CLEF Lab 2015

Anne Schuth; Krisztian Balog; Liadh Kelly

In this paper we report on the first Living Labs for Information Retrieval Evaluation (LL4IR) CLEF Lab. Our main goal with the lab is to provide a benchmarking platform for researchers to evaluate their ranking systems in a live setting with real users in their natural task environments. For this first edition of the challenge we focused on two specific use-cases: product search and web search. Ranking systems submitted by participants were experimentally compared using interleaved comparisons to the production system from the corresponding use-case. In this paper we describe how these experiments were performed, what the resulting outcomes are, and conclude with some lessons learned.


Conference on Information and Knowledge Management | 2014

Head First: Living Labs for Ad-hoc Search Evaluation

Krisztian Balog; Liadh Kelly; Anne Schuth

The information retrieval (IR) community strives to make evaluation more centered on real users and their needs. The living labs evaluation paradigm, i.e., observing users in their natural task environments, offers great promise in this regard. Yet, progress in an academic setting has been limited. This paper presents the first living labs for the IR community benchmarking campaign initiative, taking two use-cases as test cases: local domain search on a university website and product search on an e-commerce site. There are many challenges associated with this setting, including incorporating results from experimental search systems into live production systems, and obtaining sufficiently many impressions from relatively low-traffic sites. We propose that head queries can be used to generate result lists offline, which are then interleaved with results of the production system for live evaluation. An API is developed to orchestrate the communication between commercial parties and benchmark participants. This campaign advances the living labs for IR evaluation methodology, and offers important insight into the role of living labs in this space.
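
The serving idea, precomputed experimental rankings for head queries interleaved with production results at request time, might look roughly like the sketch below; the data shapes, function names, and the plain team draft interleaving are illustrative assumptions, not the lab's actual API.

```python
# Serving sketch: head queries get an offline experimental ranking interleaved
# with the production ranking; tail queries fall back to production unchanged.
import random

def interleave_pair(exp, prod, length):
    """Plain team draft interleaving of an experimental and a production ranking."""
    combined, teams = [], {}
    while len(combined) < length:
        added = False
        for team, ranking in random.sample([('exp', exp), ('prod', prod)], 2):
            if len(combined) >= length:
                break
            doc = next((d for d in ranking if d not in teams), None)
            if doc is not None:
                combined.append(doc)
                teams[doc] = team
                added = True
        if not added:
            break
    return combined, teams

def serve(query, precomputed_exp, production_ranker, length=10):
    """precomputed_exp: head query -> experimental ranking computed offline."""
    prod = production_ranker(query)
    if query in precomputed_exp:                   # head query: run the experiment
        return interleave_pair(precomputed_exp[query], prod, length)
    return prod[:length], {}                       # tail query: production only
```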


International Workshop of the Initiative for the Evaluation of XML Retrieval | 2011

University of Amsterdam Data Centric Ad Hoc and Faceted Search Runs

Anne Schuth; Maarten Marx

We describe the ad hoc and faceted search runs for the 2011 INEX data centric task that were submitted by the ILPS group of the University of Amsterdam. In our runs, we translate the content-and-structure queries into XQuery with full-text expressions and process them using the eXist+Lucene XQuery with full-text processor.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2015

Probabilistic Multileave for Online Retrieval Evaluation

Anne Schuth; Robert-Jan Bruintjes; Fritjof Büttner; Joost van Doorn; Carla Groenland; Harrie Oosterhuis; Cong-Nguyen Tran; Bas Veeling; Jos van der Velde; Roger Wechsler; David Woudenberg; Maarten de Rijke

Online evaluation methods for information retrieval use implicit signals such as clicks from users to infer preferences between rankers. A highly sensitive way of inferring these preferences is through interleaved comparisons. Recently, interleaved comparison methods that allow for simultaneous evaluation of more than two rankers have been introduced. These so-called multileaving methods are even more sensitive than their interleaving counterparts. Probabilistic interleaving, whose main selling point is the potential for reuse of historical data, has no multileaving counterpart yet. We propose probabilistic multileave and empirically show that it is highly sensitive and unbiased. An important implication of this result is that historical interactions with multileaved comparisons can be reused, allowing for ranker comparisons that need much less user interaction data. Furthermore, we show that our method, as opposed to earlier sensitive multileaving methods, scales well when the number of rankers increases.
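
A hedged sketch of the probabilistic multileaving idea follows; the rank-based soft distribution mirrors probabilistic interleaving, but the expected-credit step is a deliberate simplification of the paper's marginalisation over team assignments.

```python
# Probabilistic multileaving sketch: each ranker is a soft distribution over
# its documents, P(d | ranker) proportional to 1 / rank(d)^tau.
import random

def rank_distribution(ranking, tau=3.0):
    weights = [1.0 / (rank + 1) ** tau for rank in range(len(ranking))]
    total = sum(weights)
    return {doc: w / total for doc, w in zip(ranking, weights)}

def probabilistic_multileave(rankings, length, tau=3.0):
    dists = [rank_distribution(r, tau) for r in rankings]
    combined = []
    while len(combined) < length:
        available = [{d: p for d, p in dist.items() if d not in combined}
                     for dist in dists]
        available = [a for a in available if a]
        if not available:
            break                                   # all documents used
        remaining = random.choice(available)        # pick a ranker's distribution
        docs, probs = zip(*remaining.items())
        combined.append(random.choices(docs, weights=probs)[0])
    return combined, dists

def expected_credit(clicked_docs, dists):
    """Simplified: each click credits rankers in proportion to the probability
    mass they assign to the clicked document."""
    credit = [0.0] * len(dists)
    for doc in clicked_docs:
        mass = [dist.get(doc, 0.0) for dist in dists]
        total = sum(mass)
        if total > 0:
            credit = [c + m / total for c, m in zip(credit, mass)]
    return credit
```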


ACM Transactions on Information Systems | 2015

A Comparative Analysis of Interleaving Methods for Aggregated Search

Aleksandr Chuklin; Anne Schuth; Ke Zhou; Maarten de Rijke

A result page of a modern search engine often goes beyond a simple list of “10 blue links.” Many specific user needs (e.g., News, Image, Video) are addressed by so-called aggregated or vertical search solutions: specially presented documents, often retrieved from specific sources, that stand out from the regular organic Web search results. When it comes to evaluating ranking systems, such complex result layouts raise their own challenges. This is especially true for so-called interleaving methods that have arisen as an important type of online evaluation: by mixing results from two different result pages, interleaving can easily break the desired Web layout in which vertical documents are grouped together, and hence hurt the user experience. We conduct an analysis of different interleaving methods as applied to aggregated search engine result pages. Apart from conventional interleaving methods, we propose two vertical-aware methods: one derived from the widely used Team-Draft Interleaving method by adjusting it in such a way that it respects vertical document groupings, and another based on the recently introduced Optimized Interleaving framework. We show that our proposed methods are better at preserving the user experience than existing interleaving methods while still performing well as a tool for comparing ranking systems. For evaluating our proposed vertical-aware interleaving methods, we use real-world click data as well as simulated clicks and simulated ranking systems.
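
One way to picture the vertical-aware adjustment is to draft over units rather than single documents, so a vertical block is always contributed as a whole; the sketch below illustrates that grouping idea under assumed data shapes and is not the published algorithm.

```python
# Vertical-aware team drafting sketch: a unit is either a single organic
# document or a whole vertical block, so grouped vertical results are never
# split apart by the interleaving.
import random

def vertical_aware_team_draft(units_a, units_b, length):
    """units_*: lists of units; a unit is a tuple of document ids
    (a singleton for organic results, longer for a vertical block)."""
    combined, teams, used = [], {}, set()
    while len(combined) < length:
        added = False
        for team, units in random.sample([('A', units_a), ('B', units_b)], 2):
            if len(combined) >= length:
                break
            unit = next((u for u in units if not used & set(u)), None)
            if unit is not None:
                for doc in unit:                   # keep the block contiguous
                    combined.append(doc)
                    teams[doc] = team
                used |= set(unit)
                added = True
        if not added:
            break
    return combined, teams
```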


Cross Language Evaluation Forum | 2011

Evaluation Methods for Rankings of Facetvalues for Faceted Search

Anne Schuth; Maarten Marx

We introduce two metrics aimed at evaluating systems that select facetvalues for a faceted search interface. Facetvalues are the values of meta-data fields in semi-structured data and are commonly used to refine queries. It is often the case that there are more facetvalues than can be displayed to a user and thus a selection has to be made. Our metrics evaluate these selections based on binary relevance assessments for the documents in a collection. Both our metrics are based on Normalized Discounted Cumulated Gain, a widely used Information Retrieval metric.
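
A small sketch of such an NDCG-style score for a ranked selection of facetvalues is given below; the gain definition (number of relevant documents reachable through a facetvalue) is an assumption for illustration, not necessarily the metric proposed in the paper.

```python
# NDCG-style score for a ranked selection of facetvalues under binary relevance.
import math

def ndcg_facetvalues(selected, relevant_docs, docs_per_value, k=None):
    """selected: ranked list of facetvalues shown to the user.
    relevant_docs: set of relevant document ids (binary relevance).
    docs_per_value: facetvalue -> set of document ids it refines to."""
    k = k or len(selected)
    gains = [len(docs_per_value.get(v, set()) & relevant_docs) for v in selected[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted((len(docs & relevant_docs) for docs in docs_per_value.values()),
                   reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```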


European Conference on Information Retrieval | 2014

Effects of Position Bias on Click-Based Recommender Evaluation

Katja Hofmann; Anne Schuth; Alejandro Bellogín; Maarten de Rijke

Measuring the quality of recommendations produced by a recommender system (RS) is challenging. Labels used for evaluation are typically obtained from users of an RS, by asking for explicit feedback, or by inferring labels from implicit feedback. Both approaches can introduce significant biases in the evaluation process. We investigate biases that may affect labels inferred from implicit feedback. Implicit feedback is easy to collect but can be prone to biases, such as position bias. We examine this bias using click models, and show how bias following these models would affect the outcomes of RS evaluation. We find that evaluation based on implicit and explicit feedback can agree well, but only when the evaluation metrics are designed to take user behavior and preferences into account, stressing the importance of understanding user behavior in deployed RSs.
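
To make the position-bias mechanism concrete, the sketch below simulates a position-based click model in which a click requires both examination (decaying with rank) and attraction (driven by the item's true relevance); the parameter values and names are assumptions.

```python
# Position-based click model simulation: P(click) = P(examine rank) * P(attracted).
import random

def simulate_clicks(ranked_items, relevance, examine_decay=0.7, rng=random):
    """relevance: item -> probability the user is attracted when examining it."""
    clicks = []
    for rank, item in enumerate(ranked_items):
        p_examine = examine_decay ** rank           # examination decays with rank
        if rng.random() < p_examine and rng.random() < relevance.get(item, 0.0):
            clicks.append(item)
    return clicks
```

Click counts gathered under such a model systematically favour whatever is shown near the top, independent of true relevance, which is the kind of distortion the paper examines.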

Collaboration


Dive into Anne Schuth's collaborations.

Top Co-Authors

Maarten Marx
University of Amsterdam

Liadh Kelly
Dublin City University

Daan Odijk
University of Amsterdam