Kevin Roitero
University of Udine
Publications
Featured research published by Kevin Roitero.
International Conference on the Theory of Information Retrieval | 2017
Eddy Maddalena; Kevin Roitero; Gianluca Demartini; Stefano Mizzaro
The agreement between relevance assessors is an important but understudied topic in the Information Retrieval literature, because of the limited data available about documents assessed by multiple judges. This issue has gained even more importance recently in light of crowdsourced relevance judgments, where it is customary to gather many relevance labels for each topic-document pair. In a crowdsourcing setting, agreement is often used even as a proxy for quality, although without any systematic verification of the conjecture that higher agreement corresponds to higher quality. In this paper we address this issue and study in particular: the effect of topic on assessor agreement; the relationship between assessor agreement and judgment quality; the effect of agreement on ranking systems according to their effectiveness; and the definition of an agreement-aware effectiveness metric that does not discard information about multiple judgments for the same document, as typically happens in a crowdsourcing setting.
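As an illustration of the kind of agreement computation involved (a minimal sketch with toy data, not the paper's metric or collection), the following Python snippet computes mean pairwise agreement among crowd workers, aggregated per topic:

```python
# A minimal sketch with toy data (not the paper's metric or collection):
# mean pairwise agreement among crowd workers, aggregated per topic.
from itertools import combinations
from collections import defaultdict

# Hypothetical layout: labels[(topic, doc)] = relevance labels from several workers.
labels = {
    ("T401", "d1"): [1, 1, 0, 1],
    ("T401", "d2"): [0, 0, 0, 1],
    ("T402", "d3"): [1, 0, 1, 0],
}

def pairwise_agreement(votes):
    """Fraction of worker pairs that gave the same label."""
    pairs = list(combinations(votes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

per_topic = defaultdict(list)
for (topic, _doc), votes in labels.items():
    per_topic[topic].append(pairwise_agreement(votes))

for topic, values in per_topic.items():
    print(topic, sum(values) / len(values))  # mean agreement for the topic
```

Per-topic averages of this kind are what would then be related to judgment quality and to system effectiveness.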
European Conference on Information Retrieval | 2017
Kevin Roitero; Eddy Maddalena; Stefano Mizzaro
After a network-based analysis of TREC results, Mizzaro and Robertson [4] found the rather unpleasant result that topic ease (i.e., the average effectiveness of the participating systems, measured with average precision) correlates with the ability of topics to predict system effectiveness (defined as topic hubness). We address this issue by: (i) performing a more detailed analysis, and (ii) using three different datasets. Our results are threefold. First, we confirm that the original result is indeed correct and general across datasets. Second, we show, however, that the result is less worrying than it might seem at first glance, since it depends on including the least effective systems in the analysis: easy topics discriminate between the most and least effective systems, but when focusing only on the most effective systems this is no longer true. Third, we clarify what happens when using the GMAP metric.
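For reference, the quantities discussed above can be written as follows (standard definitions, sketched here rather than taken from the paper): topic ease is the mean average precision (AP) of the participating systems S on a topic t, while GMAP is the geometric mean of AP over the topic set T, conventionally computed with a small epsilon to handle zero AP values.

```latex
% Standard definitions, sketched for reference (epsilon handles AP = 0):
\[
  \mathrm{Ease}(t) = \frac{1}{|S|} \sum_{s \in S} \mathrm{AP}(s, t),
  \qquad
  \mathrm{GMAP}(s) = \exp\!\left( \frac{1}{|T|} \sum_{t \in T} \log\bigl(\mathrm{AP}(s, t) + \epsilon\bigr) \right)
\]
```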
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2018
Stefano Mizzaro; Josiane Mothe; Kevin Roitero; Zia Ullah
Some methods have been developed for automatic effectiveness evaluation without relevance judgments. We propose to use those methods, and their combination based on a machine learning approach, for query performance prediction. Moreover, since predicting average precision, as is usually done in the query performance prediction literature, is sensitive to the chosen reference system, we focus on predicting the average of the average precision values over several systems. Results of an extensive experimental evaluation on ten TREC collections show that our proposed methods outperform state-of-the-art query performance predictors.
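A minimal sketch of the combination idea on synthetic data follows; the feature layout and the regressor choice are assumptions for illustration, not the authors' exact pipeline.

```python
# A sketch of the combination idea on synthetic data: several judgment-free
# effectiveness estimates per query are used as features to predict AAP,
# the average of average precision over a set of reference systems.
# The regressor choice is illustrative, not the authors' exact pipeline.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_queries = 50
X = rng.random((n_queries, 3))  # hypothetical scores from 3 automatic methods
y = rng.random(n_queries)       # hypothetical AAP values to be predicted

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:40], y[:40])              # train on 40 queries
predicted_aap = model.predict(X[40:])  # predict AAP for the held-out 10
print(predicted_aap)
```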
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2018
Kevin Roitero; Michael Soprano; Stefano Mizzaro
Several researchers have proposed to reduce the number of topics used in TREC-like initiatives. One research direction that has been pursued is finding the optimal topic subset of a given cardinality that evaluates the systems/runs in the most accurate way. This research direction has so far been mainly theoretical, with almost no indication of how to select the few good topics in practice. We propose such a practical criterion for topic selection: we rely on methods for automatic system evaluation without relevance judgments and, by running experiments on several TREC collections, we show that the topics selected on the basis of those evaluations are indeed more informative than random topics.
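One way such a criterion could be turned into a concrete procedure is sketched below: greedily grow a topic subset so that the system ranking it induces correlates best, in terms of Kendall's tau, with the ranking induced by the full topic set, where the per-topic scores would come from automatic (judgment-free) evaluation. The data and the greedy strategy are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative greedy selection on toy data (not the paper's exact algorithm):
# grow a topic subset so that the system ranking it induces correlates best,
# in terms of Kendall's tau, with the ranking induced by all topics.
# In the intended setting, scores[s, t] would be an automatically estimated
# (judgment-free) effectiveness score of system s on topic t.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(1)
n_systems, n_topics = 20, 10
scores = rng.random((n_systems, n_topics))  # systems x topics (toy values)
full_ranking = scores.mean(axis=1)          # ranking induced by all topics

def tau_with_full(subset):
    tau, _p = kendalltau(scores[:, subset].mean(axis=1), full_ranking)
    return tau

selected, remaining = [], set(range(n_topics))
for _ in range(5):  # pick a subset of 5 topics
    best = max(remaining, key=lambda t: tau_with_full(selected + [t]))
    selected.append(best)
    remaining.remove(best)

print("selected topics:", selected, "tau:", tau_with_full(selected))
```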
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2018
Kevin Roitero
In test-collection-based evaluation of retrieval effectiveness, much research has investigated different directions for an economical and semi-automatic evaluation of retrieval systems. Although several methods have been proposed and experimentally evaluated, their accuracy still seems limited. In this paper we present our proposal for a more engineered approach to information retrieval evaluation.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2018
Kevin Roitero; Eddy Maddalena; Gianluca Demartini; Stefano Mizzaro
In Information Retrieval evaluation, the classical approach of adopting binary relevance judgments has been replaced by multi-level relevance judgments and by gain-based metrics leveraging such multi-level judgment scales. Recent work has also proposed and evaluated unbounded relevance scales by means of Magnitude Estimation (ME) and compared them with multi-level scales. While ME brings advantages, like the ability for assessors to always judge the next document as having higher or lower relevance than any of the documents they have judged so far, it also comes with some drawbacks: for example, it is not a natural approach for human assessors, who are used to judging items on bounded scales on the Web (e.g., 5-star ratings). In this work, we propose and experimentally evaluate a bounded and fine-grained relevance scale that has many of the advantages and deals with some of the issues of ME. We collect relevance judgments over a 100-level relevance scale (S100) by means of a large-scale crowdsourcing experiment and compare the results with other relevance scales (binary, 4-level, and ME), showing the benefit of fine-grained scales over both coarse-grained and unbounded scales, as well as highlighting some new results on ME. Our results show that S100 maintains the flexibility of unbounded scales like ME in providing assessors with ample choice when judging document relevance (i.e., assessors can fit relevance judgments in between previously given judgments). It also allows assessors to judge on a more familiar scale (e.g., on 10 levels) and to perform efficiently from the very first judging task.
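As a toy illustration of working with S100 judgments (hypothetical data; the median aggregation and equal-width binning are chosen only for illustration, not taken from the paper), one can aggregate worker scores per document and map them onto a coarser scale for comparison:

```python
# Toy illustration (hypothetical data; binning and aggregation chosen only for
# illustration): aggregate S100 worker scores per document with the median,
# then map the aggregate onto a coarser 4-level scale for comparison.
from statistics import median

s100_judgments = {          # document -> worker scores on the 0-100 scale
    "d1": [85, 92, 78],
    "d2": [10, 25, 5],
    "d3": [55, 60, 48],
}

def to_4_level(score):
    """Map an S100 score onto 4 relevance levels via equal-width bins."""
    return min(int(score // 25), 3)

for doc, scores in s100_judgments.items():
    aggregate = median(scores)
    print(doc, aggregate, "->", to_4_level(aggregate))
```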
Journal of Data and Information Quality | 2018
Kevin Roitero; Marco Passon; Giuseppe Serra; Stefano Mizzaro
The evaluation of retrieval effectiveness by means of test collections is a commonly used methodology in the information retrieval field. Some researchers have addressed the quite fascinating research question of whether it is possible to evaluate effectiveness completely automatically, without human relevance assessments. Since human relevance assessment is one of the main costs of building a test collection, in terms of both time and money, this rather ambitious goal would have a practical impact. In this article, we reproduce the main results on evaluating information retrieval systems without relevance judgments; furthermore, we generalize such previous work to analyze the effect of test collections, evaluation metrics, and pool depth. We also expand the idea to semi-automatic evaluation and to the estimation of topic difficulty. Our results show that (i) previous work is overall reproducible, although some specific results are not; (ii) collection, metric, and pool depth impact the automatic evaluation of systems, which is nevertheless accurate in several cases; (iii) semi-automatic evaluation is an effective methodology; and (iv) automatic evaluation can (to some extent) be used to predict topic difficulty.
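A hedged sketch of one classic judgment-free idea in this line of work, random pseudo-qrels sampled from the pool, is given below: systems are scored against randomly drawn "relevant" documents and the scores are averaged over many trials. The runs, pool depth, and metric are toy placeholders, not TREC data or the article's exact setup.

```python
# A toy sketch of one classic judgment-free idea (random pseudo-qrels drawn
# from the pool): score each system against randomly chosen "relevant"
# documents and average over many trials. Runs, pool depth, and the metric
# are placeholders, not TREC data or the article's exact setup.
import random

random.seed(0)
runs = {  # hypothetical ranked lists for a single topic
    "sysA": ["d1", "d2", "d3", "d4", "d5"],
    "sysB": ["d2", "d5", "d6", "d1", "d7"],
    "sysC": ["d8", "d9", "d1", "d2", "d3"],
}
pool_depth = 5
pool = sorted({d for run in runs.values() for d in run[:pool_depth]})

def precision_at_k(run, qrels, k=5):
    return sum(d in qrels for d in run[:k]) / k

trials = 100
estimated = {s: 0.0 for s in runs}
for _ in range(trials):
    pseudo_qrels = set(random.sample(pool, len(pool) // 2))
    for s, run in runs.items():
        estimated[s] += precision_at_k(run, pseudo_qrels) / trials

print(sorted(estimated.items(), key=lambda kv: -kv[1]))  # estimated ranking
```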
Journal of Data and Information Quality | 2018
Kevin Roitero; Michael Soprano; Andrea Brunello; Stefano Mizzaro
Effectiveness evaluation of information retrieval systems by means of a test collection is a widely used methodology. However, it is rather expensive in terms of resources, time, and money; therefore, many researchers have proposed methods for a cheaper evaluation. One particular approach, on which we focus in this article, is to use fewer topics: in TREC-like initiatives, system effectiveness is usually evaluated as the average effectiveness over a set of n topics (typically n=50, although more than 1,000 have also been adopted); instead of using the full set, it has been proposed to find the best subsets of a few good topics that evaluate the systems as similarly as possible to the full set. The computational complexity of the task has so far limited the analyses that could be performed. We develop a novel and efficient approach based on a multi-objective evolutionary algorithm. The higher efficiency of our new implementation allows us to reproduce some notable results on topic set reduction, as well as to perform new experiments to generalize and improve such results. We show that our approach both reproduces the main state-of-the-art results and allows us to analyze the effect of the collection, metric, and pool depth used for the evaluation. Finally, unlike previous studies, which have been mainly theoretical, we are also able to discuss some practical topic selection strategies, integrating results of automatic evaluation approaches.
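The following simplified sketch conveys the evolutionary idea on toy data: search over fixed-cardinality topic subsets for one whose induced system ranking best correlates with the full-set ranking. The article's algorithm is multi-objective (it also trades off subset cardinality); this single-objective version with random mutation and truncation selection is only an illustration.

```python
# Simplified, single-objective sketch of the evolutionary idea on toy data:
# search over fixed-cardinality topic subsets for one whose induced system
# ranking best correlates with the full-set ranking. The article's algorithm
# is multi-objective (it also trades off subset cardinality); this version
# uses only random mutation and truncation selection.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(7)
n_systems, n_topics, subset_size = 30, 50, 10
scores = rng.random((n_systems, n_topics))  # toy system-by-topic effectiveness
full_ranking = scores.mean(axis=1)

def fitness(subset):
    tau, _p = kendalltau(scores[:, sorted(subset)].mean(axis=1), full_ranking)
    return tau

def mutate(subset):
    child = set(subset)
    child.remove(int(rng.choice(sorted(child))))  # drop one topic...
    child.add(int(rng.choice([t for t in range(n_topics) if t not in child])))  # ...add another
    return frozenset(child)

population = [frozenset(rng.choice(n_topics, subset_size, replace=False).tolist())
              for _ in range(20)]
for _generation in range(50):
    offspring = [mutate(individual) for individual in population]
    population = sorted(set(population + offspring), key=fitness, reverse=True)[:20]

best = population[0]
print("best subset:", sorted(best), "tau:", fitness(best))
```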
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2018
Kevin Roitero; Eddy Maddalena; Yannick Ponte; Stefano Mizzaro
National Conference on Artificial Intelligence | 2017
Alessandro Checco; Kevin Roitero; Eddy Maddalena; Stefano Mizzaro; Gianluca Demartini