Network


Eddy Maddalena's latest external collaborations at the country level.

Hotspot


Dive into the research topics where Eddy Maddalena is active.

Publication


Featured research published by Eddy Maddalena.


ACM Transactions on Information Systems | 2017

On Crowdsourcing Relevance Magnitudes for Information Retrieval Evaluation

Eddy Maddalena; Stefano Mizzaro; Falk Scholer; Andrew Turpin

Magnitude estimation is a psychophysical scaling technique for the measurement of sensation, where observers assign numbers to stimuli in response to their perceived intensity. We investigate the use of magnitude estimation for judging the relevance of documents for information retrieval evaluation, carrying out a large-scale user study across 18 TREC topics and collecting over 50,000 magnitude estimation judgments using crowdsourcing. Our analysis shows that magnitude estimation judgments can be reliably collected using crowdsourcing, are competitive in terms of assessor cost, and are, on average, rank-aligned with ordinal judgments made by expert relevance assessors. We explore the application of magnitude estimation for IR evaluation, calibrating two gain-based effectiveness metrics, nDCG and ERR, directly from user-reported perceptions of relevance. A comparison of TREC system effectiveness rankings based on binary, ordinal, and magnitude estimation relevance shows substantial variation; in particular, the top systems ranked using magnitude estimation and ordinal judgments differ substantially. Analysis of the magnitude estimation scores shows that this effect is due in part to varying perceptions of relevance: different users have different perceptions of the impact of relative differences in document relevance. These results have direct implications for IR evaluation, suggesting that current assumptions about a single view of relevance being sufficient to represent a population of users are unlikely to hold.
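
The calibration idea above (deriving nDCG gains directly from user-reported relevance rather than from fixed ordinal grades) can be illustrated with a minimal Python sketch. The function and the ME-derived gains below are hypothetical, not the authors' actual pipeline.

```python
import math

def ndcg_from_me(ranked_doc_ids, me_gain, k=10):
    """nDCG@k where each document's gain is taken directly from an
    (aggregated) magnitude estimation score instead of an ordinal grade.

    ranked_doc_ids: documents in the order returned by the system.
    me_gain: dict doc_id -> non-negative gain derived from ME judgments
             (e.g., the median ME score over judges, rescaled to start at 0).
    """
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [me_gain.get(d, 0.0) for d in ranked_doc_ids[:k]]
    ideal = dcg(sorted(me_gain.values(), reverse=True)[:k])
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical ME-derived gains for five documents.
me_gain = {"d1": 9.0, "d2": 3.5, "d3": 0.0, "d4": 7.0, "d5": 1.0}
print(ndcg_from_me(["d2", "d1", "d5", "d4", "d3"], me_gain, k=5))
```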


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2015

The Benefits of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation

Andrew Turpin; Falk Scholer; Stefano Mizzaro; Eddy Maddalena

Magnitude estimation is a psychophysical scaling technique for the measurement of sensation, where observers assign numbers to stimuli in response to their perceived intensity. We investigate the use of magnitude estimation for judging the relevance of documents in the context of information retrieval evaluation, carrying out a large-scale user study across 18 TREC topics and collecting more than 50,000 magnitude estimation judgments. Our analysis shows that on average magnitude estimation judgments are rank-aligned with ordinal judgments made by expert relevance assessors. An advantage of magnitude estimation is that users can choose their own scale for judgments, allowing deeper investigations of user perceptions than when categorical scales are used. We explore the application of magnitude estimation for IR evaluation, calibrating two gain-based effectiveness metrics, nDCG and ERR, directly from user-reported perceptions of relevance. A comparison of TREC system effectiveness rankings based on binary, ordinal, and magnitude estimation relevance shows substantial variation; in particular, the top systems ranked using magnitude estimation and ordinal judgments differ substantially. Analysis of the magnitude estimation scores shows that this effect is due in part to varying perceptions of relevance, in terms of how impactful relative differences in document relevance are perceived to be. We further use magnitude estimation to investigate gain profiles, comparing the currently assumed linear and exponential approaches with actual user-reported relevance perceptions. This indicates that the currently used exponential gain profiles in nDCG and ERR are mismatched with an average user, but perhaps more importantly that individual perceptions are highly variable. These results have direct implications for IR evaluation, suggesting that current assumptions about a single view of relevance being sufficient to represent a population of users are unlikely to hold. Finally, we demonstrate that magnitude estimation judgments can be reliably collected using crowdsourcing, and are competitive in terms of assessor cost.
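
The comparison of gain profiles can be made concrete with a small sketch of ERR under the standard exponential mapping and a simple linear alternative. This illustrates the general metric definitions only, not the calibration procedure used in the paper; the example grades are hypothetical.

```python
def err(grades, gmax, gain="exponential"):
    """Expected Reciprocal Rank for a ranked list of ordinal relevance grades.

    grades: relevance grade of each ranked document (0..gmax).
    gain:   "exponential" uses the standard (2^g - 1) / 2^gmax mapping;
            "linear" uses g / gmax as a simple alternative profile.
    """
    def stop_prob(g):
        if gain == "exponential":
            return (2 ** g - 1) / (2 ** gmax)
        return g / gmax

    p_continue, total = 1.0, 0.0
    for rank, g in enumerate(grades, start=1):
        r = stop_prob(g)
        total += p_continue * r / rank   # user stops here with probability r
        p_continue *= (1 - r)            # otherwise continues down the ranking
    return total

grades = [2, 0, 3, 1]                         # ordinal grades on a 0..3 scale
print(err(grades, gmax=3))                    # exponential gain profile
print(err(grades, gmax=3, gain="linear"))     # linear gain profile
```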


Diagnostic Pathology | 2014

Preliminary results from a crowdsourcing experiment in immunohistochemistry

Vincenzo Della Mea; Eddy Maddalena; Stefano Mizzaro; Piernicola Machin; Carlo Alberto Beltrami

Background: Crowdsourcing, i.e., the outsourcing of tasks typically performed by a few experts to a large crowd as an open call, has been shown to be reasonably effective in many cases, such as Wikipedia, the chess match of Kasparov against the world in 1999, and several others. The aim of the present paper is to describe the setup of an experiment applying crowdsourcing techniques to the quantification of immunohistochemistry. Methods: Fourteen images from MIB1-stained breast specimens were first manually counted by a pathologist, then submitted to a crowdsourcing platform through a specifically developed application. Ten positivity evaluations were collected for each image and summarized using their median. The positivity values were then compared to the gold standard provided by the pathologist by means of Spearman correlation. Results: There were 28 contributors in total, each evaluating 4.64 images on average. The Spearman correlation between gold-standard and crowdsourced positivity percentages is 0.946 (p < 0.001). Conclusions: The aim of the experiment was to understand how to use crowdsourcing for an image analysis task that is currently time-consuming when done by human experts. Crowdsourced work can be used in various ways, in particular by statistically aggregating data to reduce identification errors. However, in this preliminary experiment we considered only the most basic indicator, the median positivity percentage, which provided good overall results. This method may be better suited to research than to routine use: when a large number of images need ad hoc evaluation, crowdsourcing may represent a quick answer to the need.
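
A minimal sketch of the aggregation and evaluation pipeline described in the Methods (median over ten crowd estimates per image, then Spearman correlation against the pathologist's gold standard). The data values below are invented for illustration and do not come from the study.

```python
from statistics import median
from scipy.stats import spearmanr

# Hypothetical crowdsourced positivity percentages: image_id -> 10 worker estimates.
crowd = {
    "img01": [12, 15, 10, 14, 13, 11, 16, 12, 15, 13],
    "img02": [55, 60, 58, 52, 61, 57, 59, 56, 54, 60],
    "img03": [30, 28, 35, 32, 29, 31, 33, 27, 34, 30],
}
# Hypothetical pathologist gold-standard positivity for the same images.
gold = {"img01": 13.0, "img02": 58.0, "img03": 31.0}

images = sorted(crowd)
crowd_median = [median(crowd[i]) for i in images]   # aggregate the 10 evaluations per image
gold_values = [gold[i] for i in images]

rho, p = spearmanr(crowd_median, gold_values)
print(f"Spearman rho = {rho:.3f} (p = {p:.3g})")
```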


European Conference on Information Retrieval | 2015

Judging relevance using magnitude estimation

Eddy Maddalena; Stefano Mizzaro; Falk Scholer; Andrew Turpin

Magnitude estimation is a psychophysical scaling technique whereby numbers are assigned to stimuli to reflect the ratios of their perceived intensity. We report on a crowdsourcing experiment aimed at understanding whether magnitude estimation can be used to gather reliable relevance judgements for documents, as is commonly required for test collection-based evaluation of information retrieval systems. Results on a small dataset show that: (i) magnitude estimation can produce relevance rankings that are consistent with more classical ordinal judgements; (ii) both an upper-bounded and an unbounded scale can be used effectively, though with some differences; (iii) the presentation order of the documents being judged has a limited effect, if any; and (iv) only a small number of repeat judgements are required to obtain reliable magnitude estimation scores.
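
When judges each choose their own scale, magnitude estimation scores are typically normalized before aggregation. The sketch below uses a common geometric-mean (log-space) normalization; this is a standard technique for ME data, not necessarily the exact procedure used in this study, and the example judgments are hypothetical.

```python
import math
from collections import defaultdict

def normalize_me(judgments):
    """Geometric-mean normalization of magnitude estimation scores.

    judgments: list of (judge_id, doc_id, score) with score > 0.
    Returns: dict doc_id -> list of per-judge normalized (log) scores.
    """
    # Group log-scores by judge, then shift each judge's scores so their
    # mean log score is zero, removing the judge-specific choice of scale.
    by_judge = defaultdict(list)
    for judge, doc, score in judgments:
        by_judge[judge].append((doc, math.log(score)))

    normalized = defaultdict(list)
    for judge, pairs in by_judge.items():
        mean_log = sum(s for _, s in pairs) / len(pairs)
        for doc, s in pairs:
            normalized[doc].append(s - mean_log)
    return normalized

judgments = [("w1", "d1", 100), ("w1", "d2", 10),   # worker 1 uses a large scale
             ("w2", "d1", 9),   ("w2", "d2", 2)]    # worker 2 uses a small scale
print(dict(normalize_me(judgments)))
```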


International Conference on the Theory of Information Retrieval | 2017

Considering Assessor Agreement in IR Evaluation

Eddy Maddalena; Kevin Roitero; Gianluca Demartini; Stefano Mizzaro

The agreement between relevance assessors is an important but understudied topic in the Information Retrieval literature, because of the limited data available about documents assessed by multiple judges. This issue has gained even more importance recently in light of crowdsourced relevance judgments, where it is customary to gather many relevance labels for each topic-document pair. In a crowdsourcing setting, agreement is often even used as a proxy for quality, although without any systematic verification of the conjecture that higher agreement corresponds to higher quality. In this paper we address this issue and study in particular: the effect of topic on assessor agreement; the relationship between assessor agreement and judgment quality; the effect of agreement on ranking systems according to their effectiveness; and the definition of an agreement-aware effectiveness metric that does not discard information about multiple judgments for the same document, as typically happens in a crowdsourcing setting.
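
As a simple illustration of agreement over multiple crowdsourced labels per topic-document pair, the sketch below computes pairwise percent agreement. This is a generic proxy for assessor agreement, not the agreement-aware metric defined in the paper; the labels are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(labels):
    """Fraction of agreeing judge pairs for one topic-document pair."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical crowdsourced relevance labels (0 = non-relevant, 1 = relevant)
# for three topic-document pairs, five judges each.
judgments = {
    ("t1", "d1"): [1, 1, 1, 1, 0],
    ("t1", "d2"): [1, 0, 1, 0, 1],
    ("t2", "d3"): [0, 0, 0, 0, 0],
}
for pair, labels in judgments.items():
    print(pair, round(pairwise_agreement(labels), 2))
```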


European Conference on Information Retrieval | 2017

Do Easy Topics Predict Effectiveness Better Than Difficult Topics?

Kevin Roitero; Eddy Maddalena; Stefano Mizzaro

After a network-based analysis of TREC results, Mizzaro and Robertson [4] found the rather unpleasant result that topic ease (i.e., the average effectiveness of the participating systems, measured with average precision) correlates with the ability of topics to predict system effectiveness (defined as topic hubness). We address this issue by (i) performing a more detailed analysis and (ii) using three different datasets. Our results are threefold. First, we confirm that the original result is indeed correct and general across datasets. Second, we show, however, that the result is less worrying than it might seem at first glance, since it depends on including the least effective systems in the analysis. In other terms, easy topics discriminate between the most and least effective systems, but when focusing on the most effective systems only, this is no longer true. Third, we also clarify what happens when the GMAP metric is used.
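
The ease-versus-predictiveness question can be sketched on a topic-by-system matrix of AP scores: topic ease is the mean AP over systems, and a topic's predictive power is approximated here by the Kendall correlation between its per-system APs and overall MAP. Hubness in the cited network analysis is a different measure, so this is only an illustrative stand-in, and it runs on random data.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr

# Hypothetical AP matrix: rows = topics, columns = systems.
rng = np.random.default_rng(0)
ap = rng.random((20, 15))          # 20 topics x 15 systems

ease = ap.mean(axis=1)             # topic ease: mean AP over systems
map_scores = ap.mean(axis=0)       # overall system effectiveness (MAP)

# Predictive power of each topic: how well ranking systems by their AP on
# that single topic agrees with the overall MAP ranking.
predictive = np.array([kendalltau(ap[t], map_scores)[0] for t in range(ap.shape[0])])

r, p = pearsonr(ease, predictive)
print(f"correlation(topic ease, predictive power) = {r:.3f} (p = {p:.3g})")
```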


Distributed and Parallel Databases | 2015

Mobile crowdsourcing: four experiments on platforms and tasks

Vincenzo Della Mea; Eddy Maddalena; Stefano Mizzaro

We study whether the tasks currently offered on crowdsourcing platforms are well suited to mobile devices. We aim to understand both (i) which of the existing crowdsourcing platforms are better suited to mobile devices, and (ii) which kinds of tasks are better suited to mobile devices. The results of four diversified experiments (three user studies and one heuristic evaluation) suggest that: some crowdsourcing platforms are better suited to mobile devices than others; some inadequacy issues are rather superficial and can be resolved by better task design; some kinds of tasks are better suited than others; there may be some unexpected opportunities with mobile devices; and spam on the requester side should be taken into account.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2018

On Fine-Grained Relevance Scales

Kevin Roitero; Eddy Maddalena; Gianluca Demartini; Stefano Mizzaro

In Information Retrieval evaluation, the classical approach of adopting binary relevance judgments has been replaced by multi-level relevance judgments and by gain-based metrics leveraging such multi-level judgment scales. Recent work has also proposed and evaluated unbounded relevance scales by means of Magnitude Estimation (ME) and compared them with multi-level scales. While ME brings advantages, like the ability for assessors to always judge the next document as having higher or lower relevance than any of the documents they have judged so far, it also comes with some drawbacks. For example, it is not a natural approach for human assessors, who are used to judging items as they do on the Web (e.g., 5-star ratings). In this work, we propose and experimentally evaluate a bounded and fine-grained relevance scale that has many of the advantages and addresses some of the issues of ME. We collect relevance judgments over a 100-level relevance scale (S100) by means of a large-scale crowdsourcing experiment and compare the results with other relevance scales (binary, 4-level, and ME), showing the benefit of fine-grained scales over both coarse-grained and unbounded scales as well as highlighting some new results on ME. Our results show that S100 maintains the flexibility of unbounded scales like ME in providing assessors with ample choice when judging document relevance (i.e., assessors can fit relevance judgments in between previously given judgments). It also allows assessors to judge on a more familiar scale (e.g., on 10 levels) and to perform efficiently from the very first judging task.
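
Comparing S100 with coarser scales requires mapping the fine-grained judgments down to fewer levels. The thresholds in the sketch below are illustrative choices, not the mapping used in the experiments.

```python
def s100_to_levels(score, thresholds=(1, 34, 67)):
    """Map an S100 judgment (0-100) to a 4-level ordinal grade (0-3).
    The thresholds are illustrative, not those used in the paper."""
    grade = 0
    for t in thresholds:
        if score >= t:
            grade += 1
    return grade

def s100_to_binary(score, threshold=50):
    """Illustrative binarization of an S100 judgment."""
    return int(score >= threshold)

for s in (0, 10, 40, 75, 100):
    print(s, s100_to_levels(s), s100_to_binary(s))
```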


International Conference on Asian Language Processing | 2016

Towards building a standard dataset for Arabic keyphrase extraction evaluation

Muhammad Helmy; Marco Basaldella; Eddy Maddalena; Stefano Mizzaro; Gianluca Demartini

Keyphrases are short phrases that best represent a document's content. They can be useful in a variety of applications, including document summarization and retrieval models. In this paper, we introduce the first dataset of keyphrases for an Arabic document collection, obtained by means of crowdsourcing. We experimentally evaluate different crowdsourced answer aggregation strategies and validate their performance against expert annotations to assess the quality of our dataset. We report on our experimental results, the dataset features, some lessons learned, and ideas for future work.
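
A minimal example of one possible crowdsourced answer aggregation strategy (majority voting over worker-proposed keyphrases). The voting threshold and the example keyphrases are hypothetical and not necessarily among the strategies evaluated in the paper.

```python
from collections import Counter

def aggregate_keyphrases(worker_answers, min_votes=3):
    """Keep a keyphrase if at least `min_votes` workers proposed it."""
    votes = Counter(kp for answers in worker_answers for kp in set(answers))
    return [kp for kp, v in votes.items() if v >= min_votes]

# Hypothetical keyphrases proposed by five workers for one document
# (shown in English for readability).
workers = [
    ["information retrieval", "keyphrase extraction", "crowdsourcing"],
    ["keyphrase extraction", "crowdsourcing"],
    ["keyphrase extraction", "Arabic NLP"],
    ["crowdsourcing", "keyphrase extraction"],
    ["Arabic NLP", "keyphrase extraction"],
]
print(aggregate_keyphrases(workers, min_votes=3))
```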


DBCrowd | 2013

Crowdsourcing to Mobile Users: A Study of the Role of Platforms and Tasks

Vincenzo Della Mea; Eddy Maddalena; Stefano Mizzaro

Collaboration


Dive into Eddy Maddalena's collaborations.
