Fabrizio Sebastiani | Researchain

Publication

Featured researches published by Fabrizio Sebastiani.

ACM Computing Surveys | 2002

Machine learning in automated text categorization

Fabrizio Sebastiani

The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

European Conference on Research and Advanced Technology for Digital Libraries | 2000

Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization

Luigi Galavotti; Fabrizio Sebastiani; Maria Simi

We tackle two different problems of text categorization (TC), namely feature selection and classifier induction. Feature selection (FS) refers to the activity of selecting, from the set of r distinct features (i.e. words) occurring in the collection, the subset of r′ ≪ r features that are most useful for compactly representing the meaning of the documents. We propose a novel FS technique, based on a simplified variant of the χ² statistic. Classifier induction refers instead to the problem of automatically building a text classifier by learning from a set of documents pre-classified under the categories of interest. We propose a novel variant, based on the exploitation of negative evidence, of the well-known k-NN method. We report the results of systematic experimentation of these two methods performed on the standard Reuters-21578 benchmark.
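To make the feature selection step concrete, here is a minimal sketch of scoring terms with the standard χ² statistic and keeping the r′ best ones. It is not the simplified variant proposed in the paper, which the abstract does not define. The function names, the representation of documents as sets of words, and the toy data are assumptions made for illustration.

```python
def chi_square(n_tc, n_t, n_c, n):
    """Standard chi-square score of term t for category c, from document counts.

    n_tc: documents in c that contain t
    n_t:  documents that contain t
    n_c:  documents in c
    n:    documents in the collection
    """
    a = n_tc                  # t present, document in c
    b = n_t - n_tc            # t present, document not in c
    c = n_c - n_tc            # t absent, document in c
    d = n - n_t - n_c + n_tc  # t absent, document not in c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0            # term or category is degenerate (all or none)
    return n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, category, r_prime):
    """Return the r_prime terms with the highest chi-square score for `category`.

    docs is a list of sets of words (hypothetical representation);
    ties are broken alphabetically so the result is deterministic.
    """
    n = len(docs)
    n_c = sum(1 for y in labels if y == category)
    vocab = set().union(*docs)
    scores = {}
    for t in vocab:
        n_t = sum(1 for doc in docs if t in doc)
        n_tc = sum(1 for doc, y in zip(docs, labels) if t in doc and y == category)
        scores[t] = chi_square(n_tc, n_t, n_c, n)
    return sorted(vocab, key=lambda t: (-scores[t], t))[:r_prime]


# Toy collection (made up for illustration): r = 4 distinct words, r' = 2.
toy_docs = [{"wheat", "price"}, {"wheat", "export"}, {"oil", "price"}, {"oil", "export"}]
toy_labels = ["grain", "grain", "energy", "energy"]
selected = select_features(toy_docs, toy_labels, "grain", 2)
```

On this toy collection the words "wheat" and "oil" are selected, since each perfectly separates the two categories, while "price" and "export" score zero. Note that χ² rewards negative correlation as much as positive correlation, which is why "oil" ranks as high as "wheat" for the "grain" category.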

Journal of the Association for Information Science and Technology | 2005

An analysis of the relative hardness of Reuters-21578 subsets

Franca Debole; Fabrizio Sebastiani

The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, because they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last 10 years. However, the benefits that this has brought about have been somewhat limited by the fact that different researchers have “carved” different subsets out of this collection and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this article, we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have been, or will be, tested on these different subsets.

European Conference on Information Retrieval | 2009

Multi-facet Rating of Product Reviews

Stefano Baccianella; Andrea Esuli; Fabrizio Sebastiani

Online product reviews are becoming increasingly available, and are being used more and more frequently by consumers in order to choose among competing products. Tools that rank competing products in terms of the satisfaction of consumers who have purchased the product before are thus also becoming popular. We tackle the problem of rating (i.e., attributing a numerical score of satisfaction to) consumer reviews based on their textual content. We here focus on multi-facet review rating, i.e., on the case in which the review of a product (e.g., a hotel) must be rated several times, according to several aspects of the product (for a hotel: cleanliness, centrality of location, etc.). We explore several aspects of the problem, with special emphasis on how to generate vectorial representations of the text by means of POS tagging, sentiment analysis, and feature selection for ordinal regression learning. We present the results of experiments conducted on a dataset of more than 15,000 reviews that we have crawled from a popular hotel review site.

Journal of the ACM | 2001

A model of multimedia information retrieval

Carlo Meghini; Fabrizio Sebastiani; Umberto Straccia

Research on multimedia information retrieval (MIR) has recently witnessed a booming interest. A prominent feature of this research trend is its simultaneous but independent materialization within several fields of computer science. The resulting richness of paradigms, methods and systems may, in the long run, result in a fragmentation of efforts and slow down progress. The primary goal of this study is to promote an integration of methods and techniques for MIR by contributing a conceptual model that encompasses in a unified and coherent perspective the many efforts that are being produced under the label of MIR. The model offers a retrieval capability that spans two media, text and images, but also several dimensions: form, content and structure. In this way, it reconciles similarity-based methods with semantics-based ones, providing the guidelines for the design of systems that are able to provide a generalized multimedia retrieval service, in which the existing forms of retrieval not only coexist, but can be combined in any desired manner. The model is formulated in terms of a fuzzy description logic, which plays a twofold role: (1) it directly models semantics-based retrieval, and (2) it offers an ideal framework for the integration of the multimedia and multidimensional aspects of retrieval mentioned above. The model also accounts for relevance feedback in both text and image retrieval, integrating known techniques for taking into account user judgments. The implementation of the model is addressed by presenting a decomposition technique that reduces query evaluation to the processing of simpler requests, each of which can be solved by means of widely known methods for text and image retrieval, and semantic processing. A prototype for multidimensional image retrieval is presented that shows this decomposition technique at work in a significant case.

Intelligent Systems Design and Applications | 2009

Evaluation Measures for Ordinal Regression

Stefano Baccianella; Andrea Esuli; Fabrizio Sebastiani

Ordinal regression (OR, also known as ordinal classification) has received increasing attention in recent times, due to its importance in IR applications such as learning to rank and product review rating. However, research has not paid attention to the fact that typical applications of OR often involve datasets that are highly imbalanced. An imbalanced dataset has the consequence that, when testing a system with an evaluation measure conceived for balanced datasets, a trivial system assigning all items to a single class (typically, the majority class) may even outperform genuinely engineered systems. Moreover, if this evaluation measure is used for parameter optimization, a parameter choice may result that makes the system behave very much like a trivial system. In order to avoid this, evaluation measures that can handle imbalance must be used. We propose a simple way to turn standard measures for OR into ones robust to imbalance. We also show that, once used on balanced datasets, the two versions of each measure coincide, and therefore argue that our measures should become the standard choice for OR.
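The imbalance problem described above can be illustrated with a small sketch. It compares the standard mean absolute error (MAE) with a macroaveraged version that first averages the error within each true class and then across classes. Macroaveraging is an assumption of this sketch, chosen as one plausible way to make a measure robust to imbalance; the abstract does not spell out the authors' construction. The function names and the toy ratings are hypothetical.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Standard MAE: every item counts equally, so large classes dominate."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_mae(y_true, y_pred):
    """Macroaveraged MAE (assumed variant): every true class counts equally."""
    errors_by_class = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors_by_class[t].append(abs(t - p))
    per_class = [sum(errs) / len(errs) for errs in errors_by_class.values()]
    return sum(per_class) / len(per_class)


# Toy imbalanced test set (made up): eight items rated 4, one rated 1, one rated 5.
y_true = [4] * 8 + [1, 5]
trivial = [4] * 10                             # always predicts the majority class
engineered = [4, 4, 4, 4, 5, 5, 3, 3, 2, 5]    # tries to fit every class

scores = {
    "trivial": (mae(y_true, trivial), macro_mae(y_true, trivial)),
    "engineered": (mae(y_true, engineered), macro_mae(y_true, engineered)),
}
```

On this toy data the trivial system wins under standard MAE (0.4 against 0.5) but loses clearly under the macroaveraged version (about 1.33 against 0.5). When every class has the same number of test items, the two functions return the same value, which matches the coincidence on balanced datasets mentioned in the abstract.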

International ACM SIGIR Conference on Research and Development in Information Retrieval | 1994

A probabilistic terminological logic for modelling information retrieval

Fabrizio Sebastiani

Some researchers have recently argued that the task of Information Retrieval (IR) may successfully be described by means of mathematical logic; accordingly, the relevance of a given document to a given information need should be assessed by checking the validity of the logical formula d → n, where d is the representation of the document, n is the representation of the information need and “→” is the conditional connective of the logic in question. In a recent paper we have proposed Terminological Logics (TLs) as suitable logics for modelling IR within the paradigm described above. This proposal, however, while making a step towards adequately modelling IR in a logical way, does not account for the fact that the relevance of a document to an information need can only be assessed up to a limited degree of certainty. In this work, we try to overcome this limitation by introducing a model of IR based on a Probabilistic TL, i.e. a logic allowing the expression of real-valued terms representing probability values and possibly involving expressions of a TL. Two different types of probabilistic information, i.e. statistical information and information about degrees of belief, can be accounted for in this logic. The paper presents a formal syntax and a denotational (possible-worlds) semantics for this logic, and discusses, by means of a number of examples, its adequacy as a formal tool for describing IR.

Conference on Information and Knowledge Management | 2000

An improved boosting algorithm and its application to text categorization

Fabrizio Sebastiani; Alessandro Sperduti; Nicola Valdambrini

We describe an improved boosting algorithm, called AdaBoost.MH^{KR}, and its application to text categorization. Boosting is a method for supervised learning which has successfully been applied to many different domains, and that has proven one of the best performers in text categorization exercises so far. Boosting is based on the idea of relying on the collective judgment of a committee of classifiers that are trained sequentially. In training the

Information Retrieval | 2008

Boosting multi-label hierarchical text categorization

Andrea Esuli; Tiziano Fagni; Fabrizio Sebastiani

String Processing and Information Retrieval | 2006

Cluster generation and cluster labelling for web snippets: a fast and accurate hierarchical solution

Filippo Geraci; Marco Pellegrini; Marco Maggini; Fabrizio Sebastiani
