Giacomo Berardi

Istituto di Scienza e Tecnologie dell'Informazione

Publication

Featured research published by Giacomo Berardi.

conference on information and knowledge management | 2015

Semi-Automated Text Classification for Sensitivity Identification

Giacomo Berardi; Andrea Esuli; Craig Macdonald; Iadh Ounis; Fabrizio Sebastiani

Sensitive documents are those that cannot be made public, e.g., for personal or organizational privacy reasons. For instance, documents requested through Freedom of Information mechanisms must be manually reviewed for the presence of sensitive information before their actual release. Hence, tools that can assist human reviewers in spotting sensitive information are of great value to government organizations subject to Freedom of Information laws. We look at sensitivity identification in terms of semi-automated text classification (SATC), the task of ranking automatically classified documents so as to optimize the cost-effectiveness of human post-checking work. We use a recently proposed utility-theoretic approach to SATC that explicitly optimizes the chosen effectiveness function when ranking the documents by sensitivity; this is especially useful in our case, since sensitivity identification is a recall-oriented task, thus requiring the use of a recall-oriented evaluation measure such as F2. We show the validity of this approach by running experiments on a multi-label multi-class dataset of government documents manually annotated according to different types of sensitivity.
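As a brief illustration of why F2 counts as recall-oriented, the sketch below computes the general F-beta measure from contingency counts. The function name, the example counts, and the convention of returning 1.0 when the denominator is zero are assumptions made for this example and are not taken from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from true positives, false positives, and false negatives.

    With beta > 1, a false negative (a missed sensitive document) costs
    more than a false positive. beta = 2 gives F2, beta = 1 gives F1.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Assumed convention: nothing to find and nothing predicted scores 1.0.
    return (1 + b2) * tp / denom if denom else 1.0


# Two hypothetical classifiers with the same F1 but different error types.
few_misses = f_beta(tp=8, fp=4, fn=2)    # more false alarms, fewer misses
many_misses = f_beta(tp=8, fp=2, fn=4)   # fewer false alarms, more misses
print(round(few_misses, 3), round(many_misses, 3))
```

Under F1 the two hypothetical classifiers are tied, while F2 prefers the one that misses fewer positive documents.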

ACM Transactions on Knowledge Discovery From Data | 2015

Utility-Theoretic Ranking for Semiautomated Text Classification

Giacomo Berardi; Andrea Esuli; Fabrizio Sebastiani

Semiautomated Text Classification (SATC) may be defined as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious SATC strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
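The contrast between the two ranking strategies can be sketched roughly as follows. This is a simplified illustration, not the authors' method: the `Labelled` record, the posterior probabilities, and the constant gains `gain_fp` and `gain_fn` are all hypothetical stand-ins, whereas the paper derives validation gain from the chosen effectiveness function.

```python
from dataclasses import dataclass


@dataclass
class Labelled:
    doc_id: str
    predicted_positive: bool  # label assigned by the classifier
    p_positive: float         # assumed estimate of P(true label is positive)


def error_probability(d: Labelled) -> float:
    """Probability that the automatically assigned label is wrong."""
    return 1.0 - d.p_positive if d.predicted_positive else d.p_positive


def rank_by_low_confidence(docs):
    """Baseline: documents most likely to be mislabelled come first."""
    return sorted(docs, key=error_probability, reverse=True)


def rank_by_expected_gain(docs, gain_fp: float, gain_fn: float):
    """Weight the error probability by the (hypothetical) gain obtained
    from correcting that kind of error: a false positive or a false negative."""
    def expected_gain(d: Labelled) -> float:
        gain = gain_fp if d.predicted_positive else gain_fn
        return error_probability(d) * gain
    return sorted(docs, key=expected_gain, reverse=True)


docs = [
    Labelled("a", predicted_positive=True, p_positive=0.55),
    Labelled("b", predicted_positive=False, p_positive=0.40),
    Labelled("c", predicted_positive=True, p_positive=0.90),
]
baseline = [d.doc_id for d in rank_by_low_confidence(docs)]
by_gain = [d.doc_id for d in rank_by_expected_gain(docs, gain_fp=1.0, gain_fn=2.0)]
print(baseline, by_gain)
```

When correcting a false negative is assumed to be worth twice as much as correcting a false positive, document "b" overtakes "a" even though the classifier is slightly more confident about it, which is the kind of reordering a confidence-only ranking cannot produce.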

theory and practice of digital libraries | 2012

Metadata enrichment services for the Europeana digital library

Giacomo Berardi; Andrea Esuli; Sergiu Gordea; Diego Marcheggiani; Fabrizio Sebastiani

We demonstrate a metadata enrichment system for the Europeana digital library. The system allows the institutions that provide Europeana with pointers to their content (in the form of metadata records, MRs) to enrich their MRs by classifying them under a classification scheme of their choice, and to extract and highlight entities of significant interest within the MRs themselves. The use of a supervised learning metaphor allows each content provider (CP) to generate classifiers and extractors tailored to the CP's specific needs, thus making the tool effectively available to the multitude (2000+) of Europeana CPs.

international acm sigir conference on research and development in information retrieval | 2014

Semi-automated text classification

Giacomo Berardi

There is currently a high demand for information systems that automatically analyze textual data, since many organizations, both private and public, need to process large amounts of such data as part of their daily routine, an activity that cannot be performed by means of human work only. One of the answers to this need is text classification (TC), the task of automatically labelling textual documents from a domain D with thematic categories from a predefined set C. Modern text classification systems have reached high efficiency standards, but cannot always guarantee the labelling accuracy that applications demand. When the level of accuracy that can be obtained is insufficient, one may revert to processes in which classification is performed via a combination of automated activity and human effort. One such process is semi-automated text classification (SATC), which we define as the task of ranking a set D of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of D with the goal of increasing the overall labelling accuracy of D, the expected increase is maximized. An obvious strategy is to rank D so that the documents that the classifier has labelled with the lowest confidence are top-ranked. In this dissertation we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of validation gain, defined as the improvement in classification effectiveness that would derive from validating a given automatically labelled document. We also propose new effectiveness measures for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a ranked list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method above, and according to the proposed measures, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error. We therefore explore the task of SATC and the potential of our methods in multiple text classification contexts. This dissertation is, to the best of our knowledge, the first to systematically address the task of semi-automated text classification.

applications of natural language to data bases | 2012

Blog distillation via sentiment-sensitive link analysis

Giacomo Berardi; Andrea Esuli; Fabrizio Sebastiani; Fabrizio Silvestri

In this paper we approach blog distillation by adding a link analysis phase to the standard retrieval-by-topicality phase, in which we also check whether a given hyperlink is a citation with a positive or a negative nature. This allows us to test the hypothesis that distinguishing approval from disapproval brings about benefits in blog distillation.

international acm sigir conference on research and development in information retrieval | 2016

Sedano: A News Stream Processor for Business

Ugo Scaiella; Giacomo Berardi; Giuliano Mega; Roberto Santoro

We present Sedano, a system for processing and indexing a continuous stream of business-related news. Sedano defines pipelines whose stages analyze and enrich news items (e.g., newspaper articles and press releases). News data coming from several content sources are stored, processed and then indexed in order to be consumed by Atoka, our business intelligence product. Atoka users can retrieve news about specific companies, filtering according to various facets. Sedano features both an entity-linking phase, which finds mentions of companies in news, and a classification phase, which classifies news according to a set of business events. Its flexible architecture allows Sedano to be deployed on commodity machines while being scalable and fault-tolerant.

empirical methods in natural language processing | 2015

A Multi-lingual Annotated Dataset for Aspect-Oriented Opinion Mining

Salud María Jiménez-Zafra; Giacomo Berardi; Andrea Esuli; Diego Marcheggiani; María Teresa Martín-Valdivia; Alejandro Moreo Fernández

We present the Trip-MAML dataset, a Multi-Lingual dataset of hotel reviews that have been manually annotated at the sentence level with Multi-Aspect sentiment labels. This dataset has been built as an extension of an existing English-only dataset, adding documents written in Italian and Spanish. We detail the dataset construction process, covering the data gathering, selection, and annotation. We present inter-annotator agreement figures and baseline experimental results, comparing the three languages. Trip-MAML is a multi-lingual dataset for aspect-oriented opinion mining that enables researchers (i) to face the problem on languages other than English and (ii) to experiment with the application of cross-lingual learning methods to the task.

acm symposium on applied computing | 2015

On the impact of entity linking in microblog real-time filtering

Giacomo Berardi; Diego Ceccarelli; Andrea Esuli; Diego Marcheggiani

Microblogging is a model of content sharing in which the temporal locality of posts with respect to important events, of either foreseeable or unforeseeable nature, makes applications of real-time filtering of great practical interest. We propose the use of Entity Linking (EL) to improve retrieval effectiveness by enriching the representation of microblog posts and filtering queries. EL is the process of recognizing in an unstructured text the mention of relevant entities described in a knowledge base. EL of short pieces of text is a difficult task, but it is also a scenario in which the information EL adds to the text can have a substantial impact on the retrieval process. We implement a state-of-the-art filtering method, based on the best systems from the TREC Microblog track real-time ad hoc retrieval and filtering tasks, and extend it with a Wikipedia-based EL method. Results show that the use of EL significantly improves over non-EL-based versions of the filtering methods.

acm symposium on applied computing | 2015

Classifying websites by industry sector: a study in feature design

Giacomo Berardi; Andrea Esuli; Tiziano Fagni; Fabrizio Sebastiani

Classifying companies by industry sector is an important task in finance, since it allows investors and research analysts to analyse specific subsectors of local and global markets for investment monitoring and planning purposes. Traditionally this classification activity has been performed manually, by dedicated specialists carrying out in-depth analysis of a company's public profile. However, this is more and more unsuitable in today's globalised markets, in which new companies spring up, old companies cease to exist, and existing companies refocus their efforts to different sectors at an astounding pace. As a result, tools for performing this classification automatically are increasingly needed. We address the problem of classifying companies by industry sector via the automatic classification of their websites, since the latter provide rich information about the nature of the company and the market segment it targets. We have built a website classification system and tested its accuracy on a dataset of more than 20,000 company websites classified according to a 2-level taxonomy of 216 leaf classes explicitly designed for market research purposes. Our experimental study provides interesting insights as to which types of features are the most useful for this classification task.

text retrieval conference | 2011