Margherita Berardi | Researchain


Publication

Featured research published by Margherita Berardi.

First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings. | 2004

Machine learning methods for automatically processing historical documents: from paper acquisition to XML transformation

Floriana Esposito; Donato Malerba; Giovanni Semeraro; Stefano Ferilli; Oronzo Altamura; Teresa Maria Altomare Basile; Margherita Berardi; Michelangelo Ceci; N. Di Mauro

One of the aims of the EU project COLLATE is to design and implement a Web-based collaboratory for archives, scientists and end-users working with digitized cultural material. Since the originals of such material are often unique and scattered across various archives, making them widely accessible is difficult. A solution would be to develop intelligent document processing tools that automatically transform printed documents into a Web-accessible form such as XML. Here, we propose the use of a document processing system, WISDOM++, which relies heavily on machine learning techniques to perform this task, and report promising results obtained in preliminary experiments.

international conference theory and practice digital libraries | 2003

Document-Centered Collaboration for Scholars in the Humanities - The COLLATE System

Ingo Frommholz; Holger Brocks; Ulrich Thiel; Erich J. Neuhold; Luigi Iannone; Giovanni Semeraro; Margherita Berardi; Michelangelo Ceci

In contrast to electronic document collections we find in contemporary digital libraries, systems applied in the cultural domain have to satisfy specific requirements with respect to data ingest, management, and access. Such systems should also be able to support the collaborative work of domain experts and furthermore offer mechanisms to exploit the value-added information resulting from a collaborative process like scientific discussions. In this paper, we present the solutions to these requirements developed and realized in the COLLATE system, where advanced methods for document classification, content management, and a new kind of context-based retrieval using scientific discourses are applied.

international conference on document analysis and recognition | 2003

Correcting the document layout: a machine learning approach

Donato Malerba; Floriana Esposito; Oronzo Altamura; Michelangelo Ceci; Margherita Berardi

In this paper, a machine learning approach to support the user during the correction of the layout analysis is proposed. Layout analysis is the process of extracting a hierarchical structure describing the layout of a page. In our approach, the layout analysis is performed in two steps: first, the global analysis determines possible areas containing paragraphs, sections, columns, figures and tables; second, the local analysis groups together blocks that possibly fall within the same area. The result of the local analysis process strongly depends on the quality of the results of the first step. We investigate the possibility of supporting the user during the correction of the results of the global analysis. This is done by allowing the user to correct the results of the global analysis and then by learning rules for layout correction from the sequence of user actions. Experimental results on a set of multi-page documents are reported and discussed.

international symposium on methodologies for intelligent systems | 2005

Mining and filtering multi-level spatial association rules with ARES

Annalisa Appice; Margherita Berardi; Michelangelo Ceci; Donato Malerba

In spatial data mining, a common task is the discovery of spatial association rules from spatial databases. We propose a distributed system, named ARES, that takes advantage of a multi-relational approach to mine spatial association rules. It supports spatial database coupling and discovery of multi-level spatial association rules as a means for spatial data exploration. We also present some criteria to bias the search and to filter the discovered rules according to users' expectations. Finally, we show the applicability of our proposal to two different real-world domains, namely, document image processing and geo-referenced analysis of census data.

Applied Artificial Intelligence | 2007

Relational Data Mining and ILP for Document Image Understanding

Michelangelo Ceci; Margherita Berardi; Donato Malerba

Document image understanding denotes the recognition of semantically relevant components in the layout extracted from a document image. This recognition process is based on domain-specific knowledge that can be acquired automatically by applying data mining techniques. The spatial dimension of page layout makes classification methods developed in inductive logic programming (ILP) and multi-relational data mining (MRDM) the most suitable candidates for this specific task. In this paper, both approaches are considered and empirically compared on three different data sets consisting of multi-page articles published in an international journal and historical documents. The ILP method is able to learn recursive logical theories that express dependencies between logical components, while the MRDM method extends the naïve Bayesian classifier to data stored in multiple tables of a relational database. Experimental results confirm the importance of the spatial dimension for this application and show that the ILP method tends to be conservative, with a high percentage of omission errors and a low percentage of commission errors, while the probabilistic nature of the MRDM method allows us to trade off the two types of error.
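The trade-off between omission and commission errors mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a probabilistic classifier that outputs a score per layout component and a decision threshold; the function name `error_counts`, the scores, and the labels are hypothetical toy values, not data or code from the paper.

```python
def error_counts(scores, labels, threshold):
    """Count omission and commission errors at a given decision threshold.

    Omission: a true positive example is rejected (score below threshold).
    Commission: a negative example is accepted (score at or above threshold).
    """
    omission = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    commission = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    return omission, commission


# Hypothetical posterior scores and ground-truth labels for six components.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False]

# Raising the threshold makes the classifier more conservative:
# omission errors grow while commission errors shrink.
for threshold in (0.25, 0.5, 0.9):
    print(threshold, error_counts(scores, labels, threshold))
```

A purely logical classifier has no such threshold to adjust, which is one way to read the abstract's remark that the ILP method tends to be conservative.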

Machine Learning in Document Analysis and Recognition | 2008

Machine Learning for Reading Order Detection in Document Image Understanding

Donato Malerba; Michelangelo Ceci; Margherita Berardi

Summary. Document image understanding refers to logical and semantic analysis of document images in order to extract information understandable to humans and codify it into machine-readable form. Most of the studies on document image understanding have targeted the specific problem of associating layout components with logical labels, while less attention has been paid to the problem of extracting relationships between logical components, such as cross-references. In this chapter, we investigate the problem of detecting the reading order relationship between components of a logical structure. The domain specific knowledge required for this task is automatically acquired from a set of training examples by applying a machine learning method. The input of the learning method is the description of “chains” of layout components defined by the user. The output is a logical theory which defines two predicates, first_to_read/1 and succ_in_reading/2, useful for consistently reconstructing all chains in the training set. Only spatial information on the page layout is exploited for both single and multiple chain reconstruction. The proposed approach has been evaluated on a set of document images processed by the system WISDOM++. Documents are characterized by two important structures: the layout structure and the logical structure. Both are the results of repeatedly dividing the content of a document into increasingly smaller parts, and are typically represented by means of a tree structure. The difference between them is the criteria adopted for structuring the document content: the layout structure is based on the presentation of the content, while the logical structure is based on the human-perceptible meaning of the content.
The extraction of the layout structures from images of scanned paper documents is a complex process, typically denoted as document layout analysis, which involves several steps including preprocessing, page decomposition (or segmentation), classification of segments according to content type (e.g., text, graphics, pictures) and hierarchical organization on the basis of perceptual

industrial and engineering applications of artificial intelligence and expert systems | 2005

Mining generalized association rules on biomedical literature

Margherita Berardi; Michele Lapi; Pietro Leo; Corrado Loglisci

The discovery of new and potentially meaningful relationships between concepts in the biomedical literature has attracted the attention of many researchers in text mining. The main motivation is the increasing availability of the biomedical literature, which makes it difficult for researchers in biomedicine to keep up with research progress without the help of automatic knowledge discovery techniques. More than 14 million abstracts of this literature are contained in the Medline collection and are available online. In this paper we present the application of an association rule mining method to Medline abstracts in order to detect associations between concepts as an indication of the existence of a biomedical relation among them. The discovery process fully exploits the MeSH (Medical Subject Headings) taxonomy, that is, a set of hierarchically related biomedical terms which makes it possible to express associations at different levels of abstraction (generalized association rules). We report experimental results on a collection of abstracts obtained by querying Medline on a specific disease and we show the effectiveness of some filtering and browsing techniques designed to manage the huge amount of generalized associations that may be generated on real data.
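To make the idea of generalized association rules concrete, here is a minimal sketch of one common way to mine them: each transaction is extended with the taxonomy ancestors of its terms, then ordinary support and confidence are computed. This is an illustration under stated assumptions, not the method used in the paper. The taxonomy `TAXONOMY`, the toy abstracts, and the function names are all hypothetical, and only pairwise rules are considered.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy taxonomy (child -> parent); not actual MeSH data.
TAXONOMY = {
    "Insulin": "Hormones",
    "Glucagon": "Hormones",
    "Diabetes Mellitus, Type 1": "Diabetes Mellitus",
    "Diabetes Mellitus, Type 2": "Diabetes Mellitus",
}


def ancestors(term):
    """Return all ancestors of a term in the toy taxonomy."""
    result = set()
    while term in TAXONOMY:
        term = TAXONOMY[term]
        result.add(term)
    return result


def extend(transaction):
    """Add every ancestor of every term, so rules can hold at higher levels."""
    extended = set(transaction)
    for term in transaction:
        extended |= ancestors(term)
    return extended


def generalized_rules(transactions, min_support, min_confidence):
    """Mine pairwise rules lhs -> rhs over taxonomy-extended transactions."""
    extended = [extend(t) for t in transactions]
    n = len(extended)
    item_counts = Counter(item for t in extended for item in t)
    pair_counts = Counter(
        pair for t in extended for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        # Skip trivial pairs where one term is an ancestor of the other.
        if a in ancestors(b) or b in ancestors(a):
            continue
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Each set stands for the terms indexing one (hypothetical) abstract.
abstracts = [
    {"Insulin", "Diabetes Mellitus, Type 2"},
    {"Glucagon", "Diabetes Mellitus, Type 1"},
    {"Insulin", "Diabetes Mellitus, Type 1"},
]
rules = generalized_rules(abstracts, min_support=0.6, min_confidence=0.9)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs} (support={support:.2f}, confidence={confidence:.2f})")
```

In this toy data no rule between leaf-level terms reaches the support threshold, but the generalized rule between "Hormones" and "Diabetes Mellitus" does, which is the kind of higher-level association a taxonomy makes visible. It also hints at why filtering is needed: extending transactions with ancestors multiplies the number of candidate rules.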

inductive logic programming | 2007

Learning Recursive Patterns for Biomedical Information Extraction

Margherita Berardi; Donato Malerba

Information in text form remains a greatly unexploited source of biological information. Information Extraction (IE) techniques are necessary to map this information into structured representations that allow facts relating domain-relevant entities to be automatically recognized. In biomedical IE tasks, extracting patterns that model implicit relations among entities is particularly important since biological systems intrinsically involve interactions among several entities. In this paper, we resort to an Inductive Logic Programming (ILP) approach for the discovery of mutually recursive patterns from text. Mutual recursion allows dependencies among entities to be explored in data and extraction models to be applied in a context-sensitive mode. In particular, IE models are discovered in the form of classification rules encoding the conditions to fill a pre-defined information template. An application to a real-world dataset composed of publications selected to support biologists in the task of automatic annotation of a genomic database is reported.

database and expert systems applications | 2004

A data mining approach to PubMed query refinement

Margherita Berardi; Michele Lapi; Pietro Leo; Donato Malerba; Caterina Marinelli; Gaetano Scioscia

Finding disease relationships requires laborious examination of hundreds of possible candidate heterogeneous factors. Much of the related information is currently contained in biological and medical journals, making biomedical text mining a central bioinformatic problem. More than 14 million abstracts of such papers are contained in the Medline collection and are available online. In this paper we present a data mining engine, namely MeSH Terms Associator (MTA), that has been employed in a distributed architecture to refine a generic PubMed query by means of discovery of concept relations in the form of association rules. However, the number of discovered association rules is usually high and the interest of most of them does not fulfil user expectations. In addition, the presentation of thousands of rules can discourage users from interpreting them. To overcome this problem we investigate the application of some filtering techniques. Experimental results on datasets corresponding to real-world biomedical queries are discussed and future directions are drawn.

international conference on data mining | 2006

Segmentation of Evolving Complex Data and Generation of Models

Corrado Loglisci; Margherita Berardi

The problem of time-series segmentation has been widely discussed, and segmentation has been successfully applied in a variety of areas including computational genomics, telecommunications and process monitoring. Nevertheless, not many techniques have been devised to deal with multidimensional evolving data describing complex objects. Moreover, in many applications the resulting segments lack a description understandable to the user, and this is exacerbated in applications with complex data. Our contribution proposes an algorithmic framework to segment multidimensional evolving data, or multidimensional time-series, and to resort to an ILP system to generate characterizations of segments close to the user. An application to real-world data and its results are reported.
