Francesca Cesarini
University of Florence
Publication
Featured research published by Francesca Cesarini.
IEEE Transactions on Pattern Analysis and Machine Intelligence | 1998
Francesca Cesarini; Marco Gori; Simone Marinai; Giovanni Soda
We describe a flexible form-reader system capable of extracting textual information from accounting documents, like invoices and bills of service companies. In this kind of document, the extraction of some information fields cannot take place without having detected the corresponding instruction fields, which are only constrained to range in given domains. We propose modeling the documents layout by means of attributed relational graphs, which turn out to be very effective for form registration, as well as for performing a focused search for instruction fields. This search is carried out by means of a hybrid model, where proper algorithms, based on morphological operations and connected components, are integrated with connectionist models. Experimental results are given in order to assess the actual performance of the system.
International Journal on Document Analysis and Recognition | 2001
Enrico Appiani; Francesca Cesarini; Anna Maria Colla; Michelangelo Diligenti; Marco Gori; Simone Marinai; Giovanni Soda
In this paper a system for analysis and automatic indexing of imaged documents for high-volume applications is described. This system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine, which overcomes the bottleneck of document profiling by bypassing some limitations of existing pre-defined indexing schemes. The engine exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout. The originality of STRETCH lies principally in the possibility for unskilled users to define the indexes relevant to the document domains of their interest by simply presenting visual examples and applying reliable automatic information extraction methods (document classification, flexible reading strategies) to index the documents automatically, thus creating archives as desired. STRETCH offers ease of use and application programming and the ability to dynamically adapt to new types of documents. The system has been tested in two applications in particular, one concerning passive invoices and the other bank documents. In these applications, several classes of documents are involved. The indexing strategy first automatically classifies the document, thus avoiding pre-sorting, then locates and reads the information pertaining to the specific document class. Experimental results are encouraging overall; in particular, document classification results fulfill the requirements of high-volume applications. Integration into production lines is under way.
International Conference on Document Analysis and Recognition | 1999
Francesca Cesarini; Marco Gori; Simone Marinai; Giovanni Soda
We describe a top-down approach to the segmentation and representation of documents containing tabular structures. Examples of these documents are invoices and technical papers with tables. The segmentation is based on an extension of X-Y trees, where the regions are split by means of cuts along separators (e.g. lines), in addition to cuts along white spaces. The leaves describe regions containing homogeneous information and cutting separators. Adjacency links among leaves of the tree describe local relationships between corresponding regions.
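The white-space cutting that X-Y trees are built on can be illustrated with a minimal sketch. It assumes a binary page image given as a list of rows (1 = ink, 0 = background) and only implements the classical cuts along empty rows and columns; the cuts along separators and the adjacency links described in the abstract are not modelled. The names `Node`, `xy_cut` and `leaves` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (top, left, bottom, right); bottom and right are exclusive
Box = Tuple[int, int, int, int]


@dataclass
class Node:
    box: Box
    children: List["Node"] = field(default_factory=list)


def _runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges where the projection profile is non-zero."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def xy_cut(img: List[List[int]], box: Box,
           horizontal: bool = True, stuck: bool = False) -> Node:
    """Recursively split `box` along empty rows, then empty columns, and so on."""
    top, left, bottom, right = box
    if horizontal:
        # Row projection: amount of ink in each row of the region.
        profile = [sum(img[r][left:right]) for r in range(top, bottom)]
        parts = [(top + a, left, top + b, right) for a, b in _runs(profile)]
    else:
        # Column projection: amount of ink in each column of the region.
        profile = [sum(img[r][c] for r in range(top, bottom))
                   for c in range(left, right)]
        parts = [(top, left + a, bottom, left + b) for a, b in _runs(profile)]
    if not parts:
        return Node(box)  # empty region
    if len(parts) == 1:
        if stuck:
            return Node(parts[0])  # no cut in either direction: a leaf
        return xy_cut(img, parts[0], not horizontal, stuck=True)
    node = Node(box)
    node.children = [xy_cut(img, p, not horizontal) for p in parts]
    return node


def leaves(node: Node) -> List[Box]:
    """Collect the boxes of all leaf regions, left to right in tree order."""
    if not node.children:
        return [node.box]
    return [b for child in node.children for b in leaves(child)]


page = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
tree = xy_cut(page, (0, 0, 4, 5))
print(leaves(tree))
```

On this toy page the tree first splits at the empty row and then splits the upper band at the empty column, giving three leaf regions.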
International Conference on Document Analysis and Recognition | 1997
Francesca Cesarini; Enrico Francesconi; Marco Gori; Simone Marinai; Jianqing Sheng; Giovanni Soda
Much attention has recently been paid to the recognition of graphical objects, such as company logos and trademarks. Recognizing these objects facilitates the recognition of document classes. Some promising results have been achieved by using autoassociator-based artificial neural networks (AANN) in the presence of homogeneously distributed noise. However, the performance drops significantly when dealing with spot-noisy logos, where strips or blobs produce a partial obstruction of the pictures. We propose a new approach for training AANNs especially conceived for dealing with spot noise. The basic idea is to introduce new metrics for assessing the reproduction error in AANNs. The proposed algorithm, referred to as spot-backpropagation (S-BP), is significantly more robust with respect to spot-noise than classical Euclidean norm-based backpropagation (BP). Our experimental results are based on a database of 88 real logos that are artificially corrupted by spot-noise.
International Conference on Pattern Recognition | 2002
Francesca Cesarini; Simone Marinai; L. Sarti; Giovanni Soda
We describe an approach for table location in document images. The documents are described by means of a hierarchical representation that is based on the MXY tree. The presence of a table is hypothesized by searching for parallel lines in the MXY tree of the page. This hypothesis is afterwards verified by locating perpendicular lines or white spaces in the region included between the parallel lines. Lastly, located tables can be merged on the basis of proximity and similarity criteria. The use of an optimization method that relies on the definition of an appropriate table location index allows us to identify the optimal values of the thresholds involved in the algorithm. In this way the algorithm can be adapted to recognize tables with different features by maximizing the performance on an appropriate training set. The algorithm has been evaluated on two data-sets containing more than 1500 pages, and its results have been compared with the tables identified by two commercial OCRs.
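The hypothesize-then-verify idea can be sketched as follows. This is only an illustration under strong assumptions: lines are taken as already extracted segments rather than read from an MXY tree, verification uses perpendicular lines only (not white spaces), the merging step is omitted, and the `min_overlap` threshold stands in for the kind of parameter the abstract says is tuned by optimization. All names (`HLine`, `VLine`, `locate_tables`) are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class HLine(NamedTuple):
    y: int
    x0: int
    x1: int


class VLine(NamedTuple):
    x: int
    y0: int
    y1: int


def overlap_ratio(a: HLine, b: HLine) -> float:
    """Horizontal overlap of two lines divided by their joint extent."""
    inter = min(a.x1, b.x1) - max(a.x0, b.x0)
    union = max(a.x1, b.x1) - min(a.x0, b.x0)
    return max(inter, 0) / union if union > 0 else 0.0


def locate_tables(hlines: List[HLine], vlines: List[VLine],
                  min_overlap: float = 0.8) -> List[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) boxes of verified table hypotheses."""
    ordered = sorted(hlines)  # top to bottom
    tables = []
    for upper, lower in zip(ordered, ordered[1:]):
        # Hypothesis: two consecutive parallel lines with similar extent.
        if overlap_ratio(upper, lower) < min_overlap:
            continue
        x0, x1 = max(upper.x0, lower.x0), min(upper.x1, lower.x1)
        # Verification: a perpendicular line crossing the band between them.
        if any(x0 <= v.x <= x1 and v.y0 < lower.y and v.y1 > upper.y
               for v in vlines):
            tables.append((upper.y, x0, lower.y, x1))
    return tables


hlines = [HLine(10, 0, 100), HLine(50, 0, 100), HLine(200, 0, 30)]
vlines = [VLine(40, 10, 50)]
print(locate_tables(hlines, vlines))
```

In the example only the first two horizontal lines form a table; the third is too short to match the second.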
First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings. | 2004
Simone Marinai; Emanuele Marino; Francesca Cesarini; Giovanni Soda
Large collections of scanned documents (books and journals) are now available in digital libraries. The most common method for retrieving relevant information from these collections is image browsing, but this approach is not feasible for books with more than a few dozen pages. Printed text can be recognized in the images by OCR systems, and in this case retrieval by textual content can be performed. However, the results heavily depend on the quality of the original documents. More sophisticated navigation can be performed when an electronic table of contents of the book is available with links to the corresponding pages. An opposite approach relies on reducing the amount of symbolic information to be extracted at storage time. This approach is adopted by document image retrieval systems. We describe a system that we developed in order to retrieve information from digitized books and journals belonging to digital libraries. The main feature of the system is the ability to combine two principal retrieval strategies in several ways. The first strategy allows a user to find pages with a layout similar to a query page. The second strategy is used in order to retrieve words in the collection matching a user-defined query, without performing OCR. The combination of these basic strategies allows users to retrieve meaningful pages with little effort during the indexing phase. We describe the basic tools used in the system (layout analysis, layout retrieval, word retrieval) and the integration of these tools for answering complex queries. The experiments are carried out on 1287 pages and show the effectiveness of the integrated retrieval.
International Journal on Document Analysis and Recognition | 2003
Francesca Cesarini; Enrico Francesconi; Marco Gori; Giovanni Soda
In this paper a system for processing documents that can be grouped into classes is illustrated. We have considered invoices as a case study. The system is divided into three phases: document analysis, classification, and understanding. We illustrate the analysis and understanding phases. The system is based on knowledge constructed by means of a learning procedure. The experimental results demonstrate the reliability of our document analysis and understanding procedures. They also present evidence that it is possible to use a small learning set of invoices to obtain reliable knowledge for the understanding phase.
IEEE Transactions on Knowledge and Data Engineering | 1996
Alessandro Artale; Francesca Cesarini; Giovanni Soda
We formally investigate the structural similarities and differences existing between object database models and concept languages establishing a correspondence between the two environments. Object database models deal with two kinds of data: individual objects, which have an identity, and values, which can be basic values or can have complex structures containing both basic values and objects. Concept languages only deal with individual objects. The correspondence points out the different role played by objects and values in both approaches and defines a way of properly mapping database descriptions into concept language descriptions at both a terminological and assertional level. Once the mapping is achieved, object databases can take advantage of both the algorithms and the results concerning their complexity developed in concept languages.
Document Analysis Systems | 2002
Francesca Cesarini; Simone Marinai; Giovanni Soda
Document image retrieval can be carried out either by processing the converted text (obtained with OCR) or by measuring the layout similarity of images. We describe a system for document image retrieval based on layout similarity. The layout is described by means of a tree-based representation: the Modified X-Y tree. Each page in the database is represented by a feature vector containing both global features of the page and a vectorial representation of its layout that is derived from the corresponding MXY tree. Occurrences of tree patterns are handled similarly to index terms in Information Retrieval in order to compute the similarity. When retrieving relevant documents, the images in the collection are sorted on the basis of a measure that combines two values describing the similarity of global features and of the occurrences of tree patterns. The system is applied to the retrieval of documents belonging to digital libraries. Tests of the system are made on a data-set of more than 600 pages belonging to a journal of the 19th Century, and on a collection of monographs printed in the same Century and containing more than 600 pages.
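A minimal sketch of such a combined ranking is given below. The abstract only states that tree-pattern occurrences are treated like index terms and that two similarity values are combined; the tf-idf weighting, the cosine measure, the distance-based global similarity and the linear blend with weight `alpha` are all assumptions made for illustration, as are the pattern labels and feature values in the example.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf(pages: List[Counter]) -> List[Dict[str, float]]:
    """Weight each tree-pattern count by its inverse document frequency."""
    n = len(pages)
    df = Counter()
    for page in pages:
        df.update(page.keys())
    return [{t: c * math.log(n / df[t]) for t, c in page.items()}
            for page in pages]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def global_similarity(g1: Sequence[float], g2: Sequence[float]) -> float:
    """Map the Euclidean distance of global features into (0, 1]."""
    return 1.0 / (1.0 + math.dist(g1, g2))


def rank(query: int, global_feats: List[Sequence[float]],
         patterns: List[Counter], alpha: float = 0.5) -> List[int]:
    """Return the indices of the other pages, most similar to `query` first."""
    weights = tfidf(patterns)
    scores = []
    for i in range(len(patterns)):
        if i == query:
            continue
        score = (alpha * global_similarity(global_feats[query], global_feats[i])
                 + (1 - alpha) * cosine(weights[query], weights[i]))
        scores.append((score, i))
    return [i for _, i in sorted(scores, reverse=True)]


# Hypothetical pattern labels and global features for three pages.
patterns = [
    Counter({"H(V,V)": 2, "V(H,H)": 1}),
    Counter({"H(V,V)": 2, "V(H,H)": 1}),
    Counter({"V(T,T)": 3}),
]
global_feats = [(0.5, 0.2), (0.5, 0.25), (0.9, 0.8)]
print(rank(0, global_feats, patterns))
```

Page 1 shares both its tree patterns and its global features with the query page, so it is ranked ahead of page 2.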
Graphics Recognition | 1995
Francesca Cesarini; Marco Gori; Simone Marinai; Giovanni Soda
This paper addresses the problem of locating and recognizing graphic items in document images. The proposed approach allows us to recognize such items also in the presence of high noise, scaling, and rotation. This is accomplished by a hybrid model which performs graphic item location by morphological operations and connected component analysis, and item recognition by a proper connectionist model. Some very promising experimental results are reported to support the proposed algorithms.
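The connected component analysis used in the location step can be sketched with a plain breadth-first labelling. This is a generic textbook procedure shown under stated assumptions (a binary image as a list of rows, 8-connectivity, bounding boxes as output); it does not reproduce the morphological operations or the connectionist recognizer of the paper, and the function name is hypothetical.

```python
from collections import deque
from typing import List, Tuple


def connected_components(img: List[List[int]]) -> List[Tuple[int, int, int, int]]:
    """Return inclusive (top, left, bottom, right) boxes of 8-connected blobs."""
    height = len(img)
    width = len(img[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not img[r][c] or seen[r][c]:
                continue
            # Start a new component and grow it with breadth-first search.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


image = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
print(connected_components(image))
```

Each returned box is a candidate region that a later recognition stage could crop and classify.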