Emanuele Marino
University of Florence
Publication
Featured research published by Emanuele Marino.
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2006
Simone Marinai; Emanuele Marino; Giovanni Soda
We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore increase the access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of self-organizing maps (SOM) to perform unsupervised character clustering, the definition of a suitable vector-based word representation whose size depends on the word aspect ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are for processing modern printed documents (17th to 19th centuries) where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.
International Conference on Document Analysis and Recognition | 2003
Simone Marinai; Emanuele Marino; Giovanni Soda
This paper describes a system for efficient indexing and retrieval of words in collections of document images. The proposed method is based on two main principles: unsupervised prototype clustering, and string encoding for efficient string matching. During indexing, a self-organizing map (SOM) is trained to cluster together similar symbols (character-like objects) in a subset of the documents to be stored. Using the trained SOM, the words in the whole collection can be stored and represented with a fixed-length description that can be easily compared in order to rank the most similar words in response to a user query. The system can be automatically adapted to different languages and font styles. The most appropriate applications are for the processing of old documents (18th and 19th centuries) where current OCR engines have more difficulty. Experimental results describe three application scenarios having various levels of difficulty for current OCR systems.
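The unsupervised clustering step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-D feature vectors, the map size, the learning-rate schedule, and the function names `train_som`, `best_unit`, and `encode_word` are all assumptions made for illustration. The sketch encodes a word naively as the sequence of SOM units that best match its symbols and does not reproduce the fixed-length description used in the paper.

```python
import math
import random

def best_unit(weights, x):
    """Return the grid position of the unit whose weights are closest to x."""
    return min(weights, key=lambda u: sum((a - b) ** 2 for a, b in zip(weights[u], x)))

def train_som(samples, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(samples[0])
    sigma0 = max(rows, cols) / 2.0
    # Initialise every unit with a copy of a randomly chosen training sample.
    weights = {(r, c): list(rng.choice(samples))
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        order = samples[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
            br, bc = best_unit(weights, x)
            for (r, c), w in weights.items():
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights

def encode_word(weights, symbol_vectors):
    """Naive word code: the best-matching SOM unit of each symbol, in order."""
    return [best_unit(weights, v) for v in symbol_vectors]

# Hypothetical symbol features forming two loose groups.
symbols = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(symbols, rows=1, cols=2)
code = encode_word(som, [[0.05, 0.05], [0.95, 0.95], [0.05, 0.05]])
```

Two words can then be compared by comparing their unit sequences, which is the general idea behind matching encoded strings instead of recognized characters.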
First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings. | 2004
Simone Marinai; Emanuele Marino; Francesca Cesarini; Giovanni Soda
Large collections of scanned documents (books and journals) are now available in digital libraries. The most common method for retrieving relevant information from these collections is image browsing, but this approach is not feasible for books with more than a few dozen pages. OCR systems can recognize the printed text in the images, in which case retrieval by textual content can be performed. However, the results heavily depend on the quality of the original documents. More sophisticated navigation can be performed when an electronic table of contents of the book is available with links to the corresponding pages. An opposite approach relies on reducing the amount of symbolic information to be extracted at storage time. This is the approach taken by document image retrieval systems. We describe a system that we developed in order to retrieve information from digitized books and journals belonging to digital libraries. The main feature of the system is the ability to combine two principal retrieval strategies in several ways. The first strategy allows a user to find pages with a layout similar to a query page. The second strategy is used to retrieve words in the collection matching a user-defined query, without performing OCR. The combination of these basic strategies allows users to retrieve meaningful pages with little effort during the indexing phase. We describe the basic tools used in the system (layout analysis, layout retrieval, word retrieval) and the integration of these tools for answering complex queries. The experiments were carried out on 1287 pages and show the effectiveness of the integrated retrieval.
International Conference on Document Analysis and Recognition | 2005
Simone Marinai; Emanuele Marino; Giovanni Soda
We analyze a system for the retrieval of document images on the basis of layout similarity. Layout objects are extracted and represented with the XY tree. Page similarity is computed with a tree-edit distance algorithm. The peculiarity of the approach is the use of tree grammars to model the variations in the tree, which are due to segmentation algorithms or to structural differences between documents with similar layout. A few class-independent grammatical rules are used to modify each tree and obtain a reduced tree that is supposed to preserve the most relevant features of the page.
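As a rough illustration of comparing page layouts with a tree-edit distance, the sketch below computes a unit-cost edit distance between two ordered, labeled trees. It is an assumption-laden sketch, not the algorithm from the paper: the unit costs, the node labels ("H" and "V" for horizontal and vertical cuts, "text" and "image" for leaf regions), and the helper names `tree`, `forest_distance`, and `tree_edit_distance` are hypothetical, and the grammar-based tree reduction described in the abstract is not reproduced.

```python
from functools import lru_cache

def tree(label, *children):
    """Build an ordered labeled tree as a hashable (label, children) tuple."""
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        _, kids = g[-1]
        # Insert the rightmost root of g; its children take its place.
        return forest_distance(f, g[:-1] + kids) + 1
    if not g:
        _, kids = f[-1]
        # Delete the rightmost root of f; its children take its place.
        return forest_distance(f[:-1] + kids, g) + 1
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + f_kids, g) + 1
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    # Match the two rightmost roots, paying 1 if their labels differ.
    match = (forest_distance(f_kids, g_kids)
             + forest_distance(f[:-1], g[:-1])
             + (1 if f_label != g_label else 0))
    return min(delete, insert, match)

def tree_edit_distance(a, b):
    return forest_distance((a,), (b,))

# Hypothetical XY-tree-like page layouts.
page_a = tree("H", tree("text"), tree("V", tree("text"), tree("image")))
page_b = tree("H", tree("text"), tree("V", tree("text"), tree("text")))
page_c = tree("H", tree("text"), tree("text"))

d_ab = tree_edit_distance(page_a, page_b)  # one relabel: image -> text
d_ac = tree_edit_distance(page_a, page_c)  # two deletions: "V" and "image"
```

A smaller distance indicates a more similar layout, so pages can be ranked against a query page by this value.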
Document Engineering | 2010
Simone Marinai; Emanuele Marino; Giovanni Soda
We describe a tool for Table of Contents (ToC) identification and recognition from PDF books. This task is part of ongoing research on the development of tools for the semi-automatic conversion of PDF documents to the Epub format, which can be read on several e-book devices. Among the various sub-tasks, ToC extraction and recognition is particularly useful for easy navigation of book contents. The proposed tool first identifies the ToC pages. The bounding boxes of ToC titles in the book body are subsequently found in order to add suitable links in the Epub ToC. The proposed approach is tolerant to discrepancies between the ToC text and the corresponding titles. We evaluated the tool on several open access books edited by University Presses that are partners of the OAPEN EcontentPlus project.
Second International Conference on Document Image Analysis for Libraries (DIAL'06) | 2006
Simone Marinai; Emanuele Marino; Giovanni Soda
We describe a system for retrieving, on the basis of layout similarity, document images belonging to collections stored in digital libraries. Layout regions are extracted and represented with the XY tree. The proposed indexing method combines a new tree clustering algorithm (based on self-organizing maps) with principal component analysis. The combination of these techniques allows us to retrieve the most similar pages from large collections without the need for a direct comparison of the query page with each indexed document.
International Conference on Document Analysis and Recognition | 2011
Simone Marinai; Emanuele Marino; Giovanni Soda
In recent years, interest in e-book readers has grown significantly. Two main document formats are supported by most devices: PDF and ePub. The PDF format is widely used to share documents, allowing cross-platform readability. However, it is not ideal for comfortable reading on small screens. In contrast, the ePub format is re-flowable and well suited for e-book readers. In this paper we describe a system for the conversion of PDF books to the ePub format, aiming at inverting the text formatting made during pagination. To this purpose, layout analysis techniques are used to identify the book's table of contents and the main functional regions such as chapters, paragraphs, and notes.
Machine Learning in Document Analysis and Recognition | 2008
Simone Marinai; Emanuele Marino; Giovanni Soda
In this chapter, we discuss the use of Self Organizing Maps (SOM) to deal with various tasks in Document Image Analysis. The SOM is a particular type of artificial neural network that computes, during learning, an unsupervised clustering of the input data, arranging the cluster centers in a lattice. After an overview of previous applications of unsupervised learning in document image analysis, we present our recent work in the field. We describe the use of the SOM at three processing levels: character clustering, word clustering, and layout clustering, with applications to word retrieval, document retrieval and page classification. In order to improve the clustering effectiveness when dealing with small training sets, we propose an extension of the SOM training algorithm that considers the tangent distance so as to increase the SOM robustness with respect to small transformations of the patterns. Experiments on the use of this extended training algorithm are reported for both character and page layout clustering.
Document Analysis Systems | 2006
Simone Marinai; Stefano Faini; Emanuele Marino; Giovanni Soda
We propose an approach for efficient word retrieval from printed documents belonging to Digital Libraries. The approach combines word image clustering (based on Self Organizing Maps, SOM) with Principal Component Analysis. The combination of these methods allows us to efficiently retrieve the matching words from large document collections without the need for a direct comparison of the query word with each indexed word.
European Conference on Research and Advanced Technology for Digital Libraries | 2007
Simone Marinai; Emanuele Marino; Giovanni Soda
In this paper, we describe a system to perform Document Image Retrieval in Digital Libraries. The system allows users to retrieve digitized pages on the basis of layout similarities and to make textual searches on the documents without relying on OCR. The system is discussed in the context of recent applications of document image retrieval in the field of Digital Libraries. We present the different techniques in a single framework in which the emphasis is put on the representation level at which the similarity between the query and the indexed documents is computed. We also report the results of some recent experiments on the use of layout-based document image retrieval.