Giansalvatore Mecca
University of Basilicata
Publications
Featured research published by Giansalvatore Mecca.
Symposium on Principles of Database Systems | 1997
Paolo Atzeni; Giansalvatore Mecca
The paper develops EDITOR, a language for manipulating semi-structured documents, such as the ones typically available on the Web. EDITOR programs allow one to search and restructure a document. They are based on two simple ideas, taken from text editors: “search” instructions are used to select regions of interest in a document, and “cut & paste” to restructure them. We study the expressive power and the complexity of these programs. We show that they are computationally complete, in the sense that any computable document restructuring can be expressed in EDITOR. We also study the complexity of a safe subclass of programs, showing that it captures exactly the class of polynomial-time restructurings. The language has been implemented in Java, and is used in the ARANEUS project to build database views over Web sites.
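To make the search and cut-and-paste idea concrete, here is a minimal sketch in Python; this is not EDITOR's actual syntax, and the document fragment and pattern are invented for illustration. A pattern selects regions of interest, and the selected pieces are pasted into a new layout.

    import re

    # Toy "document": a semi-structured fragment, Web-style (invented).
    doc = "<li>Atzeni, P. - Roma Tre</li>\n<li>Mecca, G. - Basilicata</li>"

    # "Search": select regions of interest with a pattern.
    regions = re.findall(r"<li>(.*?) - (.*?)</li>", doc)

    # "Cut & paste": restructure the selected regions into a new document.
    restructured = "\n".join(f"{affil}: {name}" for name, affil in regions)
    print(restructured)
    # Roma Tre: Atzeni, P.
    # Basilicata: Mecca, G.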
Extending Database Technology | 1998
Paolo Atzeni; Giansalvatore Mecca; Paolo Merialdo
A methodology for designing and maintaining large Web sites is introduced. It is especially useful when the data to be published in the site are managed using a DBMS. The design process is composed of two intertwined activities: database design and hypertext design. Each of these is further divided into a conceptual phase and a logical phase, based on specific data models proposed in our project. The methodology strongly supports site maintenance: in fact, the various models provide a concise description of the site structure; they allow one to reason about the overall organization of pages in the site and possibly to restructure it.
Journal of the ACM | 2004
Valter Crescenzi; Giansalvatore Mecca
Information extraction from websites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in the presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised, that is, fully automatic, wrapper generation algorithms. However, due to some unrealistic assumptions on the input, these algorithms are not practically applicable to Web information extraction tasks. The main contributions of the article stand in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, unlike other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction purposes. A system based on the techniques described in the article has been implemented in a working prototype. We present some experimental results on known websites, and discuss opportunities and limitations of the proposed approach.
Information Systems | 1998
Valter Crescenzi; Giansalvatore Mecca
Extending database-like techniques to semi-structured and Web data sources is becoming a prominent research field. These data sources are essentially collections of textual documents. Hence, in this context, one of the key tasks consists in wrapping documents to build database abstractions of their content that can be manipulated using high-level tools. However, the degree of heterogeneity and the lack of structure make standard grammar parsers excessively rigid, and often unable to capture the richness of constructs in these documents. This paper presents Minerva, a formalism for writing wrappers around Web sites and other textual data sources. The key feature of Minerva is the attempt to couple the benefits of a declarative, grammar-based approach with the flexibility of procedural programming. This is done by enriching regular grammars with an explicit exception-handling mechanism. Contributions of the paper stand in the definition of the formalism, and in the description of its implementation, which relies on a number of ad-hoc techniques for parsing documents, among them an extension of the traditional LL(1) policy based on dynamic tokenization.
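The combination of a declarative rule with a procedural exception handler can be illustrated with a small Python analogue; Minerva's own formalism is grammar-based, and the field name and recovery logic below are invented for the example.

    import re

    def parse_price(fragment):
        """Grammar-like production: expect a well-formed price field."""
        m = re.fullmatch(r"\$(\d+\.\d{2})", fragment.strip())
        if m is None:
            raise ValueError(f"production failed on: {fragment!r}")
        return float(m.group(1))

    def parse_price_with_handler(fragment):
        """Declarative rule plus a procedural exception handler,
        in the spirit of Minerva's grammar-with-exceptions idea."""
        try:
            return parse_price(fragment)
        except ValueError:
            # Handler: fall back to a looser, procedural recovery.
            digits = re.search(r"(\d+(?:\.\d+)?)", fragment)
            return float(digits.group(1)) if digits else None

    print(parse_price_with_handler("$19.99"))      # regular case: 19.99
    print(parse_price_with_handler("ca. 19 USD"))  # handler fires: 19.0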
International Conference on Management of Data | 1998
Giansalvatore Mecca; Paolo Atzeni; A. Masci; G. Sindoni; Paolo Merialdo
The paper describes the ARANEUS Web-Base Management System [1, 5, 4, 6], a system developed at Università di Roma Tre, which represents a proposal towards the definition of a new kind of data repository, designed to manage Web data in the database style. We call a Web-base a collection of data of heterogeneous nature, and more specifically: (i) highly structured data, such as the ones typically stored in relational or object-oriented database systems; (ii) semistructured data, in the Web style. We can simplify by saying that it incorporates both databases and Web sites. A Web-Base Management System (WBMS) is a system for managing such Web-bases. More specifically, it should provide functionalities for both database and Web site management. It is natural to think of it as an evolution of ordinary DBMSs, in the sense that it will play in future generation Web-based Information Systems the same role as the one played by database systems today. Three natural requirements arise here: first, the system should be fully distributed: databases and Web sites may be either local or remote resources; second, it should be platform-independent, i.e., it should not be tied to a specific platform or software environment, coherently with the nature of the Internet; finally, all system functionalities should be accessible through a hypertextual user interface, based on HTML-like markup languages, i.e., the system should be a site itself. We can list three main classes of applications that a WBMS should support, in the database spirit: (1) queries: the system should allow access to data in a Web-base in a declarative, high-level fashion; this means that not only structured data can be accessed and queried, but also semistructured data in Web sites; (2) views: data coming from heterogeneous sources should possibly be reorganized and integrated in new Web-bases, in order to provide different views over the original data, to be navigated and queried by end-users; (3) updates: the process of maintaining Web sites is a delicate one which should be carefully supported.
Data and Knowledge Engineering | 2007
Giansalvatore Mecca; Salvatore Raunich; Alessandro Pappalardo
We develop a new algorithm for clustering search results. Unlike many other clustering systems that have been recently proposed as a post-processing step for Web search engines, our system is not based on phrase analysis inside snippets, but instead uses latent semantic indexing on the whole document content. A main contribution of the paper is a novel strategy, called dynamic SVD clustering, to discover the optimal number of singular values to be used for clustering purposes. Moreover, the algorithm is such that the SVD computation step has good performance in practice, which makes it feasible to perform clustering when term vectors are available. We show that the algorithm has very good classification performance, and that it can be effectively used to cluster results of a search engine to make them easier to browse by users. The algorithm has been integrated into the Noodles search engine, a tool for searching and clustering Web and desktop documents.
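As a rough illustration of the latent-semantic-indexing pipeline described above, the following Python sketch builds a term-document matrix, truncates its SVD, and clusters documents in the reduced space. The paper's dynamic strategy for choosing the number of singular values is not reproduced here: k is fixed for simplicity, and the toy corpus is invented.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "database repair chase constraints",
        "database constraints cleaning repair",
        "web wrapper extraction html pages",
        "html wrapper induction web extraction",
    ]

    # Term-document matrix (documents as rows, terms as columns).
    X = TfidfVectorizer().fit_transform(docs).toarray()

    # Latent semantic indexing: truncated SVD of the matrix.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2  # the paper selects this dynamically; fixed here for illustration
    X_lsi = U[:, :k] * s[:k]  # documents mapped into the k-dim latent space

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lsi)
    print(labels)  # e.g. [0 0 1 1]: database docs vs. wrapper docs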
International Conference on Management of Data | 1997
Paolo Atzeni; Giansalvatore Mecca; Paolo Merialdo
Database systems offer efficient and reliable technology to query structured data. However, because of the explosion of the World Wide Web [11], an increasing amount of information is stored in repositories organized according to less rigid structures, usually as hypertextual documents, and data access is based on browsing and information retrieval techniques. Since browsing and search engines present important limitations [8], several query languages [19, 20, 23] for the Web have been recently proposed. These approaches are mainly based on a loose notion of structure, and tend to see the Web as a huge collection of unstructured objects, organized as a graph. Clearly, traditional database techniques are of little use in this field, and new techniques need to be developed. In this paper, we present the approach to the management of Web data taken in the ARANEUS project carried out by the database group at Università di Roma Tre. Our approach is based on a generalization of the notion of view to the Web framework. In fact, in traditional databases, views represent an essential tool for restructuring and integrating data to be presented to the user. Since the Web is becoming a major computing platform and a uniform interface for sharing data, we believe that also in this field a sophisticated view mechanism is needed, with novel features due to the semi-structured nature of the Web. First, in this context, restructuring and presenting data under different perspectives requires the generation of derived Web hypertexts, in order to re-organize and re-use portions of the Web. To do this, data from existing Web sites must be extracted, and then queried and integrated in order to build new hypertexts, i.e., hypertextual views over the original sites; these manipulations can be better attained in a more structured framework, in which traditional database technology can be leveraged to analyze and correlate information. Therefore, there seem to be different view levels in this framework: (i) at the first level, data are extracted from the sites of interest and given a database structure, which represents a first structured view over the original semi-structured data; (ii) then, further database views can be built by means of reorganizations and integrations based on traditional database techniques; (iii) finally, a derived hypertext can be generated, offering an alternative or integrated hypertextual view over the original sites. In the process, data go from a loosely structured organization (the Web pages) to a very structured one (the database), and then again to Web structures.
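A toy end-to-end sketch of the three view levels, with invented data and page markup: extraction into a database structure, a database-style view over it, and a derived hypertext generated from the view.

    import re

    # Level (i): extract data from a "page" into a database structure.
    page = "<li>RoadRunner|2002</li><li>EDITOR|1997</li>"
    rows = [{"title": t, "year": int(y)}
            for t, y in re.findall(r"<li>(.*?)\|(\d{4})</li>", page)]

    # Level (ii): a database view built with ordinary query operations.
    recent = sorted((r for r in rows if r["year"] >= 2000),
                    key=lambda r: r["year"])

    # Level (iii): generate a derived hypertext over the restructured data.
    html = "<ul>" + "".join(
        f"<li><a href='/{r['title']}'>{r['title']} ({r['year']})</a></li>"
        for r in recent) + "</ul>"
    print(html)
    # <ul><li><a href='/RoadRunner'>RoadRunner (2002)</a></li></ul>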
Very Large Data Bases | 2013
Floris Geerts; Giansalvatore Mecca; Paolo Papotti; Donatello Santoro
Data-cleaning (or data-repairing) is considered a crucial problem in many database-related tasks. It consists in making a database consistent with respect to a set of given constraints. In recent years, repairing methods have been proposed for several classes of constraints. However, these methods rely on ad hoc decisions and tend to hard-code the strategy to repair conflicting values. As a consequence, there is currently no general algorithm to solve database repairing problems that involve different kinds of constraints and different strategies to select preferred values. In this paper we develop a uniform framework to solve this problem. We propose a new semantics for repairs, and a chase-based algorithm to compute minimal solutions. We implemented the framework in a DBMS-based prototype, and we report experimental results that confirm its good scalability and superior quality in computing repairs.
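To give a flavor of a chase-style repair step, the following Python sketch enforces a single functional dependency and resolves conflicts with a majority-vote preference strategy. This is an invented simplification for illustration, not the repair semantics defined in the paper.

    from collections import Counter, defaultdict

    # Toy chase step for the functional dependency  zip -> city:
    # tuples agreeing on zip must agree on city.
    tuples = [
        {"zip": "85100", "city": "Potenza"},
        {"zip": "85100", "city": "Potenza"},
        {"zip": "85100", "city": "Ptenza"},   # conflicting value
        {"zip": "00146", "city": "Roma"},
    ]

    groups = defaultdict(list)
    for t in tuples:
        groups[t["zip"]].append(t)

    for zip_code, group in groups.items():
        cities = Counter(t["city"] for t in group)
        if len(cities) > 1:                    # FD violated: the chase fires
            preferred, _ = cities.most_common(1)[0]
            for t in group:
                t["city"] = preferred          # repair toward preferred value

    print(tuples)  # all 85100 tuples now read "Potenza"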
International Conference on Management of Data | 2002
Valter Crescenzi; Giansalvatore Mecca; Paolo Merialdo
Data extraction from HTML pages is performed by software modules, usually called wrappers. Roughly speaking, a wrapper identifies and extracts relevant pieces of text inside a web page, and reorganizes them in a more structured format. In the literature there are a number of systems to (semi-)automatically generate wrappers for HTML pages [1]. We have recently investigated original approaches that aim at pushing further the level of automation of the wrapper generation process. Our main intuition is that, in a data-intensive web site, pages can be classified in a small number of classes, such that pages belonging to the same class share a rather tight structure. Based on this observation, we have studied a novel technique, which we call the matching technique [2], that automatically generates a common wrapper by exploiting similarities and differences among pages of the same class. In addition, in order to deal with the complexity and the heterogeneities of real-life web sites, we have also studied several complementary techniques that greatly enhance the effectiveness of matching. Our demonstration presents RoadRunner, our prototype that implements matching and its companion techniques. We have conducted several experiments on pages from real-life web sites; these experiments have shown the effectiveness of the approach, as well as the efficiency of the system [2]. The matching technique for wrapper inference [2] is based on an iterative process; at every step, matching works on two objects at a time: (i) an input page, which is represented as a list of tokens (each token is either a tag or a text field), and (ii) a wrapper, expressed as a regular expression. The process starts by taking one input page as an initial version of the wrapper; then, the wrapper is matched against the sample and progressively refined by trying to solve mismatches: a mismatch happens when some token in the sample does not comply with the grammar specified by the wrapper. Mismatches can be solved by generalizing the wrapper. The process succeeds if a common wrapper can be generated by solving all mismatches encountered.
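The generalization step at the heart of matching can be sketched as follows, on invented toy pages. The actual technique works on regular expressions and also handles optionals and iterators (so pages of different lengths); this sketch covers only the string-mismatch case on equal-length token lists.

    def match(wrapper, page):
        """Generalize the wrapper against a sample page: positions where
        tokens differ become wildcards (#PCDATA). Sketches only the
        string-mismatch case of the matching technique."""
        assert len(wrapper) == len(page)  # real matching also aligns lengths
        return [w if w == p else "#PCDATA" for w, p in zip(wrapper, page)]

    # Tokenized pages: each token is either a tag or a text field.
    page1 = ["<html>", "<b>", "RoadRunner", "</b>", "2002", "</html>"]
    page2 = ["<html>", "<b>", "EDITOR", "</b>", "1997", "</html>"]

    wrapper = page1                    # first page is the initial wrapper
    wrapper = match(wrapper, page2)    # refine it against the next sample
    print(wrapper)
    # ['<html>', '<b>', '#PCDATA', '</b>', '#PCDATA', '</html>']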
ACM Transactions on Internet Technology | 2003
Paolo Merialdo; Paolo Atzeni; Giansalvatore Mecca
Data-intensive Web sites are large sites based on a back-end database, with a fairly complex hypertext structure. The paper develops two main contributions: (a) a specific design methodology for data-intensive Web sites, composed of a set of steps and design transformations that lead from a conceptual specification of the domain of interest to the actual implementation of the site; (b) a tool called Homer, conceived to support the site design and implementation process by allowing the designer to move through the various steps of the methodology, and to automate the generation of the code needed to implement the actual site. Our approach to site design is based on a clear separation between several design activities, namely database design, hypertext design, and presentation design. All these activities are carried out using high-level models, all subsumed by an extension of the nested relational model; the mappings between the models can be nicely expressed using an extended relational algebra for nested structures. Based on the design artifacts produced during the design process, and on their representation in the algebraic framework, Homer is able to generate all the code needed for the actual generation of the site, in a completely automatic way.
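As a small illustration of the nested relational model underlying these design models, here is a Python sketch of the classic nest operator, which groups a flat relation on a key and folds the remaining attribute into a set-valued one. The data are invented, and the paper's extended algebra is considerably richer than this.

    from itertools import groupby
    from operator import itemgetter

    # Flat relation: (page, section) pairs (invented example).
    flat = [
        ("Home", "News"), ("Home", "About"),
        ("Papers", "2003"), ("Papers", "2004"),
    ]

    def nest(relation, key_idx):
        """NEST of the nested relational algebra: group tuples on the key
        attribute and fold the other attribute into a set-valued one."""
        rel = sorted(relation, key=itemgetter(key_idx))
        return {k: [t[1 - key_idx] for t in grp]
                for k, grp in groupby(rel, key=itemgetter(key_idx))}

    print(nest(flat, 0))
    # {'Home': ['News', 'About'], 'Papers': ['2003', '2004']}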