Giansalvatore Mecca
University of Basilicata
Publications
Featured research published by Giansalvatore Mecca.
Symposium on Principles of Database Systems | 1997
Paolo Atzeni; Giansalvatore Mecca
The paper develops EDITOR, a language for manipulating semi-structured documents, such as the ones typically available on the Web. EDITOR programs allow one to search and restructure a document. They are based on two simple ideas, taken from text editors: “search” instructions are used to select regions of interest in a document, and “cut & paste” to restructure them. We study the expressive power and the complexity of these programs. We show that they are computationally complete, in the sense that any computable document restructuring can be expressed in EDITOR. We also study the complexity of a safe subclass of programs, showing that it captures exactly the class of polynomial-time restructurings. The language has been implemented in Java, and is used in the ARANEUS project to build database views over Web sites.
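To make the search and cut-and-paste idea concrete, here is a minimal sketch in Python; this is not EDITOR's actual syntax, and the document fragment and pattern are invented for illustration. A pattern selects regions of interest, and the selected pieces are pasted into a new layout.

    import re

    # Toy "document": a semi-structured fragment, Web-style (invented).
    doc = "<li>Atzeni, P. - Roma Tre</li>\n<li>Mecca, G. - Basilicata</li>"

    # "Search": select regions of interest with a pattern.
    regions = re.findall(r"<li>(.*?) - (.*?)</li>", doc)

    # "Cut & paste": restructure the selected regions into a new document.
    restructured = "\n".join(f"{affil}: {name}" for name, affil in regions)
    print(restructured)
    # Roma Tre: Atzeni, P.
    # Basilicata: Mecca, G.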
Extending Database Technology | 1998
Paolo Atzeni; Giansalvatore Mecca; Paolo Merialdo
A methodology for designing and maintaining large Web sites is introduced. It is especially useful when the data to be published in the site are managed using a DBMS. The design process is composed of two intertwined activities: database design and hypertext design. Each of these is further divided into a conceptual phase and a logical phase, based on specific data models proposed in our project. The methodology strongly supports site maintenance: in fact, the various models provide a concise description of the site structure; they allow one to reason about the overall organization of pages in the site and possibly to restructure it.
Journal of the ACM | 2004
Valter Crescenzi; Giansalvatore Mecca
Information extraction from websites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in the presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised, that is, fully automatic, wrapper generation algorithms. However, due to some unrealistic assumptions on the input, these algorithms are not practically applicable to Web information extraction tasks. The main contributions of the article stand in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, unlike other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction purposes. A system based on the techniques described in the article has been implemented in a working prototype. We present some experimental results on known websites, and discuss opportunities and limitations of the proposed approach.
Information Systems | 1998
Valter Crescenzi; Giansalvatore Mecca
Extending database-like techniques to semi-structured and Web data sources is becoming a prominent research field. These data sources are essentially collections of textual documents. Hence, in this context, one of the key tasks consists in wrapping documents to build database abstractions of their content that can be manipulated using high-level tools. However, the degree of heterogeneity and the lack of structure make standard grammar parsers excessively rigid, and often unable to capture the richness of constructs in these documents. This paper presents Minerva, a formalism for writing wrappers around Web sites and other textual data sources. The key feature of Minerva is the attempt to couple the benefits of a declarative, grammar-based approach with the flexibility of procedural programming. This is done by enriching regular grammars with an explicit exception-handling mechanism. Contributions of the paper stand in the definition of the formalism, and in the description of its implementation, which relies on a number of ad-hoc techniques for parsing documents, among them an extension of the traditional LL(1) policy based on dynamic tokenization.
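The combination of a declarative rule with a procedural exception handler can be illustrated with a small Python analogue; Minerva's own formalism is grammar-based, and the field name and recovery logic below are invented for the example.

    import re

    def parse_price(fragment):
        """Grammar-like production: expect a well-formed price field."""
        m = re.fullmatch(r"\$(\d+\.\d{2})", fragment.strip())
        if m is None:
            raise ValueError(f"production failed on: {fragment!r}")
        return float(m.group(1))

    def parse_price_with_handler(fragment):
        """Declarative rule plus a procedural exception handler,
        in the spirit of Minerva's grammar-with-exceptions idea."""
        try:
            return parse_price(fragment)
        except ValueError:
            # Handler: fall back to a looser, procedural recovery.
            digits = re.search(r"(\d+(?:\.\d+)?)", fragment)
            return float(digits.group(1)) if digits else None

    print(parse_price_with_handler("$19.99"))      # regular case: 19.99
    print(parse_price_with_handler("ca. 19 USD"))  # handler fires: 19.0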
International Conference on Management of Data | 1998
Giansalvatore Mecca; Paolo Atzeni; A. Masci; G. Sindoni; Paolo Merialdo
The paper describes the ARANEUS Web-Base Management System [1, 5, 4, 6], a system developed at Università di Roma Tre, which represents a proposal towards the definition of a new kind of data repository, designed to manage Web data in the database style. We call a Web-base a collection of data of heterogeneous nature, and more specifically: (i) highly structured data, such as the ones typically stored in relational or object-oriented database systems; (ii) semistructured data, in the Web style. We can simplify by saying that it incorporates both databases and Web sites. A Web-Base Management System (WBMS) is a system for managing such Web-bases. More specifically, it should provide functionalities for both database and Web site management. It is natural to think of it as an evolution of ordinary DBMSs, in the sense that it will play in future generation Web-based Information Systems the same role as the one played by database systems today. Three natural requirements arise here: first, the system should be fully distributed: databases and Web sites may be either local or remote resources; second, it should be platform-independent, i.e., it should not be tied to a specific platform or software environment, coherently with the nature of the Internet; finally, all system functionalities should be accessible through a hypertextual user interface, based on HTML-like markup languages, i.e., the system should be a site itself. We can list three main classes of applications that a WBMS should support, in the database spirit: (1) queries: the system should allow access to data in a Web-base in a declarative, high-level fashion; this means that not only structured data can be accessed and queried, but also semistructured data in Web sites; (2) views: data coming from heterogeneous sources should possibly be reorganized and integrated in new Web-bases, in order to provide different views over the original data, to be navigated and queried by end-users; (3) updates: the process of maintaining Web sites is a delicate one which should be carefully supported.
Data and Knowledge Engineering | 2007
Giansalvatore Mecca; Salvatore Raunich; Alessandro Pappalardo
We develop a new algorithm for clustering search results. Unlike many other clustering systems that have been recently proposed as a post-processing step for Web search engines, our system is not based on phrase analysis inside snippets, but instead uses latent semantic indexing on the whole document content. A main contribution of the paper is a novel strategy, called dynamic SVD clustering, to discover the optimal number of singular values to be used for clustering purposes. Moreover, the algorithm is such that the SVD computation step has good performance in practice, which makes it feasible to perform clustering when term vectors are available. We show that the algorithm has very good classification performance, and that it can be effectively used to cluster results of a search engine to make them easier to browse by users. The algorithm has been integrated into the Noodles search engine, a tool for searching and clustering Web and desktop documents.
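As a rough illustration of the latent-semantic-indexing pipeline described above, the following Python sketch builds a term-document matrix, truncates its SVD, and clusters documents in the reduced space. The paper's dynamic strategy for choosing the number of singular values is not reproduced here: k is fixed for simplicity, and the toy corpus is invented.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "database repair chase constraints",
        "database constraints cleaning repair",
        "web wrapper extraction html pages",
        "html wrapper induction web extraction",
    ]

    # Term-document matrix (documents as rows, terms as columns).
    X = TfidfVectorizer().fit_transform(docs).toarray()

    # Latent semantic indexing: truncated SVD of the matrix.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2  # the paper selects this dynamically; fixed here for illustration
    X_lsi = U[:, :k] * s[:k]  # documents mapped into the k-dim latent space

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lsi)
    print(labels)  # e.g. [0 0 1 1]: database docs vs. wrapper docs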
International Conference on Management of Data | 1997
Paolo Atzeni; Giansalvatore Mecca; Paolo Merialdo
Database systems offer efficient and reliable technology to query structured data. However, because of the explosion of the World Wide Web [11], an increasing amount of information is stored in repositories organized according to less rigid structures, usually as hypertextual documents, and data access is based on browsing and information retrieval techniques. Since browsing and search engines present important limitations [8], several query languages [19, 20, 23] for the Web have been recently proposed. These approaches are mainly based on a loose notion of structure, and tend to see the Web as a huge collection of unstructured objects, organized as a graph. Clearly, traditional database techniques are of little use in this field, and new techniques need to be developed. In this paper, we present the approach to the management of Web data taken in the ARANEUS project carried out by the database group at Università di Roma Tre. Our approach is based on a generalization of the notion of view to the Web framework. In fact, in traditional databases, views represent an essential tool for restructuring and integrating data to be presented to the user. Since the Web is becoming a major computing platform and a uniform interface for sharing data, we believe that also in this field a sophisticated view mechanism is needed, with novel features due to the semi-structured nature of the Web. First, in this context, restructuring and presenting data under different perspectives requires the generation of derived Web hypertexts, in order to re-organize and re-use portions of the Web. To do this, data from existing Web sites must be extracted, and then queried and integrated in order to build new hypertexts, i.e., hypertextual views over the original sites; these manipulations can be better attained in a more structured framework, in which traditional database technology can be leveraged to analyze and correlate information. Therefore, there seem to be different view levels in this framework: (i) at the first level, data are extracted from the sites of interest and given a database structure, which represents a first structured view over the original semi-structured data; (ii) then, further database views can be built by means of reorganizations and integrations based on traditional database techniques; (iii) finally, a derived hypertext can be generated, offering an alternative or integrated hypertextual view over the original sites. In the process, data go from a loosely structured organization (the Web pages) to a very structured one (the database), and then again to Web structures.
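A toy end-to-end sketch of the three view levels, with invented data and page markup: extraction into a database structure, a database-style view over it, and a derived hypertext generated from the view.

    import re

    # Level (i): extract data from a "page" into a database structure.
    page = "<li>RoadRunner|2002</li><li>EDITOR|1997</li>"
    rows = [{"title": t, "year": int(y)}
            for t, y in re.findall(r"<li>(.*?)\|(\d{4})</li>", page)]

    # Level (ii): a database view built with ordinary query operations.
    recent = sorted((r for r in rows if r["year"] >= 2000),
                    key=lambda r: r["year"])

    # Level (iii): generate a derived hypertext over the restructured data.
    html = "<ul>" + "".join(
        f"<li><a href='/{r['title']}'>{r['title']} ({r['year']})</a></li>"
        for r in recent) + "</ul>"
    print(html)
    # <ul><li><a href='/RoadRunner'>RoadRunner (2002)</a></li></ul>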
Very Large Data Bases | 2013
Floris Geerts; Giansalvatore Mecca; Paolo Papotti; Donatello Santoro
Data-cleaning (or data-repairing) is considered a crucial problem in many database-related tasks. It consists in making a database consistent with respect to a set of given constraints. In recent years, repairing methods have been proposed for several classes of constraints. However, these methods rely on ad hoc decisions and tend to hard-code the strategy to repair conflicting values. As a consequence, there is currently no general algorithm to solve database repairing problems that involve different kinds of constraints and different strategies to select preferred values. In this paper we develop a uniform framework to solve this problem. We propose a new semantics for repairs, and a chase-based algorithm to compute minimal solutions. We implemented the framework in a DBMS-based prototype, and we report experimental results that confirm its good scalability and superior quality in computing repairs.
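To give a flavor of a chase-style repair step, the following Python sketch enforces a single functional dependency and resolves conflicts with a majority-vote preference strategy. This is an invented simplification for illustration, not the repair semantics defined in the paper.

    from collections import Counter, defaultdict

    # Toy chase step for the functional dependency  zip -> city:
    # tuples agreeing on zip must agree on city.
    tuples = [
        {"zip": "85100", "city": "Potenza"},
        {"zip": "85100", "city": "Potenza"},
        {"zip": "85100", "city": "Ptenza"},   # conflicting value
        {"zip": "00146", "city": "Roma"},
    ]

    groups = defaultdict(list)
    for t in tuples:
        groups[t["zip"]].append(t)

    for zip_code, group in groups.items():
        cities = Counter(t["city"] for t in group)
        if len(cities) > 1:                    # FD violated: the chase fires
            preferred, _ = cities.most_common(1)[0]
            for t in group:
                t["city"] = preferred          # repair toward preferred value

    print(tuples)  # all 85100 tuples now read "Potenza"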
International Conference on Management of Data | 2002
Valter Crescenzi; Giansalvatore Mecca; Paolo Merialdo
Data extraction from HTML pages is performed by software modules, usually called wrappers. Roughly speaking, a wrapper identifies and extracts relevant pieces of text inside a web page, and reorganizes them in a more structured format. In the literature there are a number of systems to (semi-)automatically generate wrappers for HTML pages [1]. We have recently investigated original approaches that aim at pushing further the level of automation of the wrapper generation process. Our main intuition is that, in a data-intensive web site, pages can be classified in a small number of classes, such that pages belonging to the same class share a rather tight structure. Based on this observation, we have studied a novel technique, which we call the matching technique [2], that automatically generates a common wrapper by exploiting similarities and differences among pages of the same class. In addition, in order to deal with the complexity and the heterogeneities of real-life web sites, we have also studied several complementary techniques that greatly enhance the effectiveness of matching. Our demonstration presents RoadRunner, our prototype that implements matching and its companion techniques. We have conducted several experiments on pages from real-life web sites; these experiments have shown the effectiveness of the approach, as well as the efficiency of the system [2]. The matching technique for wrapper inference [2] is based on an iterative process; at every step, matching works on two objects at a time: (i) an input page, which is represented as a list of tokens (each token is either a tag or a text field), and (ii) a wrapper, expressed as a regular expression. The process starts by taking one input page as an initial version of the wrapper; then, the wrapper is matched against the sample and progressively refined by trying to solve mismatches: a mismatch happens when some token in the sample does not comply with the grammar specified by the wrapper. Mismatches can be solved by generalizing the wrapper. The process succeeds if a common wrapper can be generated by solving all mismatches encountered.
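The generalization step at the heart of matching can be sketched as follows, on invented toy pages. The actual technique works on regular expressions and also handles optionals and iterators (so pages of different lengths); this sketch covers only the string-mismatch case on equal-length token lists.

    def match(wrapper, page):
        """Generalize the wrapper against a sample page: positions where
        tokens differ become wildcards (#PCDATA). Sketches only the
        string-mismatch case of the matching technique."""
        assert len(wrapper) == len(page)  # real matching also aligns lengths
        return [w if w == p else "#PCDATA" for w, p in zip(wrapper, page)]

    # Tokenized pages: each token is either a tag or a text field.
    page1 = ["<html>", "<b>", "RoadRunner", "</b>", "2002", "</html>"]
    page2 = ["<html>", "<b>", "EDITOR", "</b>", "1997", "</html>"]

    wrapper = page1                    # first page is the initial wrapper
    wrapper = match(wrapper, page2)    # refine it against the next sample
    print(wrapper)
    # ['<html>', '<b>', '#PCDATA', '</b>', '#PCDATA', '</html>']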
ACM Transactions on Internet Technology | 2003
Paolo Merialdo; Paolo Atzeni; Giansalvatore Mecca
Data-intensive Web sites are large sites based on a back-end database, with a fairly complex hypertext structure. The paper develops two main contributions: (a) a specific design methodology for data-intensive Web sites, composed of a set of steps and design transformations that lead from a conceptual specification of the domain of interest to the actual implementation of the site; (b) a tool called Homer, conceived to support the site design and implementation process by allowing the designer to move through the various steps of the methodology, and to automate the generation of the code needed to implement the actual site. Our approach to site design is based on a clear separation between several design activities, namely database design, hypertext design, and presentation design. All these activities are carried out using high-level models, all subsumed by an extension of the nested relational model; the mappings between the models can be nicely expressed using an extended relational algebra for nested structures. Based on the design artifacts produced during the design process, and on their representation in the algebraic framework, Homer is able to generate all the code needed for the actual generation of the site, in a completely automatic way.
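As a small illustration of the nested relational model underlying these design models, here is a Python sketch of the classic nest operator, which groups a flat relation on a key and folds the remaining attribute into a set-valued one. The data are invented, and the paper's extended algebra is considerably richer than this.

    from itertools import groupby
    from operator import itemgetter

    # Flat relation: (page, section) pairs (invented example).
    flat = [
        ("Home", "News"), ("Home", "About"),
        ("Papers", "2003"), ("Papers", "2004"),
    ]

    def nest(relation, key_idx):
        """NEST of the nested relational algebra: group tuples on the key
        attribute and fold the other attribute into a set-valued one."""
        rel = sorted(relation, key=itemgetter(key_idx))
        return {k: [t[1 - key_idx] for t in grp]
                for k, grp in groupby(rel, key=itemgetter(key_idx))}

    print(nest(flat, 0))
    # {'Home': ['News', 'About'], 'Papers': ['2003', '2004']}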