Network

Latest external collaborations at the country level.

Hotspot

Research topics where Paolo Merialdo is active.

Publication


Featured research published by Paolo Merialdo.


international conference on extending database technology | 1998

Design and Maintenance of Data-Intensive Web Sites

Paolo Atzeni; Giansalvatore Mecca; Paolo Merialdo

A methodology for designing and maintaining large Web sites is introduced. It is especially useful when the data to be published in the site are managed using a DBMS. The design process is composed of two intertwined activities: database design and hypertext design. Each of these is further divided into a conceptual phase and a logical phase, based on specific data models proposed in our project. The methodology strongly supports site maintenance: in fact, the various models provide a concise description of the site structure; they make it possible to reason about the overall organization of pages in the site and possibly to restructure it.


international conference on management of data | 1998

The Araneus Web-Base Management System

Giansalvatore Mecca; Paolo Atzeni; A. Masci; G. Sindoni; Paolo Merialdo

The paper describes the ARANEUS Web-Base Management System [1, 5, 4, 6], a system developed at Università di Roma Tre, which represents a proposal towards the definition of a new kind of data repository, designed to manage Web data in the database style. We call a Web-base a collection of data of heterogeneous nature, and more specifically: (i) highly structured data, such as the ones typically stored in relational or object-oriented database systems; (ii) semistructured data, in the Web style. We can simplify by saying that it incorporates both databases and Web sites. A Web-Base Management System (WBMS) is a system for managing such Web-bases. More specifically, it should provide functionalities for both database and Web site management. It is natural to think of it as an evolution of ordinary DBMSs, in the sense that it will play in future generation Web-based information systems the same role as the one played by database systems today. Three natural requirements arise here: first, the system should be fully distributed: databases and Web sites may be either local or remote resources; second, it should be platform-independent, i.e., it should not be tied to a specific platform or software environment, coherently with the nature of the Internet; finally, all system functionalities should be accessible through a hypertextual user interface, based on HTML-like markup languages, i.e., the system should be a site itself. We can list three main classes of applications that a WBMS should support, in the database spirit: (1) queries: the system should allow access to data in a Web-base in a declarative, high-level fashion; this means that not only structured data can be accessed and queried, but also semistructured data in Web sites; (2) views: data coming from heterogeneous sources should be reorganized and integrated in new Web-bases, in order to provide different views over the original data, to be navigated and queried by end users; (3) updates: the process of maintaining Web sites is a delicate one which should be carefully supported.
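
To make the web-base notion concrete, here is a toy Python sketch of a single repository holding structured relations and semistructured pages side by side, with declarative access to the former and pattern-based access to the latter. This is not part of the Araneus system; the class and all data are invented for illustration.

```python
import re

class WebBase:
    """A toy repository holding structured and semistructured data together."""
    def __init__(self):
        self.relations = {}   # name -> list of tuples (as dicts)
        self.pages = {}       # URL  -> raw HTML text

    def query(self, relation, **conditions):
        """Declarative selection over the structured side."""
        return [t for t in self.relations.get(relation, [])
                if all(t.get(k) == v for k, v in conditions.items())]

    def extract(self, url, pattern):
        """Pattern-based access to the semistructured side."""
        return re.findall(pattern, self.pages.get(url, ""))

wb = WebBase()
wb.relations["authors"] = [{"name": "Atzeni", "group": "DB"},
                           {"name": "Mecca", "group": "DB"}]
wb.pages["http://example.org/pubs"] = "<li>Araneus</li><li>RoadRunner</li>"

print(wb.query("authors", group="DB"))                           # structured query
print(wb.extract("http://example.org/pubs", r"<li>(.+?)</li>"))  # page extraction
```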


international conference on management of data | 1997

Semistructured and structured data in the Web: going back and forth

Paolo Atzeni; Giansalvatore Mecca; Paolo Merialdo

Database systems offer efficient and reliable technology to query structured data. However, because of the explosion of the World Wide Web [11], an increasing amount of information is stored in repositories organized according to less rigid structures, usually as hypertextual documents, and data access is based on browsing and information retrieval techniques. Since browsing and search engines present important limitations [8], several query languages [19, 20, 23] for the Web have been recently proposed. These approaches are mainly based on a loose notion of structure, and tend to see the Web as a huge collection of unstructured objects, organized as a graph. Clearly, traditional database techniques are of little use in this field, and new techniques need to be developed. In this paper, we present the approach to the management of Web data adopted in the ARANEUS project carried out by the database group at Università di Roma Tre. Our approach is based on a generalization of the notion of view to the Web framework. In fact, in traditional databases, views represent an essential tool for restructuring and integrating data to be presented to the user. Since the Web is becoming a major computing platform and a uniform interface for sharing data, we believe that also in this field a sophisticated view mechanism is needed, with novel features due to the semistructured nature of the Web. First, in this context, restructuring and presenting data under different perspectives requires the generation of derived Web hypertexts, in order to re-organize and re-use portions of the Web. To do this, data from existing Web sites must be extracted, and then queried and integrated in order to build new hypertexts, i.e., hypertextual views over the original sites; these manipulations can be better attained in a more structured framework, in which traditional database technology can be leveraged to analyze and correlate information. Therefore, there seem to be different view levels in this framework: (i) at the first level, data are extracted from the sites of interest and given a database structure, which represents a first structured view over the original semistructured data; (ii) then, further database views can be built by means of reorganizations and integrations based on traditional database techniques; (iii) finally, a derived hypertext can be generated, offering an alternative or integrated hypertextual view over the original sites. In the process, data go from a loosely structured organization (the Web pages) to a very structured one (the database), and then again to Web structures.
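
The three view levels can be walked through in a compressed Python sketch on invented data: semistructured pages are first given a database structure, then reorganized by a database-style view, and finally re-published as a derived hypertext. This is only a schematic illustration of the pipeline the paper describes, not the ARANEUS machinery.

```python
import re

# (i) first view level: extract tuples from loosely structured pages
pages = [
    "<li><b>Atzeni</b> - <i>Roma Tre</i></li>",
    "<li><b>Mecca</b> - <i>Basilicata</i></li>",
]
records = [re.search(r"<b>(.+?)</b> - <i>(.+?)</i>", p).groups() for p in pages]

# (ii) second view level: a database-style reorganization (group by affiliation)
by_affiliation = {}
for name, affiliation in records:
    by_affiliation.setdefault(affiliation, []).append(name)

# (iii) third view level: generate a derived hypertext over the integrated data
html = "".join(
    f"<h2>{aff}</h2><ul>" + "".join(f"<li>{n}</li>" for n in names) + "</ul>"
    for aff, names in by_affiliation.items()
)
print(html)
```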


international conference on management of data | 2002

RoadRunner: automatic data extraction from data-intensive web sites

Valter Crescenzi; Giansalvatore Mecca; Paolo Merialdo

Data extraction from HTML pages is performed by software modules, usually called wrappers. Roughly speaking, a wrapper identifies and extracts relevant pieces of text inside a web page, and reorganizes them in a more structured format. In the literature there are a number of systems to (semi-)automatically generate wrappers for HTML pages [1]. We have recently investigated original approaches that aim at pushing further the level of automation of the wrapper generation process. Our main intuition is that, in a data-intensive web site, pages can be classified in a small number of classes, such that pages belonging to the same class share a rather tight structure. Based on this observation, we have studied a novel technique, which we call the matching technique [2], that automatically generates a common wrapper by exploiting similarities and differences among pages of the same class. In addition, in order to deal with the complexity and the heterogeneities of real-life web sites, we have also studied several complementary techniques that greatly enhance the effectiveness of matching. Our demonstration presents RoadRunner, our prototype that implements matching and its companion techniques. We have conducted several experiments on pages from real-life web sites; these experiences have shown the effectiveness of the approach, as well as the efficiency of the system [2]. The matching technique for wrapper inference [2] is based on an iterative process; at every step, matching works on two objects at a time: (i) an input page, which is represented as a list of tokens (each token is either a tag or a text field), and (ii) a wrapper, expressed as a regular expression. The process starts by taking one input page as an initial version of the wrapper; then, the wrapper is matched against the sample and progressively refined by trying to solve mismatches: a mismatch happens when some token in the sample does not comply with the grammar specified by the wrapper. Mismatches can be solved by generalizing the wrapper. The process succeeds if a common wrapper can be generated by solving all mismatches encountered.
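
The flavor of the matching process can be conveyed with a minimal Python sketch on made-up pages: two pages of the same class are aligned token by token, and string mismatches are generalized into data fields. Only string mismatches are handled here; the tag mismatches that RoadRunner resolves by inferring optionals and iterators, and the full regular-expression machinery of [2], are omitted.

```python
import re

def tokenize(page):
    """Split an HTML fragment into tag and text tokens."""
    return [t for t in re.split(r"(<[^>]+>)", page) if t.strip()]

def match(wrapper, sample):
    """Refine the wrapper against one sample page (string mismatches only)."""
    assert len(wrapper) == len(sample), "tag mismatches not handled here"
    refined = []
    for w, s in zip(wrapper, sample):
        if w == s:
            refined.append(w)             # tokens agree: keep as-is
        elif not w.startswith("<") and not s.startswith("<"):
            refined.append("#PCDATA")     # string mismatch: generalize to a data field
        else:
            raise ValueError(f"tag mismatch: {w!r} vs {s!r}")
    return refined

pages = [
    "<html><b>Title:</b>Database Primer<i>by</i>Atzeni</html>",
    "<html><b>Title:</b>Web Design<i>by</i>Mecca</html>",
]
wrapper = tokenize(pages[0])              # one page is the initial wrapper
for page in pages[1:]:
    wrapper = match(wrapper, tokenize(page))
print(wrapper)
# ['<html>', '<b>', 'Title:', '</b>', '#PCDATA', '<i>', 'by', '</i>', '#PCDATA', '</html>']
```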


data and knowledge engineering | 2005

Clustering web pages based on their structure

Valter Crescenzi; Paolo Merialdo; Paolo Missier

Several techniques have been recently proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages, and transform them into a more structured format, typically in XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a target Web site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.
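
As a rough illustration of structure-based clustering (one possible choice of structural features, not the paper's actual model), the sketch below abstracts each page as the set of its root-to-tag paths and greedily groups pages whose path sets exceed a Jaccard-similarity threshold.

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collects every root-to-tag path seen while parsing a page."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def structure(page):
    collector = PathCollector()
    collector.feed(page)
    return frozenset(collector.paths)

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster(pages, threshold=0.8):
    clusters = []  # list of (representative structure, member pages)
    for page in pages:
        s = structure(page)
        for rep, members in clusters:
            if jaccard(s, rep) >= threshold:
                members.append(page)
                break
        else:
            clusters.append((s, [page]))
    return [members for _, members in clusters]

pages = [
    "<html><body><h1>A</h1><p>x</p></body></html>",
    "<html><body><h1>B</h1><p>y</p></body></html>",
    "<html><table><tr><td>1</td></tr></table></html>",
]
print([len(c) for c in cluster(pages)])  # [2, 1]: two structural classes
```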


ACM Transactions on Internet Technology | 2003

Design and development of data-intensive web sites: The Araneus approach

Paolo Merialdo; Paolo Atzeni; Giansalvatore Mecca

Data-intensive Web sites are large sites based on a back-end database, with a fairly complex hypertext structure. The paper develops two main contributions: (a) a specific design methodology for data-intensive Web sites, composed of a set of steps and design transformations that lead from a conceptual specification of the domain of interest to the actual implementation of the site; (b) a tool called Homer, conceived to support the site design and implementation process, by allowing the designer to move through the various steps of the methodology, and to automate the generation of the code needed to implement the actual site. Our approach to site design is based on a clear separation between several design activities, namely database design, hypertext design, and presentation design. All these activities are carried out using high-level models, all subsumed by an extension of the nested relational model; the mappings between the models can be nicely expressed using an extended relational algebra for nested structures. Based on the design artifacts produced during the design process, and on their representation in the algebraic framework, Homer is able to generate all the code needed for the actual generation of the site, in a completely automatic way.
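
For readers unfamiliar with the nested relational model that these high-level models extend, here is a plain-Python rendering of a hypothetical page scheme: tuples carry atomic attributes together with nested relations, and transformations such as the unnest shown below are the kind of mapping an algebra for nested structures expresses. The scheme and data are invented for illustration.

```python
# A hypothetical AuthorPage scheme in nested relational style: each tuple
# (one page) has an atomic attribute plus a nested relation of publications.
author_pages = [
    {
        "name": "Paolo Atzeni",
        "publications": [  # nested relation
            {"title": "Design and Maintenance of Data-Intensive Web Sites", "year": 1998},
            {"title": "Semistructured and structured data in the Web", "year": 1997},
        ],
    },
]

# Unnest: flatten the nested scheme into ordinary flat tuples, one of the
# basic operators of algebras for nested structures.
flat = [(page["name"], pub["title"], pub["year"])
        for page in author_pages for pub in page["publications"]]
print(flat)
```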


conference on advanced information systems engineering | 2010

Probabilistic models to reconcile complex data from inaccurate data sources

Lorenzo Blanco; Valter Crescenzi; Paolo Merialdo; Paolo Papotti

Several techniques have been developed to extract and integrate data from web sources. However, web data are inherently imprecise and uncertain. This paper addresses the issue of characterizing the uncertainty of data extracted from a number of inaccurate sources. We develop a probabilistic model to compute a probability distribution for the extracted values, and the accuracy of the sources. Our model considers the presence of sources that copy their contents from other sources, and manages the misleading consensus produced by copiers. We extend the models previously proposed in the literature by working on several attributes at a time to better leverage all the available evidence. We also report the results of several experiments on both synthetic and real-life data to show the effectiveness of the proposed approach.
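
A stripped-down fixed-point computation conveys the spirit of such accuracy-aware models, although it is not the paper's model: copier detection and the multi-attribute evidence are deliberately omitted, and the sources and facts are invented. Value scores are derived from current source accuracies, and accuracies are re-estimated from agreement with the currently most supported values.

```python
from collections import defaultdict

# observations[source] = {object: reported_value}   (hypothetical data)
observations = {
    "siteA": {"height": "828m", "floors": "163"},
    "siteB": {"height": "828m", "floors": "154"},
    "siteC": {"height": "830m", "floors": "163"},
}

accuracy = {s: 0.8 for s in observations}      # uniform prior accuracy

for _ in range(10):                            # iterate to a fixed point
    # 1. score each candidate value by the accuracy of its supporters
    scores = defaultdict(lambda: defaultdict(float))
    for source, facts in observations.items():
        for obj, value in facts.items():
            scores[obj][value] += accuracy[source]
    truth = {obj: max(vals, key=vals.get) for obj, vals in scores.items()}
    # 2. re-estimate each source's accuracy as its agreement with `truth`
    for source, facts in observations.items():
        hits = sum(truth[obj] == v for obj, v in facts.items())
        accuracy[source] = hits / len(facts)

print(truth)     # {'height': '828m', 'floors': '163'}
print(accuracy)  # siteA ends up most accurate
```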


international world wide web conferences | 2001

Data-Intensive Web Sites: Design and Maintenance

Paolo Atzeni; Paolo Merialdo; Giansalvatore Mecca

A methodology for designing and maintaining data-intensive Web sites is introduced. Building on ideas well established in the database field, the approach relies heavily on the use of models for the description of Web sites. The design process is composed of two intertwined activities: database design and hypertext design. Each of these is further divided into a conceptual phase and a logical phase, based on specific data models. The methodology strongly supports site maintenance: in fact, the various models provide a concise description of the site structure; they make it possible to reason about the overall organization of pages in the site and possibly to restructure it.


very large data bases | 2013

Extraction and integration of partially overlapping web sources

Mirko Bronzi; Valter Crescenzi; Paolo Merialdo; Paolo Papotti

We present an unsupervised approach for harvesting the data exposed by a set of structured and partially overlapping data-intensive web sources. Our proposal comes within a formal framework tackling two problems: the data extraction problem, to generate extraction rules based on the input websites, and the data integration problem, to integrate the extracted data in a unified schema. We introduce an original algorithm, WEIR, to solve the stated problems and formally prove its correctness. WEIR leverages the overlapping data among sources to make better decisions both in the data extraction (by pruning rules that do not lead to redundant information) and in the data integration (by reflecting local properties of a source over the mediated schema). Along the way, we characterize the amount of redundancy needed by our algorithm to produce a solution, and present experimental results to show the benefits of our approach with respect to existing solutions.
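
The redundancy intuition can be approximated in a few lines (a toy version with invented rules and data, not the WEIR algorithm): candidate extraction rules from different sources are matched when the values they extract overlap substantially, and rules without any redundant counterpart are pruned.

```python
from itertools import combinations

# candidate_rules[source][rule_id] = values extracted by that rule
candidate_rules = {
    "src1": {"r1": {"Ferrari", "McLaren", "Williams"}, "r2": {"red", "orange"}},
    "src2": {"r1": {"Ferrari", "Williams", "Lotus"},   "r2": {"1950", "1966"}},
}

def overlap(a, b):
    """Fraction of the smaller value set shared by both rules."""
    return len(a & b) / min(len(a), len(b))

matches = []
for (s1, rules1), (s2, rules2) in combinations(candidate_rules.items(), 2):
    for r1, v1 in rules1.items():
        for r2, v2 in rules2.items():
            if overlap(v1, v2) >= 0.5:         # enough redundancy across sources
                matches.append(((s1, r1), (s2, r2)))

matched = {rule for pair in matches for rule in pair}
pruned = [(s, r) for s, rules in candidate_rules.items()
          for r in rules if (s, r) not in matched]
print(matches)   # [(('src1', 'r1'), ('src2', 'r1'))]
print(pruned)    # rules with no redundant counterpart get discarded
```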


international world wide web conferences | 2013

A framework for learning web wrappers from the crowd

Valter Crescenzi; Paolo Merialdo; Disheng Qiu

The development of solutions to scale the extraction of data from Web sources is still a challenging issue. High accuracy can be achieved by supervised approaches, but the costs of training data, i.e., annotations over a set of sample pages, limit their scalability. Crowdsourcing platforms are making the manual annotation process more affordable. However, the tasks demanded to these platforms should be extremely simple, to be performed by non-expert people, and their number should be minimized, to contain the costs. We introduce a framework to support a supervised wrapper inference system with training data generated by the crowd. Training data are labeled values generated by means of membership queries, the simplest form of queries, posed to the crowd. We show that the costs of producing the training data are strongly affected by the expressiveness of the wrapper formalism and by the choice of the training set. Traditional supervised wrapper inference approaches use a statically defined formalism, assuming it is able to express the wrapper. Conversely, we present an inference algorithm that dynamically chooses the expressiveness of the wrapper formalism and actively selects the training set, while minimizing the number of membership queries to the crowd. We report the results of experiments on real web sources to confirm the effectiveness and the feasibility of the approach.
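
An illustrative, much simplified reading of the active-learning loop: candidate wrappers form a version space, and the value on which the surviving candidates disagree most evenly is posed as a yes/no membership query to the crowd, simulated here by an oracle function. All names and data are hypothetical, and the dynamic choice of wrapper formalism is not modeled.

```python
from collections import Counter

# Each candidate wrapper is modeled by the set of values it would extract.
candidates = {
    "w1": {"Rome", "Milan", "Naples"},
    "w2": {"Rome", "Milan", "Italy"},    # over-general: grabs the country too
    "w3": {"Rome", "Milan"},             # over-restrictive: misses Naples
}
target = {"Rome", "Milan", "Naples"}     # ground truth known to the crowd

def membership_oracle(value):
    """Stand-in for a crowd worker answering 'is this a target value?'."""
    return value in target

queries = 0
while len(candidates) > 1:
    # Pick the value that best splits the version space: the one extracted
    # by roughly half of the surviving candidate wrappers.
    counts = Counter(v for vals in candidates.values() for v in vals)
    splitters = [v for v, c in counts.items() if c < len(candidates)]
    if not splitters:
        break                            # candidates are indistinguishable
    value = min(splitters, key=lambda v: abs(counts[v] - len(candidates) / 2))
    answer = membership_oracle(value)
    queries += 1
    candidates = {w: vals for w, vals in candidates.items()
                  if (value in vals) == answer}

print(queries, candidates)  # converges to w1 with few membership queries
```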

Collaboration


Dive into Paolo Merialdo's collaborations.

Top Co-Authors

Paolo Atzeni

Sapienza University of Rome
