
Publication


Featured research published by Valter Crescenzi.


Journal of the ACM | 2004

Automatic information extraction from large websites

Valter Crescenzi; Giansalvatore Mecca

Information extraction from websites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in the presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised---that is, fully automatic---wrapper generation algorithms. However, due to some unrealistic assumptions on the input, these algorithms are not practically applicable to Web information extraction tasks. The main contributions of the article stand in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, unlike other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction purposes. A system based on the techniques described in the article has been implemented in a working prototype. We present some experimental results on known websites, and discuss opportunities and limitations of the proposed approach.
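In the article's sense, a wrapper is a grammar matched against a page to locate its data fields. A minimal sketch, using a plain regular expression with named capturing groups as a toy stand-in for the prefix mark-up formalism (not the paper's algorithm):

```python
import re

# A toy wrapper: the fixed parts of the template are literal text,
# the data fields are capturing groups (hypothetical page layout).
wrapper = re.compile(
    r"<li><b>(?P<title>[^<]+)</b> by (?P<author>[^<]+)</li>"
)

page = """
<ul>
<li><b>Automatic information extraction</b> by V. Crescenzi</li>
<li><b>Grammars have exceptions</b> by G. Mecca</li>
</ul>
"""

records = [m.groupdict() for m in wrapper.finditer(page)]
print(records)
# [{'title': 'Automatic information extraction', 'author': 'V. Crescenzi'},
#  {'title': 'Grammars have exceptions', 'author': 'G. Mecca'}]
```

The point of the article is that, for prefix mark-up languages, such a wrapper can be learned from unlabeled pages alone in polynomial time.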


Information Systems | 1998

Grammars have exceptions

Valter Crescenzi; Giansalvatore Mecca

Extending database-like techniques to semi-structured and Web data sources is becoming a prominent research field. These data sources are essentially collections of textual documents. Hence, in this context, one of the key tasks consists in wrapping documents to build database abstractions of their content that can be manipulated using high-level tools. However, the degree of heterogeneity and the lack of structure make standard grammar parsers excessively rigid, and often unable to capture the richness of constructs in these documents. This paper presents Minerva, a formalism for writing wrappers around Web sites and other textual data sources. The key feature of Minerva is the attempt to couple the benefits of a declarative, grammar-based approach, with the flexibility of procedural programming. This is done by enriching regular grammars with an explicit exception-handling mechanism. Contributions of the paper stand in the definition of the formalism, and in the description of its implementation, which relies on a number of ad-hoc techniques for parsing documents, among which an extension of the traditional LL(1) policy based on dynamic tokenization.
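The coupling of a declarative rule with a procedural exception handler can be sketched as follows (hypothetical rule and field names; Minerva's actual formalism is grammar-based, not Python):

```python
import re

def parse_price(token):
    """Sketch of grammar-with-exceptions parsing: a declarative rule
    handles the regular case; an attached exception handler recovers
    procedurally when the rule fails on a noisy document."""
    m = re.fullmatch(r"\$(\d+\.\d{2})", token)
    if m:                                    # the grammar rule succeeds
        return float(m.group(1))
    # the exception handler: strip non-numeric noise and retry once
    cleaned = re.sub(r"[^0-9.]", "", token)
    if cleaned:
        return float(cleaned)
    raise ValueError(f"unparseable price: {token!r}")

print(parse_price("$12.50"))      # regular case: 12.5
print(parse_price("USD 12.50 "))  # recovered by the handler: 12.5
```

The handler fires only on tokens the rigid rule rejects, which is the flexibility the paper argues standard grammar parsers lack.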


international conference on management of data | 2002

RoadRunner: automatic data extraction from data-intensive web sites

Valter Crescenzi; Giansalvatore Mecca; Paolo Merialdo

Data extraction from HTML pages is performed by software modules, usually called wrappers. Roughly speaking, a wrapper identifies and extracts relevant pieces of text inside a web page, and reorganizes them in a more structured format. In the literature there are a number of systems to (semi-)automatically generate wrappers for HTML pages [1]. We have recently investigated original approaches that aim at pushing further the level of automation of the wrapper generation process. Our main intuition is that, in a data-intensive web site, pages can be classified into a small number of classes, such that pages belonging to the same class share a rather tight structure. Based on this observation, we have studied a novel technique, which we call the matching technique [2], that automatically generates a common wrapper by exploiting similarities and differences among pages of the same class. In addition, in order to deal with the complexity and the heterogeneities of real-life web sites, we have also studied several complementary techniques that greatly enhance the effectiveness of matching. Our demonstration presents RoadRunner, a prototype that implements matching and its companion techniques. We have conducted several experiments on pages from real-life web sites; these experiments have shown the effectiveness of the approach, as well as the efficiency of the system [2]. The matching technique for wrapper inference [2] is based on an iterative process; at every step, matching works on two objects at a time: (i) an input page, which is represented as a list of tokens (each token is either a tag or a text field), and (ii) a wrapper, expressed as a regular expression. The process starts by taking one input page as an initial version of the wrapper; then, the wrapper is matched against the sample and progressively refined to solve mismatches: a mismatch happens when some token in the sample does not comply with the grammar specified by the wrapper. Mismatches are solved by generalizing the wrapper. The process succeeds if a common wrapper can be generated by solving all mismatches encountered.
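A drastically simplified sketch of matching, assuming token lists of equal length and generalizing only string mismatches (the real technique also infers optionals and iterators, and works on regular expressions):

```python
def infer_wrapper(pages):
    """Start from the first page's token list as the wrapper, then
    generalize every string mismatch against the other pages into a
    #PCDATA data field. Tag mismatches and length differences, which
    RoadRunner handles with optionals/iterators, are omitted here."""
    wrapper = list(pages[0])
    for page in pages[1:]:
        for i, (w, t) in enumerate(zip(wrapper, page)):
            if w != t:              # string mismatch: generalize to a field
                wrapper[i] = "#PCDATA"
    return wrapper

def extract(wrapper, page):
    """Return the values of the #PCDATA fields of a matching page."""
    return [t for w, t in zip(wrapper, page) if w == "#PCDATA"]

p1 = ["<html>", "<b>", "Title:", "</b>", "DB Primer", "</html>"]
p2 = ["<html>", "<b>", "Title:", "</b>", "XML at Work", "</html>"]
w = infer_wrapper([p1, p2])
print(w)               # ['<html>', '<b>', 'Title:', '</b>', '#PCDATA', '</html>']
print(extract(w, p2))  # ['XML at Work']
```

The generalized wrapper then extracts the variant fields from any further page of the same class.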


data and knowledge engineering | 2005

Clustering web pages based on their structure

Valter Crescenzi; Paolo Merialdo; Paolo Missier

Several techniques have been recently proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages, and transform them into a more structured format, typically in XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a target Web site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.
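The idea of clustering pages by abstract structural features can be sketched like this (a hypothetical fingerprint of root-to-tag paths and a greedy grouping; not the paper's model or its crawling algorithm):

```python
def tag_paths(tokens):
    """Hypothetical structural fingerprint: the set of root-to-tag
    paths occurring in a page's tag sequence."""
    stack, paths = [], set()
    for t in tokens:
        if t.startswith("</"):       # closing tag: shorten the path
            if stack:
                stack.pop()
        elif t.startswith("<"):      # opening tag: extend the path
            stack.append(t)
            paths.add("/".join(stack))
    return paths

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster_pages(pages, threshold=0.8):
    """Greedy single-pass clustering by fingerprint similarity."""
    clusters = []                    # list of (representative fingerprint, members)
    for name, tokens in pages.items():
        fp = tag_paths(tokens)
        for rep, members in clusters:
            if jaccard(fp, rep) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((fp, [name]))
    return [members for _, members in clusters]

pages = {
    "author1.html": ["<html>", "<table>", "</table>", "</html>"],
    "author2.html": ["<html>", "<table>", "</table>", "</html>"],
    "index.html":   ["<html>", "<ul>", "</ul>", "</html>"],
}
print(cluster_pages(pages))
# [['author1.html', 'author2.html'], ['index.html']]
```

Pages sharing an HTML template end up in the same class, which can then be fed to a wrapper inducer.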


conference on advanced information systems engineering | 2010

Probabilistic models to reconcile complex data from inaccurate data sources

Lorenzo Blanco; Valter Crescenzi; Paolo Merialdo; Paolo Papotti

Several techniques have been developed to extract and integrate data from web sources. However, web data are inherently imprecise and uncertain. This paper addresses the issue of characterizing the uncertainty of data extracted from a number of inaccurate sources. We develop a probabilistic model to compute a probability distribution for the extracted values, and the accuracy of the sources. Our model considers the presence of sources that copy their contents from other sources, and manages the misleading consensus produced by copiers. We extend the models previously proposed in the literature by working on several attributes at a time to better leverage all the available evidence. We also report the results of several experiments on both synthetic and real-life data to show the effectiveness of the proposed approach.
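The core fixpoint of accuracy-weighted truth discovery can be sketched as follows (a minimal iterative scheme; the paper's contributions, copier detection and multi-attribute evidence, are deliberately omitted):

```python
from collections import defaultdict

def reconcile(claims, iterations=10):
    """Alternate between (i) scoring each candidate value by the summed
    accuracy of the sources claiming it, and (ii) re-estimating each
    source's accuracy as the fraction of its claims that match the
    current consensus. claims: {source: {object: value}}."""
    accuracy = {s: 0.8 for s in claims}          # arbitrary initial prior
    truth = {}
    for _ in range(iterations):
        belief = defaultdict(lambda: defaultdict(float))
        for s, objs in claims.items():
            for o, v in objs.items():
                belief[o][v] += accuracy[s]
        truth = {o: max(vals, key=vals.get) for o, vals in belief.items()}
        for s, objs in claims.items():
            accuracy[s] = sum(truth[o] == v for o, v in objs.items()) / len(objs)
    return truth, accuracy

claims = {"s1": {"x": 1, "y": 2},
          "s2": {"x": 1, "y": 2},
          "s3": {"x": 9, "y": 2}}
truth, acc = reconcile(claims)
print(truth)        # {'x': 1, 'y': 2}
print(acc["s3"])    # 0.5
```

Without modeling copiers, a majority of sources copying the same wrong value would dominate this scheme, which is exactly the misleading consensus the paper addresses.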


very large data bases | 2013

Extraction and integration of partially overlapping web sources

Mirko Bronzi; Valter Crescenzi; Paolo Merialdo; Paolo Papotti

We present an unsupervised approach for harvesting the data exposed by a set of structured and partially overlapping data-intensive web sources. Our proposal comes within a formal framework tackling two problems: the data extraction problem, to generate extraction rules based on the input websites, and the data integration problem, to integrate the extracted data in a unified schema. We introduce an original algorithm, WEIR, to solve the stated problems and formally prove its correctness. WEIR leverages the overlapping data among sources to make better decisions both in the data extraction (by pruning rules that do not lead to redundant information) and in the data integration (by reflecting local properties of a source over the mediated schema). Along the way, we characterize the amount of redundancy needed by our algorithm to produce a solution, and present experimental results to show the benefits of our approach with respect to existing solutions.
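The redundancy intuition behind WEIR can be illustrated with a toy value-overlap check (hypothetical representation; the actual algorithm generates the rules, proves correctness, and builds a mediated schema):

```python
def align_rules(rule_values, min_overlap=2):
    """Rules from *different* sources whose extracted value sets overlap
    are likely to publish the same attribute; unconfirmed rules can be
    pruned. rule_values: {(source, rule_id): set of extracted values}."""
    matches = []
    keys = list(rule_values)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if a[0] == b[0]:
                continue                      # same source: no redundancy evidence
            overlap = rule_values[a] & rule_values[b]
            if len(overlap) >= min_overlap:   # arbitrary threshold for the sketch
                matches.append((a, b))
    return matches

rule_values = {
    ("siteA", "r1"): {"Casablanca", "Vertigo", "Gone with the Wind"},
    ("siteB", "q1"): {"Casablanca", "Vertigo", "Psycho"},
    ("siteA", "r2"): {"1939", "1942"},        # no cross-source support
}
print(align_rules(rule_values))
# [(('siteA', 'r1'), ('siteB', 'q1'))]
```

Rules with no cross-source support, like `("siteA", "r2")` above, are the ones the overlap criterion would discard as non-redundant.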


international world wide web conferences | 2013

A framework for learning web wrappers from the crowd

Valter Crescenzi; Paolo Merialdo; Disheng Qiu

The development of solutions to scale the extraction of data from Web sources is still a challenging issue. High accuracy can be achieved by supervised approaches, but the costs of training data, i.e., annotations over a set of sample pages, limit their scalability. Crowdsourcing platforms are making the manual annotation process more affordable. However, the tasks submitted to these platforms should be extremely simple, so that they can be performed by non-expert people, and their number should be minimized, to contain the costs. We introduce a framework to support a supervised wrapper inference system with training data generated by the crowd. Training data are labeled values generated by means of membership queries, the simplest form of queries, posed to the crowd. We show that the costs of producing the training data are strongly affected by the expressiveness of the wrapper formalism and by the choice of the training set. Traditional supervised wrapper inference approaches use a statically defined formalism, assuming it is able to express the wrapper. Conversely, we present an inference algorithm that dynamically chooses the expressiveness of the wrapper formalism and actively selects the training set, while minimizing the number of membership queries to the crowd. We report the results of experiments on real web sources to confirm the effectiveness and the feasibility of the approach.
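Active selection via membership queries can be sketched as version-space filtering (hypothetical setting: candidate wrappers are predicates over values, and `oracle` stands for a yes/no crowd answer; the paper's dynamic choice of the formalism is not modeled):

```python
def infer_by_membership_queries(candidates, oracle, values):
    """Query the crowd only on values that discriminate among the
    surviving candidate wrappers, and drop every wrapper that
    disagrees with the answer."""
    live = list(candidates)
    queries = 0
    for v in values:
        answers = [w(v) for w in live]
        if all(answers) or not any(answers):
            continue                 # v cannot tell the candidates apart: skip
        label = oracle(v)            # one membership query posed to the crowd
        queries += 1
        live = [w for w, a in zip(live, answers) if a == label]
        if len(live) == 1:
            break                    # the wrapper is identified
    return live, queries

# Toy run: the "true" wrapper accepts values ending with a currency symbol.
w1 = lambda v: v.endswith("€")
w2 = lambda v: True
live, queries = infer_by_membership_queries([w1, w2], oracle=w1,
                                            values=["10 €", "foo"])
print(len(live), queries)   # 1 1
```

Skipping non-discriminating values is what keeps the number of (paid) crowd queries low.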


acm symposium on applied computing | 2002

Wrapping-oriented classification of web pages

Valter Crescenzi; Giansalvatore Mecca; Paolo Merialdo

Data extraction from HTML Web pages is performed by software programs called wrappers. Writing wrappers is a costly and labor-intensive task; recently, several proposals have attacked the problem of automatically generating wrappers. In this paper, we study a problem related to the automation of the wrapper generation process: given a portion of a Web site to wrap, we develop techniques to cluster its HTML pages into page classes with homogeneous organization and layout; these classes can become the input to the wrapper generation process. Also, once a wrapper library has been generated for a bunch of Web sites, our techniques can be used to select, for any new page downloaded from these sites, the right wrapper in the library. Based on the proposed techniques we have developed a software prototype, and conducted several experiments on HTML pages from real-life Web sites.


international conference on conceptual modeling | 2001

Automatic Web Information Extraction in the ROADRUNNER System

Valter Crescenzi; Giansalvatore Mecca; Paolo Merialdo

This paper presents RoadRunner, a research project that aims at developing solutions for automatically extracting data from large HTML data sources. The targets of our research are data-intensive Web sites, i.e., HTML-based sites with a fairly complex structure, that publish large amounts of data. The paper describes the top-level software architecture of the RoadRunner system, and the novel research challenges posed by the attempt to automate the information extraction process.


Applied Artificial Intelligence | 2008

WRAPPER INFERENCE FOR AMBIGUOUS WEB PAGES

Valter Crescenzi; Paolo Merialdo

Several studies have concentrated on the generation of wrappers for web data sources. As wrappers can be easily described as grammars, the grammatical inference heritage could play a significant role in this research field. Recent results have identified a new subclass of regular languages, called prefix mark-up languages, that nicely abstract the structures usually found in HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a PTIME unsupervised learning algorithm has been previously developed. Unfortunately, many real-life web pages do not fall in this class of languages. In this article we analyze the roots of the problem and we propose a technique to transform pages in order to bring them into the class of prefix mark-up languages. In this way, we obtain a practical solution without renouncing the formal background defined within the grammatical inference framework. We report on some experiments that we have conducted on real-life web pages to evaluate the approach; the results of this activity demonstrate the effectiveness of the presented techniques.

Collaboration


An overview of Valter Crescenzi's collaborations.

Top Co-Authors


Paolo Atzeni

Sapienza University of Rome


Luca Cabibbo

Sapienza University of Rome
