Juan Raposo | Researchain

Publication

Featured research published by Juan Raposo.

Data and Knowledge Engineering | 2008

Extracting lists of data records from semi-structured web pages

Manuel Álvarez; Alberto Pan; Juan Raposo; Fernando Bellas; Fidel Cacheda

Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. While several previous works have addressed the same problem, most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make some assumptions about how data records are encoded into web pages, which do not always hold in real websites. Finally, we have tested our techniques with a large number of real web sources and have found them to be very effective.

Proceedings of the IFIP TC8 / WG8.1 Working Conference on Engineering Information Systems in the Internet Context | 2002

Semi-Automatic Wrapper Generation for Commercial Web Sources

Alberto Pan; Juan Raposo; Manuel Álvarez; Justo Hidalgo; Ángel Viña

Semi-automatic wrapper generation tools aim to ease the task of building structured views over semi-structured web sources. However, the wrapper generation techniques presented to date are unable to deal properly with sources that require complex navigational sequences for accessing data. In this paper, we present WARGO, a semi-automatic wrapper generation tool, which has been used by non-programmer staff to successfully wrap more than 700 commercial web sources in several industrial applications. We describe our approach to wrapper generation and show the difficulties other systems have in wrapping this kind of source.

Database and Expert Systems Applications | 2002

The Wargo system: semi-automatic wrapper generation in presence of complex data access modes

Juan Raposo; Alberto Pan; Manuel Álvarez; Justo Hidalgo; Ángel Viña

Semi-automatic wrapper generation tools aim to ease the task of building structured views over Web sources. However, the wrapper generation techniques presented to date show several weaknesses when dealing with the complex commercial Web sources of today, especially when constructing advanced navigational sequences for accessing data. We present Wargo, a semi-automatic wrapper generation tool, which has been used by non-programmer staff to successfully wrap more than 700 commercial Web sources in several industrial applications.

Very Large Data Bases | 2002

The Denodo data integration platform

Alberto Pan; Juan Raposo; Manuel Álvarez; Paula Montoto; Vicente Orjales; Justo Hidalgo; Lucía Ardao; Anastasio Molano; Ángel Viña

The world today is characterised by the proliferation of information sources available through media such as the WWW, databases, semi-structured files (e.g. XML documents), etc. Nevertheless, this information is usually scattered, heterogeneous and weakly structured, so it is difficult to process automatically. DENODO Corporation has developed a mediator system for the construction of semi-structured and structured data integration applications. This system has already been used to build several applications on the Internet and in corporate environments, which are currently deployed at several important Internet audience sites and large business corporations. In this extended abstract, we present an overview of the system and put forward some conclusions arising from our experience in building real-world data integration applications, focusing on some challenges that we believe require more attention from the research community.

International Conference on Web Engineering | 2009

Automating Navigation Sequences in AJAX Websites

Paula Montoto; Alberto Pan; Juan Raposo; Fernando Bellas; Javier Lopez

Web automation applications are widely used for different purposes such as B2B integration, automated testing of web applications, or technology and business watch. One crucial requirement of web automation applications is that navigation sequences be easy to generate and reproduce. Previous proposals in the literature assumed a navigation model that has been rendered obsolete by the new breed of AJAX-based websites. Although some open-source and commercial tools have also addressed the problem, they show significant limitations either in usability or in their ability to deal with complex websites. In this paper, we propose a set of new techniques to deal with this problem. Our main contributions are a new method for recording navigation sequences that supports a wider range of events, and a novel method to detect when the effects caused by a user action have finished. We have evaluated our approach with more than 100 web applications, obtaining very good results.

Web Age Information Management | 2006

Crawling web pages with support for client-side dynamism

Manuel Álvarez; Alberto Pan; Juan Raposo; Justo Hidalgo

There is a great amount of information on the web that cannot be accessed by conventional crawler engines. This portion of the web is usually known as the Hidden Web. To deal with this problem, it is necessary to solve two tasks: crawling the client-side hidden web and crawling the server-side hidden web. In this paper, we present an architecture and a set of related techniques for accessing the information placed in web pages with support for client-side dynamism, dealing with aspects such as JavaScript technology, non-standard session maintenance mechanisms, client redirections, pop-up menus, etc. Our approach leverages current browser APIs and implements novel crawling models and algorithms.

Data and Knowledge Engineering | 2011

Automated browsing in AJAX websites

Paula Montoto; Alberto Pan; Juan Raposo; Fernando Bellas; Javier Lopez

Web automation applications are widely used for different purposes such as B2B integration, automated testing of web applications, or technology and business watch. One crucial requirement of web automation applications is that they can easily generate and reproduce navigation sequences. This problem is especially complicated in the case of the new breed of AJAX-based websites. Although some tools have recently addressed the problem, they show some limitations either in usability or in their ability to deal with complex websites. In this paper, we propose a set of new techniques to build an automatic web navigation system able to deal with these complexities. Our main contributions are: a new method for recording navigation sequences able to scale to a wider range of events, an algorithm to identify the target element of a user action in a change-resilient manner, and a novel method to detect when the effects caused by a user action (including the effects of scripting code and AJAX requests) have finished. In addition, we have tested our approach with a large number of real web sources and have compared it with other relevant web automation tools, obtaining very good results.
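
The abstract does not say how the end of a user action's effects is detected. As a rough illustration only, one plausible approach is to track every asynchronous task (AJAX request, timer) that the action triggers and to treat the action as finished once none remain pending. The class `QuiescenceTracker` and the task identifiers below are hypothetical names, not taken from the paper.

```python
class QuiescenceTracker:
    """Hypothetical helper: tracks asynchronous work triggered by a user action."""

    def __init__(self):
        self.pending = set()  # identifiers of tasks that have started but not finished

    def started(self, task_id):
        # Called when a request or timer is scheduled.
        self.pending.add(task_id)

    def finished(self, task_id):
        # Called when the request completes or the timer fires.
        self.pending.discard(task_id)

    def is_quiet(self):
        # The action's effects are considered finished when nothing is pending.
        return not self.pending


tracker = QuiescenceTracker()
tracker.started("xhr-1")      # assumed: the click fires an AJAX request
tracker.started("timer-7")    # assumed: the click also schedules a timer
tracker.finished("xhr-1")
still_busy = not tracker.is_quiet()   # the timer has not fired yet
tracker.finished("timer-7")
done = tracker.is_quiet()             # nothing pending any more
```

A real browser-based implementation would have to hook into the browser's request and timer machinery; this sketch shows only the bookkeeping.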

Signal Processing Systems | 2010

Finding and Extracting Data Records from Web Pages

Manuel Álvarez; Alberto Pan; Juan Raposo; Fernando Bellas; Fidel Cacheda

Many HTML pages are generated by software programs that query an underlying database and then fill in a template with the data. In these situations the meta-information about the data structure is lost, so automated software programs cannot process these data as powerfully as they can process information from databases. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method needs only one input page. It starts by identifying the data region of interest in the page. The region is then partitioned into records by using a clustering method that groups similar subtrees in the DOM tree of the page. Finally, the attributes of the data records are extracted by using a method based on multiple string alignment. We have tested our techniques with a large number of real web sources, obtaining high precision and recall values.
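
The record-partitioning step can be illustrated with a minimal sketch. It assumes well-formed markup, an already identified data region (the `ul` element), and a simple greedy grouping based on the similarity of tag sequences. The similarity measure, the threshold and all function names are illustrative assumptions, not the paper's algorithm, and the multiple string alignment step is replaced by a plain read of each child's text.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def tag_sequence(node):
    """Flatten a subtree into the pre-order sequence of its tag names."""
    return [el.tag for el in node.iter()]


def similarity(a, b):
    """Similarity in [0, 1] between the tag sequences of two subtrees."""
    return SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def cluster_children(region, threshold=0.7):
    """Greedily group the child subtrees of a region by structural similarity."""
    clusters = []
    for child in region:
        for cluster in clusters:
            # Compare against the first member of each existing cluster.
            if similarity(cluster[0], child) >= threshold:
                cluster.append(child)
                break
        else:
            clusters.append([child])
    return clusters


PAGE = """<div>
<h1>Results</h1>
<ul>
<li><b>Book A</b><span>10</span></li>
<li><b>Book B</b><span>12</span></li>
<li><b>Book C</b></li>
</ul>
</div>"""

root = ET.fromstring(PAGE)
region = root.find("ul")            # assumed: the data region is already known
clusters = cluster_children(region)
records = max(clusters, key=len)    # the largest cluster is taken as the record list
values = [[el.text for el in rec] for rec in records]
```

The third item lacks a price element but still falls in the same cluster, which is the kind of optional-field variation a similarity threshold is meant to tolerate.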

IEEE International Conference on e-Commerce Technology for Dynamic E-Business | 2004

Client-side deep Web data extraction

Manuel Álvarez; Alberto Pan; Juan Raposo; Ángel Viña

The problem of data extraction from the deep Web can be divided into two tasks: crawling the client-side deep Web and crawling the server-side deep Web. The objective is to define an architecture and a set of related techniques to access the information placed in the client-side deep Web. This involves dealing with aspects such as JavaScript technology, non-standard session maintenance mechanisms, client redirections, pop-up menus, etc. We use current browser APIs as building blocks and leverage them to implement novel crawling models and algorithms.

ACM Symposium on Applied Computing | 2005

Automatic wrapper maintenance for semi-structured web sources using results from previous queries

Juan Raposo; Alberto Pan; Manuel Álvarez; Ángel Viña

In recent years, significant attention has been paid to the problem of building wrappers for extracting data from semi-structured web sources. Nevertheless, since web sources are autonomous, they may undergo changes that invalidate the wrappers. In this paper, we present new heuristics and algorithms to address the problem of automatic wrapper maintenance. Our approach is based on collecting query results during wrapper operation and using them later to generate new sets of examples that can be used to induce a new wrapper when the source changes.
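
The idea of reusing earlier query results can be sketched as follows. The sketch assumes that stored results are dictionaries of field values and that a record counts as a fresh example when every one of its values still appears in the changed page. The function name `relabel_examples` and the sample data are hypothetical, and the paper's actual heuristics are not reproduced here.

```python
def relabel_examples(stored_results, new_page_text):
    """Return stored records whose field values all still occur in the new page.

    Each returned item pairs the record with the offset at which each value
    was found, which a wrapper-induction step could use as a labelled example.
    """
    examples = []
    for record in stored_results:
        positions = {field: new_page_text.find(value)
                     for field, value in record.items()}
        # str.find returns -1 when the value is absent from the page.
        if all(pos >= 0 for pos in positions.values()):
            examples.append((record, positions))
    return examples


# Results collected while the old wrapper was still working (made-up data).
stored = [
    {"title": "Book A", "price": "10"},
    {"title": "Book Z", "price": "99"},
]
# The source has switched to a table layout, which broke the old wrapper.
new_page = "<table><tr><td>Book A</td><td>10</td></tr></table>"

examples = relabel_examples(stored, new_page)
```

Only the first record survives, because "Book Z" no longer appears in the page; the surviving record and its offsets could then seed the induction of a new wrapper.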
