Luciano Barbosa | Researchain

Publication

Featured research published by Luciano Barbosa.

international world wide web conferences | 2007

Combining classifiers to identify online databases

Luciano Barbosa; Juliana Freire

We address the problem of identifying the domain of online databases. More precisely, given a set F of Web forms automatically gathered by a focused crawler and an online database domain D, our goal is to select from F only the forms that are entry points to databases in D. Having a set of Web forms that serve as entry points to similar online databases is a requirement for many applications and techniques that aim to extract and integrate hidden-Web information, such as meta-searchers, online database directories, hidden-Web crawlers, and form-schema matching and merging. We propose a new strategy that automatically and accurately classifies online databases based on features that can be easily extracted from Web forms. By judiciously partitioning the space of form features, this strategy allows the use of simpler classifiers that can be constructed using learning techniques that are better suited for the features of each partition. Experiments using real Web data in a representative set of domains show that the use of different classifiers leads to high accuracy, precision and recall. This indicates that our modular classifier composition provides an effective and scalable solution for classifying online databases.

international conference on data engineering | 2007

Organizing Hidden-Web Databases by Clustering Visible Web Documents

Luciano Barbosa; Juliana Freire; Altigran Soares da Silva

In this paper we address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, our goal is to cluster the forms according to the database domains to which they belong. We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context - both within and in the neighborhood of forms - as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, our approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes, as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that our strategy generates high-quality clusters - measured both in terms of entropy and F-measure. This indicates that our approach provides an effective and general solution to the problem of organizing hidden-Web databases.
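The two cluster-quality measures named in the abstract above can be sketched as follows. This is a minimal illustration using the common textbook definitions of size-weighted cluster entropy and class-weighted best-match F-measure; the abstract does not spell out the exact formulas used in the paper, so the definitions, the function names, and the domain labels in the usage note are all assumptions.

```python
import math
from collections import Counter


def cluster_entropy(clusters):
    """Size-weighted average entropy (in bits) of gold labels within each cluster.

    `clusters` is a list of lists; each inner list holds the gold domain
    label of every form assigned to that cluster. Lower is better.
    """
    total = sum(len(labels) for labels in clusters)
    result = 0.0
    for labels in clusters:
        if not labels:
            continue  # an empty cluster contributes nothing
        size = len(labels)
        counts = Counter(labels)
        entropy = -sum((k / size) * math.log2(k / size) for k in counts.values())
        result += (size / total) * entropy
    return result


def cluster_f_measure(clusters):
    """Class-weighted F-measure: each gold class is matched to its best cluster.

    Higher is better; 1.0 means every class maps exactly onto one cluster.
    """
    total = sum(len(labels) for labels in clusters)
    class_sizes = Counter(label for labels in clusters for label in labels)
    score = 0.0
    for label, class_size in class_sizes.items():
        best = 0.0
        for labels in clusters:
            overlap = labels.count(label)
            if overlap == 0:
                continue
            precision = overlap / len(labels)
            recall = overlap / class_size
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (class_size / total) * best
    return score
```

For example, `cluster_entropy([["airfare", "airfare", "hotel"], ["hotel", "hotel"]])` scores a clustering of five forms with hypothetical gold labels; a clustering in which every cluster holds a single domain has entropy 0 and F-measure 1.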

international joint conference on natural language processing | 2015

Learning Hybrid Representations to Retrieve Semantically Equivalent Questions

Cícero Nogueira dos Santos; Luciano Barbosa; Dasha Bogdanova; Bianca Zadrozny

Retrieving similar questions in online QA (2) BOW-CNN is more robust than the pure CNN for long texts.

very large data bases | 2015

Dexter: large-scale discovery and extraction of product specifications on the web

Disheng Qiu; Luciano Barbosa; Xin Luna Dong; Yanyan Shen; Divesh Srivastava

The web is a rich resource of structured data. There has been an increasing interest in using web structured data for many applications such as data integration, web search and question answering. In this paper, we present Dexter, a system to find product sites on the web, and detect and extract product specifications from them. Since product specifications exist in multiple product sites, our focused crawler relies on search queries and backlinks to discover product sites. To perform the detection, and handle the high diversity of specifications in terms of content, size and format, our system uses supervised learning to classify HTML fragments (e.g., tables and lists) present in web pages as specifications or not. To perform large-scale extraction of the attribute-value pairs from the HTML fragments identified by the specification detector, Dexter adopts two lightweight strategies: a domain-independent and unsupervised wrapper method, which relies on the observation that these HTML fragments have very similar structure; and a combination of this strategy with a previous approach, which infers extraction patterns by annotations generated by automatic but noisy annotators. The results show that our crawler strategy to locate product specification pages is effective: (1) it discovered 1.46M product specification pages from 3,005 sites and 9 different categories; (2) the specification detector obtains high values of F-measure (close to 0.9) over a heterogeneous set of product specifications; and (3) our efficient wrapper methods for attribute-value extraction get very high values of precision (0.92) and recall (0.95) and obtain better results than a state-of-the-art, supervised rule-based wrapper.

web information and data management | 2005

Looking at both the present and the past to efficiently update replicas of web content

Luciano Barbosa; Ana Carolina Salgado; Francisco de A. T. de Carvalho; Jacques Robin; Juliana Freire

Since Web sites are autonomous and independently updated, applications that keep replicas of Web data, such as Web warehouses and search engines, must periodically poll the sites and check for changes. Since this is a resource-intensive task, in order to keep the copies up-to-date, it is important to devise efficient update schedules that adapt to the change rate of the pages and avoid visiting pages not modified since the last visit. In this paper, we propose a new approach that learns to predict the change behavior of Web pages based both on the static features and change history of pages, and refreshes the copies accordingly. Experiments using real-world data show that our technique leads to substantial performance improvements compared to previously proposed approaches.

conference on computational natural language learning | 2015

Detecting Semantically Equivalent Questions in Online User Forums

Dasha Bogdanova; Cícero Nogueira dos Santos; Luciano Barbosa; Bianca Zadrozny

Two questions asking the same thing could be too different in terms of vocabulary and syntactic structure, which makes identifying their semantic equivalence challenging. This study aims to detect semantically equivalent questions in online user forums. We perform an extensive number of experiments using data from two different Stack Exchange forums. We compare standard machine learning methods such as Support Vector Machines (SVM) with a convolutional neural network (CNN). The proposed CNN generates distributed vector representations for pairs of questions and scores them using a similarity metric. We evaluate in-domain word embeddings versus the ones trained with Wikipedia, estimate the impact of the training set size, and evaluate some aspects of domain adaptation. Our experimental results show that the convolutional neural network with in-domain word embeddings achieves high performance even with limited training data.
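The scoring step described above, where a pair of question vectors is compared with a similarity metric, can be illustrated with a short sketch. The abstract does not name the metric, so cosine similarity is an assumption here, and the plain Python lists stand in for the vectors a trained CNN would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector is all zeros, since the angle is undefined.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A pair of questions would then be flagged as semantically equivalent when the score of their two vectors exceeds a threshold tuned on held-out data; the threshold is likewise a hypothetical detail, not something stated in the abstract.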

conference on information and knowledge management | 2008

Siphon++: a hidden-web crawler for keyword-based interfaces

Karane Vieira; Luciano Barbosa; Juliana Freire; Altigran Soares da Silva

The hidden Web consists of data that is generally hidden behind form interfaces, and as such, it is out of reach for traditional search engines. With the goal of leveraging the high-quality information in this largely unexplored portion of the Web, in this paper, we propose a new strategy for automatically retrieving data hidden behind keyword-based form interfaces. Unlike previous approaches to this problem, our strategy adapts the query generation and selection by detecting features of the index. We describe an extensive experimental evaluation which shows that: our strategy is able to derive appropriate queries to obtain high coverage while, at the same time, avoiding the retrieval of redundant data; and it obtains higher coverage and is more efficient than approaches that use a fixed strategy for query generation.

international conference on management of data | 2010

Creating and exploring web form repositories

Luciano Barbosa; Hoa Nguyen; Thanh Hoang Nguyen; Ramesh Pinnamaneni; Juliana Freire

We present DeepPeep (http://www.deeppeep.org), a new system for discovering, organizing and analyzing Web forms. DeepPeep allows users to explore the entry points to hidden-Web sites whose contents are out of reach for traditional search engines. Besides demonstrating important features of DeepPeep and describing the infrastructure we used to build the system, we will show how this infrastructure can be used to create form collections and form search engines for different domains. We also present the analysis component of DeepPeep which allows users to explore and visualize information in form repositories, helping them not only to better search and understand forms in different domains, but also to refine the form gathering process.

data integration in the life sciences | 2007

Automatically constructing a directory of molecular biology databases

Luciano Barbosa; Sumit Tandon; Juliana Freire

There has been an explosion in the volume of biology-related information that is available in online databases. But finding the right information can be challenging. Not only is this information spread over multiple sources, but often, it is hidden behind form interfaces of online databases. There are several ongoing efforts that aim to simplify the process of finding, integrating and exploring these data. However, existing approaches are not scalable, and require substantial manual input. Notable examples include the NCBI databases and the NAR database compilation. As an important step towards a scalable solution to this problem, we describe a new infrastructure that automates, to a large extent, the process of locating and organizing online databases. We show how this infrastructure can be used to automate the construction and maintenance of a Molecular Biology database collection. We also provide an evaluation which shows that the infrastructure is scalable and effective--it is able to efficiently locate and accurately identify the relevant online databases.

international conference on data mining | 2014

Bus Travel Time Predictions Using Additive Models

Matthias Kormaksson; Luciano Barbosa; Marcos R. Vieira; Bianca Zadrozny

Many factors can affect the predictability of public bus services such as traffic, weather, day of week, and hour of day. However, the exact nature of such relationships between travel times and predictor variables is, in most situations, not known. In this paper we develop a framework that allows for flexible modeling of bus travel times through the use of Additive Models. The proposed class of models provides a principled statistical framework that is highly flexible in terms of model building. The experimental results demonstrate uniformly superior performance of our best model as compared to previous prediction methods when applied to a very large GPS data set obtained from buses operating in the city of Rio de Janeiro.
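For readers unfamiliar with the model class, the generic textbook form of an additive model is shown below. This is not the paper's exact specification, which the abstract does not give; the symbols are assumptions, with y standing for travel time, each x_j for a predictor such as hour of day or day of week, and each f_j for a smooth function estimated from the data.

```latex
\mathbb{E}\left[\, y \mid x_1, \dots, x_p \,\right] = \beta_0 + \sum_{j=1}^{p} f_j(x_j)
```

Because each predictor enters through its own function f_j, the shape of its effect on travel time does not have to be fixed in advance, which matches the abstract's point that the exact relationships are usually not known.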
