Publication


Featured research published by Denilson Barbosa.


International World Wide Web Conference | 2003

The XML web: a first study

Laurent Mignet; Denilson Barbosa; Pierangelo Veltri

Although originally designed for large-scale electronic publishing, XML plays an increasingly important role in the exchange of data on the Web. In fact, it is expected that XML will become the lingua franca of the Web, eventually replacing HTML. Not surprisingly, there has been a great deal of interest in XML both in industry and in academia. Nevertheless, to date no comprehensive study of the XML Web (i.e., the subset of the Web consisting of XML documents only) or of its contents has been conducted. This paper is the first attempt at describing the XML Web and the documents it contains. Our results are drawn from a sample of a repository of publicly available XML documents on the Web, consisting of about 200,000 documents. Our results show that, despite its short history, XML already permeates the Web, both across generic domains and geographically. Moreover, our results about the contents of the XML Web provide valuable input for the design of algorithms, tools, and systems that use XML in one form or another.


International Conference on Management of Data | 2002

ToXgene: a template-based data generator for XML

Denilson Barbosa; Alberto O. Mendelzon; John Keenleyside; Kelly A. Lyons

Synthetic collections of XML documents can be useful in many applications, such as benchmarking (e.g., XMark [4], XOO7 [2]) and algorithm testing and evaluation. We present ToXgene, a template-based tool for facilitating the generation of large, consistent collections of synthetic XML documents. ToXgene was designed with the following requirements in mind: it should be declarative, to speed up data generation; it should be general enough to generate fairly complex XML content; and it should be powerful enough to capture the most common kinds of constraints in popular benchmarks. Preliminary experimental results show that our tool can closely reproduce the data sets for the XMark and TPC-H benchmarks [6]. The ToXgene Template Specification Language (TSL) is a subset of the XML Schema notation augmented with annotations for specifying certain properties of the intended data, such as probability distributions, the vocabulary for CDATA content, etc. We use XML Schema as the basis for TSL not only because it is a W3C standard, but also because it provides a more detailed description of XML documents than DTDs; in particular, it allows the specification of datatypes. We note that our tool gives the user total control over the data to be generated; thus, it is intended for the cases when the user knows the structure of the data she wants and requires the data to conform to this structure (however, we note that the structure does not have to be regular). The main features of our tool are:
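The general idea of template-based XML generation can be illustrated with a toy sketch: a template maps each element to its allowed children (with occurrence bounds) and, for leaves, to a vocabulary or a value generator. The template below is hypothetical and far simpler than ToXgene's actual TSL; it only shows the template-driven generation pattern the abstract describes.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical mini-template (not ToXgene's TSL): each element lists its
# children as (tag, min_occurs, max_occurs); leaves carry a text vocabulary
# or a value generator.
TEMPLATE = {
    "catalog": {"children": [("item", 1, 5)]},
    "item": {"children": [("name", 1, 1), ("price", 1, 1)]},
    "name": {"vocab": ["widget", "gadget", "gizmo"]},
    "price": {"gen": lambda: f"{random.uniform(1, 100):.2f}"},
}

def generate(tag):
    """Recursively instantiate an element according to the template."""
    spec = TEMPLATE[tag]
    elem = ET.Element(tag)
    if "vocab" in spec:
        elem.text = random.choice(spec["vocab"])
    if "gen" in spec:
        elem.text = spec["gen"]()
    for child, lo, hi in spec.get("children", []):
        for _ in range(random.randint(lo, hi)):  # uniform occurrence count
            elem.append(generate(child))
    return elem

doc = ET.tostring(generate("catalog"), encoding="unicode")
```

A real tool would additionally support non-uniform probability distributions over occurrence counts and schema-level constraints, which this sketch omits.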


International World Wide Web Conference | 2005

Studying the XML Web: Gathering Statistics from an XML Sample

Denilson Barbosa; Laurent Mignet; Pierangelo Veltri

XML has emerged as the language for exchanging data on the web and has attracted considerable interest both in industry and in academia. Nevertheless, to date, little is known about the XML documents published on the web. This paper presents a comprehensive analysis of a sample of about 200,000 XML documents on the web, and is the first study of its kind. We study the distribution of XML documents across the web in several ways; moreover, we provide a detailed characterization of the structure of real XML documents. Our results provide valuable input to the design of algorithms, tools, and systems that use XML in one form or another.


International Conference on Data Engineering | 2004

Efficient incremental validation of XML documents

Denilson Barbosa; Alberto O. Mendelzon; Leonid Libkin; Laurent Mignet; Marcelo Arenas

We discuss incremental validation of XML documents with respect to DTDs and XML Schema definitions. We consider insertions and deletions of subtrees, as opposed to leaf nodes only, and we also consider the validation of ID and IDREF attributes. For arbitrary schemas, we give an algorithm with O(n log n) worst-case time and linear space, and show that it is often far superior to revalidation from scratch. We present two classes of schemas, which capture most real-life DTDs, and show that they admit a logarithmic-time incremental validation algorithm that, in many cases, requires only constant auxiliary space. We then discuss an implementation of these algorithms that is independent of, and can be customized for, different storage mechanisms for XML. Finally, we present extensive experimental results showing that our approach is highly efficient and scalable.


International Symposium on Wikis and Open Collaboration | 2012

Identifying controversial articles in Wikipedia: a comparative study

Hoda Sepehri Rad; Denilson Barbosa

Wikipedia articles are the result of the collaborative editing of a diverse group of anonymous volunteer editors, who are passionate and knowledgeable about specific topics. One can argue that this plurality of perspectives leads to broader coverage of the topic, thus benefiting the reader. On the other hand, differences among editors on polarizing topics can lead to controversial or questionable content, where facts and arguments are presented and discussed to support a particular point of view. Controversial articles are manually tagged by Wikipedia editors, and span many interesting and popular topics, such as religion, history, and politics, to name a few. Several methods have recently been proposed for automatically identifying controversy in unmarked articles. However, to date, no systematic comparison of these efforts has been made. This is in part because the various methods are evaluated using different criteria and on different sets of articles by different authors, making it hard to verify their efficacy and compare the alternatives. We provide a first attempt at bridging this gap. We compare five different methods for modelling and identifying controversy, and discuss some of the unique difficulties and opportunities inherent to the way Wikipedia is produced.


Conference on Information and Knowledge Management | 2014

Robust Entity Linking via Random Walks

Zhaochen Guo; Denilson Barbosa

Entity Linking is the task of assigning entities from a Knowledge Base to textual mentions of such entities in a document. State-of-the-art approaches rely on lexical and statistical features, which are abundant for popular entities but sparse for unpopular ones, resulting in a clear bias towards popular entities and poor accuracy for less popular ones. In this work, we present a novel approach that is guided by a natural notion of semantic similarity which is less susceptible to such bias. We adopt a unified semantic representation for entities and documents (the probability distribution obtained from a random walk on a subgraph of the knowledge base), which can overcome the feature sparsity issue that affects previous work. Our algorithm continuously updates the semantic signature of the document as mentions are disambiguated, thus focusing the search based on context. Our experimental evaluation uses well-known benchmarks and different samples of a Wikipedia-based benchmark with varying entity popularity; the results illustrate well the bias of previous methods and the superiority of our approach, especially for the less popular entities.
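The semantic signature described above can be sketched as a random walk with restart over a knowledge-base subgraph: the stationary distribution, computed by power iteration, serves as a probability-distribution representation that two contexts can be compared by. The toy adjacency matrix and node choices below are illustrative assumptions, not the paper's actual graph construction.

```python
import numpy as np

# Toy knowledge-base subgraph as an adjacency matrix (undirected, for
# simplicity); the four nodes stand in for entities in the KB.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

def semantic_signature(seed_nodes, restart=0.15, iters=100):
    """Stationary distribution of a random walk restarting at seed_nodes."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)          # row-stochastic transitions
    r = np.zeros(n)
    r[list(seed_nodes)] = 1.0 / len(seed_nodes)   # restart distribution
    p = r.copy()
    for _ in range(iters):
        p = (1 - restart) * (P.T @ p) + restart * r
    return p

# Signatures seeded from two sets of candidate entities; a higher cosine
# similarity suggests more semantically related contexts.
sig_doc = semantic_signature({0, 1})
sig_cand = semantic_signature({2})
similarity = float(sig_doc @ sig_cand) / (np.linalg.norm(sig_doc) * np.linalg.norm(sig_cand))
```

Because the signature is a distribution over KB nodes rather than a bag of lexical features, it remains informative even for entities with little associated text, which is the intuition behind the reduced popularity bias.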


International Conference on Data Engineering | 2010

TASM: Top-k Approximate Subtree Matching

Nikolaus Augsten; Denilson Barbosa; Michael H. Böhlen; Themis Palpanas

We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree, e.g., a DBLP article with 15 nodes, in a large document tree, e.g., DBLP with 26M nodes, using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity, and thus do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper bound on the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results.
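The size-threshold pruning idea can be illustrated in miniature: a single postorder pass computes each subtree's size and retains only subtrees no larger than a threshold tau as candidate matches. This toy version materializes the whole tree for clarity; the actual TASM-postorder algorithm achieves the same pruning with a bounded prefix ring buffer and never holds more than O(tau) nodes in memory.

```python
# Minimal sketch of postorder size-based pruning (illustrative only; the
# Node class and candidates() helper are assumptions, not the paper's API).

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def candidates(root, tau):
    """Return (label, size) of subtree roots with size <= tau, in postorder."""
    keep = []
    def size(node):
        s = 1 + sum(size(c) for c in node.children)  # postorder: children first
        if s <= tau:
            keep.append((node.label, s))             # candidate for edit distance
        return s
    size(root)
    return keep
```

Only the surviving candidates would then be compared against the query with the (expensive) tree edit distance, which is why bounding their size bounds the overall cost.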


International XML Database Symposium | 2004

Information Preservation in XML-to-Relational Mappings

Denilson Barbosa; Juliana Freire; Alberto O. Mendelzon

We study the problem of storing XML documents using relational mappings. We propose a formalization of classes of mapping schemes based on the languages used for defining functions that assign relational databases to XML documents and vice-versa. We also discuss notions of information preservation for mapping schemes; we define lossless mapping schemes as those that preserve the structure and content of the documents, and validating mapping schemes as those in which valid documents can be mapped into legal databases, and all legal databases are (equivalent to) mappings of valid documents. We define one natural class of mapping schemes that captures all mappings in the literature, and show negative results for testing whether such mappings are lossless or validating. Finally, we propose a lossless and validating mapping scheme, and show that it performs well in the presence of updates.


ACM Conference on Hypertext | 2012

Leveraging editor collaboration patterns in Wikipedia

Hoda Sepehri Rad; Aibek Makazhanov; Davood Rafiei; Denilson Barbosa

Predicting the positive or negative attitude of individuals towards each other in a social environment has long been of interest, with applications in many domains. We investigate this problem in the context of the collaborative editing of articles in Wikipedia, showing that there is enough information in the edit history of the articles to predict the attitude of co-editors. We train a model using a distant supervision approach, by labeling interactions between editors as positive or negative depending on how these editors vote for each other in Wikipedia admin elections. We use the model to predict the attitude among other editors, who have neither run nor voted in an election. We validate our model by assessing its accuracy in the tasks of predicting the results of the actual elections and identifying controversial articles. Our analysis reveals that the interactions in co-editing articles can accurately predict votes, although there are differences between positive and negative votes. For instance, the accuracy when predicting negative votes substantially increases by considering longer traces of the edit history. As for predicting controversial articles, we show that exploiting positive and negative interactions during the production of an article provides substantial improvements over previous attempts at detecting controversial articles in Wikipedia.


International World Wide Web Conference | 2007

Adaptive record extraction from web pages

Justin Park; Denilson Barbosa

We describe an adaptive method for extracting records from web pages. Our algorithm combines a weighted tree matching metric with clustering for obtaining data extraction patterns. We compare our method experimentally to the state-of-the-art, and show that our approach is very competitive for rigidly-structured records (such as product descriptions) and far superior for loosely-structured records (such as entries on blogs).
