Boduo Li

University of Massachusetts Amherst

Publication

Featured research published by Boduo Li.

international conference on management of data | 2011

A platform for scalable one-pass analytics using MapReduce

Boduo Li; Edward Mazur; Yanlei Diao; Andrew McGregor; Prashant J. Shenoy

Today's one-pass analytics applications tend to be data-intensive in nature and require the ability to process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well-suited for one-pass analytics, since it is geared towards batch processing and requires the dataset to be fully loaded into the cluster before running analytical queries. This paper examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent-key-based technique to extend such processing to workloads that require a large key-state space. Evaluation of our Hadoop-based prototype using real-world workloads shows that our new platform significantly improves the progress of map tasks, allows the reduce progress to keep up with the map progress, with up to 3 orders of magnitude reduction of internal data spills, and enables results to be returned continuously during the job.
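The contrast the abstract draws between sort-merge and hash-based processing can be illustrated with a minimal Python sketch. This is not the authors' Hadoop-based prototype: the function names, the `report_every` parameter, and the use of a simple sum as the per-key state are all assumptions made for illustration. The hash-based version updates per-key state as each record arrives and can emit partial results during the pass, whereas the sort-based baseline must see every record before it emits anything.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def incremental_counts(
    records: Iterable[Tuple[str, int]], report_every: int = 2
) -> Iterator[Dict[str, int]]:
    """Hash-based one-pass aggregation (illustrative, hypothetical helper).

    Per-key state lives in an in-memory hash table and is updated as each
    record arrives, so a snapshot can be emitted before the input ends.
    """
    state: Dict[str, int] = defaultdict(int)
    seen = 0
    for key, value in records:
        state[key] += value          # O(1) update, no sorting required
        seen += 1
        if seen % report_every == 0:
            yield dict(state)        # partial result, returned mid-stream
    if seen % report_every != 0:
        yield dict(state)            # final result for a trailing partial batch


def sort_merge_counts(records: List[Tuple[str, int]]) -> Dict[str, int]:
    """Batch baseline: sorts all records first, then aggregates by key."""
    out: Dict[str, int] = {}
    for key, value in sorted(records):   # blocks until all input is available
        out[key] = out.get(key, 0) + value
    return out


stream = [("a", 1), ("b", 2), ("a", 3)]
snapshots = list(incremental_counts(stream))
```

Both functions arrive at the same final aggregate; the difference is that the hash-based generator yields intermediate snapshots, which is the property the abstract describes as returning results continuously during the job. The sketch does not model the frequent-key technique or spilling to disk.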

international conference on management of data | 2010

PODS: a new model and processing algorithms for uncertain data streams

Thanh T. L. Tran; Liping Peng; Boduo Li; Yanlei Diao; Anna Liu

Uncertain data streams, where data is incomplete, imprecise, and even misleading, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the PODS system that supports stream processing for uncertain data naturally captured using continuous random variables. PODS employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for complex relational operators, i.e., aggregates and joins, by exploring advanced statistical theory and approximation. Evaluation results show that our techniques can achieve high performance while satisfying accuracy requirements, and significantly outperform a state-of-the-art sampling method. A case study further shows that our techniques can enable a tornado detection system (for the first time) to produce detection results at stream speed and with much improved quality.

ACM Transactions on Database Systems | 2012

SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce

Boduo Li; Edward Mazur; Yanlei Diao; Andrew McGregor; Prashant J. Shenoy

Today’s one-pass analytics applications tend to be data-intensive in nature and require the ability to process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets using a cluster of machines. However, the traditional MapReduce model is not well-suited for one-pass analytics, since it is geared towards batch processing and requires the dataset to be fully loaded into the cluster before running analytical queries. This article examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent key based technique to extend such processing to workloads that require a large key-state space. Evaluation of our Hadoop-based prototype using real-world workloads shows that our new platform significantly improves the progress of map tasks, allows the reduce progress to keep up with the map progress, with up to 3 orders of magnitude reduction of internal data spills, and enables results to be returned continuously during the job.

ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011

Towards Scalable One-Pass Analytics Using MapReduce

Edward Mazur; Boduo Li; Yanlei Diao; Prashant J. Shenoy

An integral part of many data-intensive applications is the need to collect and analyze enormous datasets efficiently. Concurrent with such application needs is the increasing adoption of MapReduce as a programming model for processing large datasets using a cluster of machines. Current MapReduce systems, however, require the dataset to be loaded into the cluster before running analytical queries, and thereby incur high delays to start query processing. Furthermore, existing systems are geared towards batch processing. In this paper, we seek to answer a fundamental question: what architectural changes are necessary to bring the benefits of the MapReduce computation model to incremental, one-pass analytics, i.e., to support stream processing and online aggregation? To answer this question, we first conduct a detailed empirical performance study of current MapReduce implementations, including Hadoop and MapReduce Online, using a variety of workloads. By doing so, we identify several drawbacks of existing systems for one-pass analytics. Based on the insights from our study, we conclude by listing key design requirements, arguing for architectural changes to MapReduce systems that overcome their current limitations and fully embrace incremental one-pass analytics, and showing promising preliminary results.

embedded and real-time computing systems and applications | 2010

Exploiting the Interplay between Memory and Flash Storage in Embedded Sensor Devices

Devesh Agrawal; Boduo Li; Zhao Cao; Deepak Ganesan; Yanlei Diao; Prashant J. Shenoy

Although memory is an important constraint in embedded sensor nodes, existing embedded applications and systems are typically designed to work under the memory constraints of a single platform and do not consider the interplay between memory and flash storage. In this paper, we present the design of a memory-adaptive flash-based embedded sensor system that allows an application to exploit the presence of flash and adapt to different amounts of RAM on the embedded device. We describe how such a system can be exploited by data-centric sensor applications. Our design involves several novel features: flash and memory-efficient storage and indexing, techniques for efficient storage reclamation, and intelligent buffer management to maximize write coalescing. Our results show that our system is highly energy-efficient under different workloads, and can be configured for embedded sensor platforms with memory constraints ranging from a few kilobytes to hundreds of kilobytes.

wireless algorithms systems and applications | 2008

Reliable and Fast Detection of Gradual Events in Wireless Sensor Networks

Liping Peng; Hong Gao; Shengfei Shi; Boduo Li

Event detection is among the most important applications of wireless sensor networks. Because sensor readings do not always represent the true attribute values, previous literature suggested a threshold-based voting mechanism, which collects the votes of all neighbors to disambiguate node failures from events instead of reporting an event directly based on the judgement of a single sensor node. Although such a mechanism significantly reduces false positives, it inevitably introduces false negatives, which lead to a detection delay in the scenario of gradual events. In this paper, we propose a new detection method, bit-string match voting (BMV), which provides a response time close to that of the direct reporting method and a false positive rate even lower than that of the threshold-based voting method. Furthermore, BMV is able to avoid repeated and redundant reports of the same event, thus prolonging the life of the network. Extensive simulations are given to demonstrate and verify the advantages of BMV.
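The trade-off described above between direct reporting and threshold-based voting can be sketched in a few lines of Python. This illustrates only the two baseline schemes, not BMV itself, whose details the abstract does not give. The function names, the `vote_fraction` default of 0.5, and the choice to fall back on the node's own reading when it has no neighbors are all assumptions.

```python
from typing import Sequence


def direct_report(own_reading: float, event_level: float) -> bool:
    """Report as soon as this node's own reading crosses the event level."""
    return own_reading >= event_level


def threshold_vote(
    own_reading: float,
    neighbor_readings: Sequence[float],
    event_level: float,
    vote_fraction: float = 0.5,   # hypothetical threshold, not from the paper
) -> bool:
    """Report only if enough neighbors also observe the event."""
    if own_reading < event_level:
        return False
    if not neighbor_readings:
        return True               # assumption: no neighbors, trust own reading
    votes = sum(1 for r in neighbor_readings if r >= event_level)
    return votes / len(neighbor_readings) >= vote_fraction


# A faulty node reads 90 while its neighbors read normally: voting suppresses
# the false positive that direct reporting would raise.
faulty_direct = direct_report(90.0, 50.0)
faulty_voted = threshold_vote(90.0, [20.0, 21.0, 19.0], 50.0)

# A gradual event has reached this node and one neighbor only: voting delays
# detection (a false negative for now) until more neighbors cross the level.
gradual_direct = direct_report(52.0, 50.0)
gradual_voted = threshold_vote(52.0, [48.0, 49.0, 51.0], 50.0)
```

In the gradual case only one of three neighbors has crossed the event level, so the vote fails even though the event is real, which is the detection delay the abstract attributes to threshold-based voting.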

international conference on data engineering | 2009

iVA-File: Efficiently Indexing Sparse Wide Tables in Community Systems

Boduo Li; Mei Hui; Hong Gao

In community web management systems (CWMS), storage structures inspired by universal tables are being used increasingly to manage sparse datasets. Such a sparse wide table (SWT) typically embodies thousands of attributes, with many of them being undefined in each tuple, and low-dimensional structured similarity search on a combination of numerical and text attributes is a common operation. However, many properties of such wide tables and their associated Web 2.0 services render most multi-dimensional indexing structures irrelevant. Recent studies in this area have mainly focused on improving storage efficiency and on the efficient deployment of inverted indices; so far no new index has been proposed for indexing SWTs. The inverted index is fast for scanning but not efficient in reducing random accesses to the data file, as it captures little information about the content of attribute values. In this paper, we propose the iVA-file, which works on the basis of approximate contents and keeps scanning efficiency within a bounded range. We introduce the nG-signature to approximately represent data strings and improve the existing approximate vectors for numerical values. We also propose an efficient query processing strategy for the iVA-file, which is different from strategies used for existing scan-based indices. To enable the use of different metrics of distance between a query and a tuple that may vary from application to application, the iVA-file has been designed to be metric-oblivious and to provide efficient filter-and-refine search based on any rational metric. Extensive experiments on real datasets show that the iVA-file significantly outperforms existing proposals in query efficiency while maintaining a good update speed.

conference on innovative data systems research | 2009