Thanh T. L. Tran
University of Massachusetts Amherst
Publications
Featured research published by Thanh T. L. Tran.
International Conference on Data Engineering | 2009
Thanh T. L. Tran; Charles A. Sutton; Richard Cocci; Yanming Nie; Yanlei Diao; Prashant J. Shenoy
Recent innovations in RFID technology are enabling large-scale cost-effective deployments in retail, healthcare, pharmaceuticals and supply chain management. The advent of mobile or handheld readers adds significant new challenges to RFID stream processing due to the inherent reader mobility, increased noise, and incomplete data. In this paper, we address the problem of translating noisy, incomplete raw streams from mobile RFID readers into clean, precise event streams with location information. Specifically, we propose a probabilistic model to capture the mobility of the reader, object dynamics, and noisy readings. Our model can self-calibrate by automatically estimating key parameters from observed data. Based on this model, we employ a sampling-based technique called particle filtering to infer clean, precise information about object locations from raw streams from mobile RFID readers. Since inference based on standard particle filtering is neither scalable nor efficient in our settings, we propose three enhancements (particle factorization, spatial indexing, and belief compression) for scalable inference over large numbers of objects and high-volume streams. Our experiments show that our approach can offer 49% error reduction over a state-of-the-art data cleaning approach such as SMURF while also being scalable and efficient.
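As a rough illustration of the particle filtering idea mentioned in the abstract above, the sketch below runs a generic bootstrap particle filter that estimates one object's position on a line from binary read/no-read observations taken by a moving reader. It is not the paper's model and includes none of its three enhancements. The sensor model `read_probability`, its `max_rate` and `reach` parameters, the 0 to 20 coordinate range, and the jitter value are all assumptions made up for this example.

```python
import math
import random

def read_probability(reader_pos, object_pos, max_rate=0.9, reach=2.0):
    # Hypothetical sensor model: the read rate decays smoothly with the
    # distance between reader and object (not taken from the paper).
    d = reader_pos - object_pos
    return max_rate * math.exp(-(d * d) / (2.0 * reach * reach))

def particle_filter(reader_path, readings, n_particles, rng,
                    lo=0.0, hi=20.0, jitter=0.05):
    # Start with particles spread uniformly over the possible locations.
    particles = [rng.uniform(lo, hi) for _ in range(n_particles)]
    for reader_pos, was_read in zip(reader_path, readings):
        # Predict: the object is assumed nearly static, with small random motion.
        particles = [min(hi, max(lo, p + rng.gauss(0.0, jitter)))
                     for p in particles]
        # Weight: likelihood of the observed read / no-read for each particle.
        weights = []
        for p in particles:
            pr = read_probability(reader_pos, p)
            weights.append(pr if was_read else 1.0 - pr)
        # Resample: draw a new particle set in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    # Report the posterior mean as the location estimate.
    return sum(particles) / n_particles

# Simulate a reader sweeping from 0 to 20 and back past an object at 7.0.
rng = random.Random(42)
true_pos = 7.0
forward = [i * 0.25 for i in range(81)]
reader_path = forward + forward[::-1]
readings = [rng.random() < read_probability(r, true_pos) for r in reader_path]

estimate = particle_filter(reader_path, readings, n_particles=2000, rng=rng)
print(f"true position {true_pos}, estimated position {estimate:.2f}")
```

Every particle is reweighted and resampled at every step here, which is the kind of per-object cost that becomes expensive when many objects and high-volume streams are involved.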
International Conference on Database Theory | 2012
Graham Cormode; Cecilia M. Procopiuc; Divesh Srivastava; Thanh T. L. Tran
Differential privacy is fast becoming the method of choice for releasing data under strong privacy guarantees. A standard mechanism is to add noise to the counts in contingency tables derived from the dataset. However, when the dataset is sparse in its underlying domain, this vastly increases the size of the published data, to the point of making the mechanism infeasible. We propose a general framework to overcome this problem. Our approach releases a compact summary of the noisy data with the same privacy guarantee and with similar utility. Our main result is an efficient method for computing the summary directly from the input data, without materializing the vast noisy data. We instantiate this general framework for several summarization methods. Our experiments show that this is a highly practical solution: the summaries are up to 1000 times smaller and can be computed in less than 1% of the time required by standard methods. Finally, our framework works with various data transformations, such as wavelets or sketches.
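The size blow-up described in the abstract above can be seen in a small sketch of the standard baseline, which adds Laplace noise to every cell of a contingency table. This shows only the baseline, not the paper's summarization framework. The example assumes each individual contributes to exactly one cell and that neighboring datasets differ by one individual, so the noise scale is 1/epsilon. The domain size, counts, and epsilon value are invented for illustration.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_counts(sparse_counts, domain_size, epsilon, rng):
    """Add Laplace noise to every cell of a 1-D contingency table.

    sparse_counts maps a cell index to its true count; missing cells are 0.
    Assumption: one individual affects one cell by at most 1 (sensitivity 1).
    """
    scale = 1.0 / epsilon
    return [sparse_counts.get(i, 0) + laplace_noise(scale, rng)
            for i in range(domain_size)]

rng = random.Random(0)
true_counts = {3: 12, 4711: 5, 90210: 1}   # only 3 occupied cells
released = noisy_counts(true_counts, domain_size=100_000, epsilon=0.5, rng=rng)

nonzero_before = len(true_counts)
nonzero_after = sum(1 for v in released if v != 0.0)
print(f"non-zero cells before noise: {nonzero_before}")
print(f"non-zero cells after noise:  {nonzero_after}")
```

Because noise must be added to empty cells as well, the released table has roughly as many non-zero entries as the domain has cells, regardless of how few cells the input actually occupies.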
International Conference on Management of Data | 2010
Thanh T. L. Tran; Liping Peng; Boduo Li; Yanlei Diao; Anna Liu
Uncertain data streams, where data is incomplete, imprecise, and even misleading, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the PODS system that supports stream processing for uncertain data naturally captured using continuous random variables. PODS employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for complex relational operators, i.e., aggregates and joins, by exploring advanced statistical theory and approximation. Evaluation results show that our techniques can achieve high performance while satisfying accuracy requirements, and significantly outperform a state-of-the-art sampling method. A case study further shows that our techniques can enable a tornado detection system (for the first time) to produce detection results at stream speed and with much improved quality.
Very Large Data Bases | 2012
Thanh T. L. Tran; Liping Peng; Yanlei Diao; Andrew McGregor; Anna Liu
Uncertain data streams, where data are incomplete and imprecise, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the CLARO system that supports stream processing for uncertain data naturally captured using continuous random variables. CLARO employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for relational operators by exploring statistical theory and approximation. We also consider query planning for complex queries given an accuracy requirement. Evaluation results show that our techniques can achieve high performance while satisfying accuracy requirements and outperform state-of-the-art sampling methods.
International Conference on Data Engineering | 2008
Richard Cocci; Thanh T. L. Tran; Yanlei Diao; Prashant J. Shenoy
Despite its promise, RFID technology presents numerous challenges, including incomplete data, lack of location and containment information, and very high volumes. In this work, we present a novel data interpretation and compression substrate over RFID streams to address these challenges in enterprise supply-chain environments. Our results show that our inference techniques provide good accuracy while retaining efficiency, and our compression algorithm yields significant reduction in data volume.
Very Large Data Bases | 2010
Thanh T. L. Tran; Andrew McGregor; Yanlei Diao; Liping Peng; Anna Liu
Uncertain data streams are increasingly common in real-world deployments, and monitoring applications require the evaluation of complex queries on such streams. In this paper, we consider complex queries involving conditioning (e.g., selections and group-bys) and aggregation operations on uncertain data streams. To characterize the uncertainty of answers to these queries, one generally has to compute the full probability distribution of each operation used in the query. Computing distributions of aggregates given conditioned tuple distributions is a hard, unsolved problem. Our work employs a new evaluation framework that includes a general data model, approximation metrics, and approximate representations. Within this framework we design fast data-stream algorithms, both deterministic and randomized, for returning approximate distributions with bounded errors as answers to those complex queries. Our experimental results demonstrate the accuracy and efficiency of our approximation techniques and offer insights into the strengths and limitations of deterministic and randomized algorithms.
Very Large Data Bases | 2013
Thanh T. L. Tran; Yanlei Diao; Charles A. Sutton; Anna Liu
Uncertain data management has become crucial in many sensing and scientific applications. As user-defined functions (UDFs) become widely used in these applications, an important task is to capture result uncertainty for queries that evaluate UDFs on uncertain data. In this work, we provide a general framework for supporting UDFs on uncertain data. Specifically, we propose a learning approach based on Gaussian processes (GPs) to compute approximate output distributions of a UDF when evaluated on uncertain input, with guaranteed error bounds. We also devise an online algorithm to compute such output distributions, which employs a suite of optimizations to improve accuracy and performance. Our evaluation using both real-world and synthetic functions shows that our proposed GP approach can outperform the state-of-the-art sampling approach with up to two orders of magnitude improvement for a variety of UDFs.
Conference on Innovative Data Systems Research | 2009
Yanlei Diao; Boduo Li; Anna Liu; Liping Peng; Charles A. Sutton; Thanh T. L. Tran; Michael Zink
arXiv: Databases | 2011
Graham Cormode; Cecilia M. Procopiuc; Divesh Srivastava; Thanh T. L. Tran
Archive | 2009
Yanlei Diao; Boduo Li; Anna Liu; Liping Peng; Charles A. Sutton; Thanh T. L. Tran; Michael Zink