Wuman Luo

Hong Kong University of Science and Technology

Publication

Featured research published by Wuman Luo.

International Conference on Management of Data | 2013

Finding time period-based most frequent path in big trajectory data

Wuman Luo; Haoyu Tan; Lei Chen; Lionel Man Shuan Ni

The rise of GPS-equipped mobile devices has led to the emergence of big trajectory data. In this paper, we study a new path finding query which finds the most frequent path (MFP) during user-specified time periods in large-scale historical trajectory data. We refer to this query as time period-based MFP (TPMFP). Specifically, given a time period T, a source v_s and a destination v_d, TPMFP searches for the MFP from v_s to v_d during T. Though there exist several proposals on defining MFP, they only consider a fixed time period. Most importantly, we find that none of them reflects well people's common-sense notion of an MFP, which can be described by three key properties, namely suffix-optimal (i.e., any suffix of an MFP is also an MFP), length-insensitive (i.e., MFP should not favor shorter or longer paths), and bottleneck-free (i.e., MFP should not contain infrequent edges). A TPMFP with the above properties will not only reveal the common routing preferences of past travelers but also take time effectiveness into consideration. Therefore, our first task is to give a TPMFP definition that satisfies the above three properties. Then, given the comprehensive TPMFP definition, our next task is to find TPMFP over a huge amount of trajectory data efficiently. In particular, we propose efficient search algorithms together with novel indexes to speed up the processing of TPMFP. To demonstrate both the effectiveness and the efficiency of our approach, we conduct extensive experiments using a real dataset containing over 11 million trajectories.
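As a rough illustration of the inputs such a query works with, the sketch below counts how many trajectories traverse each edge within a time period T and reports the least frequent edge on a candidate path. It is not the paper's TPMFP definition or search algorithm; the trajectory format (lists of `(timestamp, vertex)` pairs) and the function names `edge_frequencies` and `bottleneck` are assumptions made for this example.

```python
from collections import Counter


def edge_frequencies(trajectories, t_start, t_end):
    """Count, per directed edge, the trajectories that traverse it within [t_start, t_end].

    Each trajectory is a list of (timestamp, vertex) pairs sorted by timestamp
    (a hypothetical input format chosen for this sketch).
    """
    freq = Counter()
    for traj in trajectories:
        seen = set()  # count each edge at most once per trajectory
        for (t0, u), (t1, v) in zip(traj, traj[1:]):
            if t_start <= t0 and t1 <= t_end and u != v:
                seen.add((u, v))
        freq.update(seen)
    return freq


def bottleneck(path, freq):
    """Smallest edge frequency along a path given as a list of vertices."""
    return min(freq[(u, v)] for u, v in zip(path, path[1:]))


trajectories = [
    [(0, "a"), (5, "b"), (9, "d")],
    [(1, "a"), (4, "b"), (8, "d")],
    [(2, "a"), (6, "c"), (20, "d")],  # the c -> d hop ends outside the period
]
freq = edge_frequencies(trajectories, 0, 10)
via_b = bottleneck(["a", "b", "d"], freq)
via_c = bottleneck(["a", "c", "d"], freq)
```

A path whose bottleneck is low contains at least one infrequent edge, which is the situation the bottleneck-free property is meant to rule out.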

Frontiers of Computer Science in China | 2014

MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

Yaobin He; Haoyu Tan; Wuman Luo; Shengzhong Feng; Jianping Fan

DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. We conduct our evaluation using real large datasets with up to 1.2 billion points. The experimental results confirm the efficiency and scalability of MR-DBSCAN.
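For reference, the sequential DBSCAN procedure that such parallel algorithms start from can be sketched as follows. This is the textbook single-machine algorithm with a brute-force neighbor scan, not MR-DBSCAN itself; the cost-based partitioning and MapReduce stages described in the abstract are not shown, and the sample points are made up for illustration.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force eps-neighborhood (includes the point itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Because the work per point depends on how many neighbors it has, dense regions cost far more than sparse ones, which is one way to see why skewed data makes load balancing difficult.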

Conference on Information and Knowledge Management | 2012

CloST: a hadoop-based storage system for big spatio-temporal data analytics

Haoyu Tan; Wuman Luo; Lionel M. Ni

During the past decade, various GPS-equipped devices have generated a tremendous amount of data with time and location information, which we refer to as big spatio-temporal data. In this paper, we present the design and implementation of CloST, a scalable big spatio-temporal data storage system to support data analytics using Hadoop. The main objective of CloST is to avoid scanning the whole dataset when a spatio-temporal range is given. To this end, we propose a novel data model which gives special treatment to three core attributes: an object id, a location and a time. Based on this data model, CloST hierarchically partitions data using all core attributes, which enables efficient parallel processing of spatio-temporal range scans. According to the data characteristics, we devise a compact storage structure which reduces the storage size by an order of magnitude. In addition, we propose scalable bulk loading algorithms capable of incrementally adding new data into the system. We conduct our experiments using a very large GPS log dataset and the results show that CloST has a fast data loading speed, desirable scalability in query processing, as well as a high data compression ratio.

International Conference on Parallel and Distributed Systems | 2010

Data Vitalization: A New Paradigm for Large-Scale Dataset Analysis

Zhang Xiong; Wuman Luo; Lei Chen; Lionel M. Ni

Nowadays, datasets grow enormously both in size and complexity. One of the key issues confronted by large-scale dataset analysis is how to adapt systems to new, unprecedented query loads. Existing systems fix the data organization scheme once and for all at the beginning of the system design, and thus inevitably see performance degrade when user requirements change. In this paper, we propose a new paradigm, Data Vitalization, for large-scale dataset analysis. Our goal is to enable high flexibility such that the system is adaptive to complex analytical applications. Specifically, data are organized into a group of vitalized cells, each of which is a collection of data coupled with computing power. As user requirements change over time, cells evolve spontaneously to meet the potential new query loads. Besides the basic functionality of Data Vitalization, we also explore an envisioned architecture of Data Vitalization including possible approaches for query processing, data evolution, as well as its tightly coupled mechanism for data storage and computing.

Mobile Data Management | 2012

Efficient Similarity Joins on Massive High-Dimensional Datasets Using MapReduce

Wuman Luo; Haoyu Tan; Huajian Mao; Lionel M. Ni

High-dimensional similarity join (HDSJ) is critical for many novel applications in the domain of mobile data management. Nowadays, performing HDSJs efficiently faces two challenges. First, the scale of datasets is increasing rapidly, making parallel computing on a scalable platform a must. Second, the dimensionality of the data can be up to hundreds or even thousands, which brings about the curse of dimensionality. In this paper, we address these challenges and study how to perform parallel HDSJs efficiently in the MapReduce paradigm. In particular, we propose a cost model to demonstrate that it is important to take both communication and computation costs into account as dimensionality and data volume increase. To this end, we propose DAA (Dimension Aggregation Approximation), an efficient compression approach that can help significantly reduce both these costs when performing parallel HDSJs. Moreover, we design DAA-based parallel HDSJ algorithms which can scale up to massive data sizes and very high dimensionality. We perform extensive experiments using both synthetic and real datasets to evaluate the speedup and the scale-up of our algorithms.

International Conference on Distributed Computing Systems | 2014

Exploring the Use of Diverse Replicas for Big Location Tracking Data

Ye Ding; Haoyu Tan; Wuman Luo; Lionel M. Ni

The value of large amounts of location tracking data has received wide attention in many applications including human behavior analysis, urban transportation planning, and various location-based services (LBS). Nowadays, both scientific and industrial communities are encouraged to collect as much location tracking data as possible, which brings about two issues: 1) it is challenging to process the queries on big location tracking data efficiently, and 2) it is expensive to store several exact data replicas for fault-tolerance. So far, several dedicated storage systems have been proposed to address these issues. However, they do not work well when the query ranges vary widely. In this paper, we present the design of a storage system using a diverse replica scheme which improves the query processing efficiency with reduced cost of storage space. To the best of our knowledge, we are the first to investigate the data storage and processing in the context of big location tracking data. Specifically, we conduct an in-depth theoretical and empirical analysis of the trade-offs between different spatio-temporal partitioning schemes as well as data encoding schemes. Then we propose an effective approach to select an appropriate set of diverse replicas, which is optimized for the expected query loads while conforming to the given storage space budget. The experimental results confirm that using diverse replicas can significantly improve the overall query performance. The results also demonstrate that the proposed algorithms for the replica selection problem are both effective and efficient.

Database Systems for Advanced Applications | 2014

Inferring Road Type in Crowdsourced Map Services

Ye Ding; Jiangchuan Zheng; Haoyu Tan; Wuman Luo; Lionel M. Ni

In crowdsourced map services, digital maps are created and updated manually by volunteer users. Existing service providers usually provide users with a feature-rich map editor to add, drop, and modify roads. To make the map data more useful for widely used applications such as navigation systems and travel planning services, it is important to provide not only the topology of the road network and the shapes of the roads, but also the type of each road segment (e.g., highway, regular road, secondary way, etc.). To reduce the cost of manual map editing, it is desirable to generate proper recommendations for users to choose from or modify further. Several recent works aim at generating road shapes from a large number of historical trajectories; however, to the best of our knowledge, none of the existing works has addressed the problem of inferring road types from historical trajectories. In this paper, we propose a model-based approach to infer road types from taxi trajectories. We use a combined inference method based on stacked generalization, taking into account both the topology of the road network and the historical trajectories. The experimental results show that our approach can generate quality recommendations of road types for users to choose from.

Mobile Data Management | 2012

On Packing Very Large R-trees

Haoyu Tan; Wuman Luo; Huajian Mao; Lionel M. Ni

Many emerging mobile applications require analyzing large spatial datasets. In these applications, efficient query processing relies on spatial access methods such as R-trees. For datasets that are fairly static, R-trees are often built as a data loading process using packing techniques. However, traditional R-tree packing algorithms can only run on a single machine and thereby cannot scale to very large datasets. In this paper, we design and implement a general framework for parallel R-tree packing using MapReduce. This framework sequentially packs each R-tree level from the bottom up. For lower levels that have a large number of rectangles, we propose a partition-based algorithm for parallel packing. We also discuss two spatial partitioning methods that can efficiently handle heavily skewed datasets. To evaluate the performance, we conducted extensive experiments using large real datasets. The size of the datasets is up to 100GB and the number of spatial objects is up to 2 billion. Besides range queries, k-nearest neighbor searches and spatial joins are also used for evaluation. To the best of our knowledge, this is the first work that evaluates the query performance of packed R-trees on such large datasets with spatial queries other than range queries. The results confirm the scalability of our proposed framework and parallel packing algorithms. It is also shown that our packed R-trees have good query performance and optimal space utilization.
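To make the idea of packing one R-tree level concrete, here is a minimal single-machine sketch using Sort-Tile-Recursive (STR), one well-known packing technique. The abstract does not say which packing algorithm the framework uses, so STR is an assumption chosen for illustration, as are the names `str_pack_leaves` and `mbr`; the MapReduce partitioning is not shown.

```python
import math


def str_pack_leaves(rects, capacity):
    """Group rectangles (xmin, ymin, xmax, ymax) into leaf nodes of at most `capacity` entries."""
    if not rects:
        return []
    n_leaves = math.ceil(len(rects) / capacity)
    n_slices = math.ceil(math.sqrt(n_leaves))  # number of vertical slices
    slice_size = n_slices * capacity           # rectangles per vertical slice

    def cx(r):
        return (r[0] + r[2]) / 2

    def cy(r):
        return (r[1] + r[3]) / 2

    by_x = sorted(rects, key=cx)
    leaves = []
    for s in range(0, len(by_x), slice_size):
        # Within each vertical slice, order by y-center and cut into full nodes.
        vertical_slice = sorted(by_x[s:s + slice_size], key=cy)
        for k in range(0, len(vertical_slice), capacity):
            leaves.append(vertical_slice[k:k + capacity])
    return leaves


def mbr(group):
    """Minimum bounding rectangle of a group of rectangles."""
    return (min(r[0] for r in group), min(r[1] for r in group),
            max(r[2] for r in group), max(r[3] for r in group))


# A 4 x 4 grid of unit squares packed into nodes of 4 entries each.
rects = [(x, y, x + 1, y + 1) for x in range(4) for y in range(4)]
leaves = str_pack_leaves(rects, capacity=4)
leaf_mbrs = [mbr(leaf) for leaf in leaves]
```

Packing bottom-up then means applying the same step to `leaf_mbrs` to form the next level, and repeating until a single root remains. Every node except possibly the last in each slice is filled to capacity, which is where the high space utilization of packed R-trees comes from.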

Knowledge Discovery and Data Mining | 2012

Exploration of ground truth from raw GPS data

Huajian Mao; Wuman Luo; Haoyu Tan; Lionel M. Ni; Nong Xiao

To enable smart transportation, a large volume of vehicular GPS trajectory data has been collected in the metropolitan-scale Shanghai Grid project. The collected raw GPS data, however, suffers from various errors. Thus, it is inappropriate to use the raw GPS dataset directly for many potential smart transportation applications. Map matching, a process to align the raw GPS data onto the corresponding road network, is a commonly used technique to calibrate the raw GPS data. In practice, however, there is no ground truth data to validate the calibrated GPS data. It is necessary and desirable to have ground truth data to evaluate the effectiveness of various map matching algorithms, especially in complex environments. In this paper, we propose truthFinder, an interactive map matching system for ground truth data exploration. It incorporates traditional map matching algorithms and human intelligence in a unified manner. The accuracy of truthFinder is guaranteed by the observation that a vehicular trajectory can be correctly identified by human labeling with the help of a period of historical GPS data. To the best of our knowledge, truthFinder is the first interactive map matching system that tries to explore the ground truth from historical GPS trajectory data. To measure the cost of human interactions, we design a cost model that classifies and quantifies user operations. With its accuracy guaranteed, truthFinder is evaluated in terms of operation cost. The results show that truthFinder reduces the cost of the map matching process by up to two orders of magnitude compared with the pure human-labeling approach.

Wireless Communications and Mobile Computing | 2018

Diverse Mobile System for Location-Based Mobile Data

Qing Liao; Haoyu Tan; Wuman Luo; Ye Ding

The value of large amounts of location-based mobile data has received wide attention in many research fields including human behavior analysis, urban transportation planning, and various location-based services. Nowadays, both scientific and industrial communities are encouraged to collect as much location-based mobile data as possible, which brings two challenges: how to efficiently process queries on big location-based mobile data, and how to reduce the cost of storage services, because it is too expensive to store several exact data replicas for fault-tolerance. So far, several dedicated storage systems have been proposed to address these issues. However, they do not work well when the ranges of queries vary widely. In this work, we design a storage system based on a diverse replica scheme which can not only improve the query processing efficiency but also reduce the cost of storage space. To the best of our knowledge, this is the first work to investigate the data storage and processing in the context of big location-based mobile data. Specifically, we conduct an in-depth theoretical and empirical analysis of the trade-offs between different spatio-temporal partitioning and data encoding schemes. Moreover, we propose an effective approach to select an appropriate set of diverse replicas, which is optimized for the expected query loads while conforming to the given storage space budget. The experimental results show that using diverse replicas can significantly improve the overall query performance and that the proposed algorithms for the replica selection problem are both effective and efficient.
