Is this you? Create Your Porfile

Haoyu Tan

Hong Kong University of Science and Technology

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Haoyu Tan is active.

Explore More

Publication

Featured researches published by Haoyu Tan.

international conference on management of data | 2013

Finding time period-based most frequent path in big trajectory data

Wuman Luo; Haoyu Tan; Lei Chen; Lionel Man Shuan Ni

The rise of GPS-equipped mobile devices has led to the emergence of big trajectory data. In this paper, we study a new path finding query which finds the most frequent path (MFP) during user-specified time periods in large-scale historical trajectory data. We refer to this query as time period-based MFP (TPMFP). Specifically, given a time period T, a source v_s and a destination v_d, TPMFP searches the MFP from v_s to v_d during T. Though there exist several proposals on defining MFP, they only consider a fixed time period. Most importantly, we find that none of them can well reflect peoples common sense notion which can be described by three key properties, namely suffix-optimal (i.e., any suffix of an MFP is also an MFP), length-insensitive (i.e., MFP should not favor shorter or longer paths), and bottleneck-free (i.e., MFP should not contain infrequent edges). The TPMFP with the above properties will reveal not only common routing preferences of the past travelers, but also take the time effectiveness into consideration. Therefore, our first task is to give a TPMFP definition that satisfies the above three properties. Then, given the comprehensive TPMFP definition, our next task is to find TPMFP over huge amount of trajectory data efficiently. Particularly, we propose efficient search algorithms together with novel indexes to speed up the processing of TPMFP. To demonstrate both the effectiveness and the efficiency of our approach, we conduct extensive experiments using a real dataset containing over 11 million trajectories.

Frontiers of Computer Science in China | 2014

MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

Yaobin He; Haoyu Tan; Wuman Luo; Shengzhong Feng; Jianping Fan

DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. Besides, We conduct our evaluation using real large datasets with up to 1.2 billion points. The experiment results well confirm the efficiency and scalability of MR-DBSCAN.

conference on information and knowledge management | 2012

CloST: a hadoop-based storage system for big spatio-temporal data analytics

Haoyu Tan; Wuman Luo; Lionel M. Ni

During the past decade, various GPS-equipped devices have generated a tremendous amount of data with time and location information, which we refer to as big spatio-temporal data. In this paper, we present the design and implementation of CloST, a scalable big spatio-temporal data storage system to support data analytics using Hadoop. The main objective of CloST is to avoid scan the whole dataset when a spatio-temporal range is given. To this end, we propose a novel data model which has special treatments on three core attributes including an object id, a location and a time. Based on this data model, CloST hierarchically partitions data using all core attributes which enables efficient parallel processing of spatio-temporal range scans. According to the data characteristics, we devise a compact storage structure which reduces the storage size by an order of magnitude. In addition, we proposes scalable bulk loading algorithms capable of incrementally adding new data into the system. We conduct our experiments using a very large GPS log dataset and the results show that CloST has fast data loading speed, desirable scalability in query processing, as well as high data compression ratio.

IEEE Transactions on Parallel and Distributed Systems | 2012

DDC: A Novel Scheme to Directly Decode the Collisions in UHF RFID Systems

Lei Kang; Kaishun Wu; Jin Zhang; Haoyu Tan; Lionel M. Ni

RFID has been gaining popularity due to its variety of applications, such as inventory control and localization. One important issue in RFID system is tag identification. In RFID systems, the tag randomly selects a slot to send a Random Number (RN) packet to contend for identification. Collision happens when multiple tags select the same slot, which makes the RN packet undecodable and thus reduces the channel utilization. In this paper, we redesign the RN pattern to make the collided RNs decodable. By leveraging the collision slots, the system performance can be dramatically enhanced. This novel scheme is called DDC, which is able to directly decode the collisions without exact knowledge of collided RNs. In the DDC scheme, we modify the RN generator in RFID tag and add a collision decoding scheme for RFID reader. We implement DDC in GNU Radio and USRP2 based testbed to verify its feasibility. Both theoretical analysis and testbed experiment show that DDC achieves 40 percent tag read rate gain compared with traditional RFID protocol.

international conference on computer communications | 2010

Chip Error Pattern Analysis in IEEE 802.15.4

Kaishun Wu; Haoyu Tan; Hoilun Ngan; Lionel M. Ni

IEEE 802.15.4 standard specifies physical layer (PHY) and medium access control (MAC) sublayer protocols for low-rate and low-power communication applications. In this protocol, every 4-bit symbol is encoded into a sequence of 32 chips that are actually transmitted over the air. The 32 chips as a whole is also called a pseudonoise code (PN-Code). Due to complex channel conditions such as attenuation and interference, the transmitted PN-Code will often be received with some PN-Code chips corrupted. In this paper, we conduct a systematic analysis on these errors occurring at chip level. We find that there are notable error patterns corresponding to different cases. We then show that recognizing these patterns enables us to identify the channel condition in great details. We believe that understanding what happened to the transmission in our way can potentially bring benefit to channel coding, routing, and error correction protocol design. Finally, we propose Simple Rule, a simple yet effective method based on the chip error patterns to infer the link condition with an accuracy of over 96 percent in our evaluations.

mobile data management | 2012

Efficient Similarity Joins on Massive High-Dimensional Datasets Using MapReduce

Wuman Luo; Haoyu Tan; Huajian Mao; Lionel M. Ni

High-dimensional similarity join (HDSJ) is critical for many novel applications in the domain of mobile data management. Nowadays, performing HDSJs efficiently faces two challenges. First, the scale of datasets is increasing rapidly, making parallel computing on a scalable platform a must. Second, the dimensionality of the data can be up to hundreds or even thousands, which brings about the issue of dimensionality curse. In this paper, we address these challenges and study how to perform parallel HDSJs efficiently in the MapReduce paradigm. Particularly, we propose a cost model to demonstrate that it is important to take both communication and computation costs into account as dimensionality and data volume increases. To this end, we propose DAA (Dimension Aggregation Approximation), an efficient compression approach that can help significantly reduce both these costs when performing parallel HDSJs. Moreover, we design DAA-based parallel HDSJ algorithms which can scale up to massive data sizes and very high dimensionality. We perform extensive experiments using both synthetic and real datasets to evaluate the speedup and the scale up of our algorithms.

international conference on distributed computing systems | 2014

Exploring the Use of Diverse Replicas for Big Location Tracking Data

Ye Ding; Haoyu Tan; Wuman Luo; Lionel M. Ni

The value of large amount of location tracking data has received wide attention in many applications including human behavior analysis, urban transportation planning, and various location-based services (LBS). Nowadays, both scientific and industrial communities are encouraged to collect as much location tracking data as possible, which brings about two issues: 1) it is challenging to process the queries on big location tracking data efficiently, and 2) it is expensive to store several exact data replicas for fault-tolerance. So far, several dedicated storage systems have been proposed to address these issues. However, they do not work well when the query ranges vary widely. In this paper, we present the design of a storage system using diverse replica scheme which improves the query processing efficiency with reduced cost of storage space. To the best of our knowledge, we are the first to investigate the data storage and processing in the context of big location tracking data. Specifically, we conduct in-depth theoretical and empirical analysis of the trade-offs between different spatio-temporal partitioning schemes as well as data encoding schemes. Then we propose an effective approach to select an appropriate set of diverse replicas, which is optimized for the expected query loads while conforming to the given storage space budget. The experiment results confirm that using diverse replicas can significantly improve the overall query performance. The results also demonstrate that the proposed algorithms for the replica selection problem is both effective and efficient.

database systems for advanced applications | 2014

Inferring Road Type in Crowdsourced Map Services

Ye Ding; Jiangchuan Zheng; Haoyu Tan; Wuman Luo; Lionel M. Ni

In crowdsourced map services, digital maps are created and updated manually by volunteered users. Existing service providers usually provide users with a feature-rich map editor to add, drop, and modify roads. To make the map data more useful for widely-used applications such as navigation systems and travel planning services, it is important to provide not only the topology of the road network and the shapes of the roads, but also the types of each road segment (e.g., highway, regular road, secondary way, etc.). To reduce the cost of manual map editing, it is desirable to generate proper recommendations for users to choose from or conduct further modifications. There are several recent works aimed at generating road shapes from large number of historical trajectories; while to the best of our knowledge, none of the existing works have addressed the problem of inferring road types from historical trajectories. In this paper, we propose a model-based approach to infer road types from taxis trajectories. We use a combined inference method based on stacked generalization, taking into account both the topology of the road network and the historical trajectories. The experiment results show that our approach can generate quality recommendations of road types for users to choose from.

mobile data management | 2012

On Packing Very Large R-trees

Haoyu Tan; Wuman Luo; Huajian Mao; Lionel M. Ni

Many emerging mobile applications require analyzing large spatial datasets. In these applications, efficient query processing relies on spatial access methods such as R-trees. For datasets that are fairly static, R-trees are often built as a data loading process using packing techniques. However, traditional R-tree packing algorithms can only run on a single machine and thereby cannot scale to very large datasets. In this paper, we design and implement a general framework for parallel Rtree packing using MapReduce. This framework sequentially packs each R-tree level from bottom up. For lower levels that have a large number of rectangles, we propose a partition based algorithm for parallel packing. We also discuss two spatial partitioning methods that can efficiently handle heavily skewed datasets. To evaluate the performance, we conducted extensive experiments using large real datasets. The size of the datasets is up to 100GB and the number of spatial objects is up to 2 billion. Besides range queries, k-nearest neighbor searches and spatial joins are also used for evaluation. To the best of our knowledge, it is the first work that evaluates the query performance of packed R-trees on such large datasets with spatial queries other than range queries. The results confirm the scalability of our proposed framework and parallel packing algorithms. It is also shown that our packed R-trees have good query performance and optimal space utilization.

international conference on data mining | 2015

Dissecting Regional Weather-Traffic Sensitivity Throughout a City

Ye Ding; Yanhua Li; Ke Deng; Haoyu Tan; Mingxuan Yuan; Lionel M. Ni

The impact of inclement weather to urban traffic has been widely observed and studied for many years, with focus primarily on individual road segments by analyzing data from roadside deployed monitors. However, two fundamental questions are still open: (i) how to identify regional weather-traffic sensitivity index throughout a city, that indicates the degree to which the region traffic in a city is impacted by weather changes, (ii) among complex regional features, such as road structure and population density, how to dissect the most influential regional features that drive the urban region traffic to be more vulnerable to weather changes. Answering these questions is unprecedentedly important for urban planners to understand the functional characteristics of various urban regions throughout a city, and to improve traffic prediction and learn the key factors in urban planning. However, these two questions are nontrivial to answer, because urban traffic changes dynamically over time and is essentially affected by many other factors, which may dominate the overall impact. In this work, we make the first study on these questions, by developing a weather-traffic index (WTI) system. The system includes two main components: WTI establishment and key factor analysis. Using the proposed system, we conducted comprehensive empirical study in Shanghai, and the WTI extracted have been validated to be surprisingly consistent with real world observations. Further regional key factor analysis yields interesting results. For example, house age has significant impact on WTI, which sheds light on future urban planning and reconstruction.

Explore More