Shiyu Yang
University of New South Wales
Publication
Featured research published by Shiyu Yang.
very large data bases | 2015
Shiyu Yang; Muhammad Aamir Cheema; Xuemin Lin; Wei Wang
Given a set of users, a set of facilities and a query facility q, a reverse k nearest neighbors (RkNN) query returns every user u for which the query is one of its k closest facilities. RkNN queries have been extensively studied under a variety of settings and many sophisticated algorithms have been proposed to answer these queries. However, the existing experimental studies suffer from a few limitations. For example, some studies estimate the I/O cost by charging a fixed penalty per I/O and we show that this may be misleading. Also, the existing studies either use an extremely small buffer or no buffer at all, which puts some algorithms at a serious disadvantage. We show that the performance of these algorithms is significantly improved even when a small buffer (containing 100 pages) is used. Finally, in each of the existing studies, the proposed algorithm is mainly compared only with its predecessor, assuming that it was the best algorithm at the time, which is not necessarily true as shown in our experimental study. Motivated by these limitations, we present a comprehensive experimental study that addresses them and compares some of the most notable algorithms under a wide variety of settings. Furthermore, we also present a carefully developed filtering strategy that significantly improves TPL, which is one of the most popular RkNN algorithms. Specifically, the optimized version is up to 20 times faster than the original version and reduces its I/O cost by up to a factor of two.
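The RkNN definition in the abstract above can be made concrete with a brute-force sketch. This is not TPL or any algorithm evaluated in the study; it only restates the definition. The function name `rknn_brute_force`, the 2D point tuples, and the tie-breaking rule (ties count in favour of the query) are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def rknn_brute_force(users, facilities, q, k):
    """Return every user for which q is one of its k closest facilities.

    users, facilities: lists of (x, y) tuples; q: (x, y) tuple that is
    assumed NOT to be contained in `facilities`.
    """
    result = []
    for u in users:
        d_q = dist(u, q)
        # Count facilities strictly closer to u than the query facility.
        # Assumption: a facility at exactly the same distance does not
        # displace q, so ties are resolved in favour of q.
        closer = sum(1 for f in facilities if dist(u, f) < d_q)
        if closer < k:
            result.append(u)
    return result


if __name__ == "__main__":
    users = [(0, 0), (10, 0), (4, 1)]
    facilities = [(1, 0), (9, 0)]
    print(rknn_brute_force(users, facilities, q=(4, 0), k=1))
    print(rknn_brute_force(users, facilities, q=(4, 0), k=2))
```

The sketch costs one distance computation per user-facility pair, which is exactly the cost that pruning-based RkNN algorithms try to avoid.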
international conference on data engineering | 2014
Shiyu Yang; Muhammad Aamir Cheema; Xuemin Lin; Ying Zhang
Given a set of facilities and a set of users, a reverse k nearest neighbors (RkNN) query q returns every user for which the query facility is one of the k closest facilities. Due to its importance, the RkNN query has received significant research attention in the past few years. Almost all of the existing techniques adopt a pruning-and-verification framework. Regions-based pruning and half-space pruning are the two most notable pruning strategies. The half-space based approach prunes a larger area and is generally believed to be superior. Influenced by this perception, almost all existing RkNN algorithms utilize and improve the half-space pruning strategy. We observe the weaknesses and strengths of both strategies and discover that the regions-based pruning has certain strengths that have not been exploited in the past. Motivated by this, we present a new RkNN algorithm called SLICE that utilizes the strength of regions-based pruning and overcomes its limitations. Our extensive experimental study on synthetic and real data sets demonstrates that SLICE is significantly more efficient than the existing algorithms. We also provide a detailed theoretical analysis of various aspects of our algorithm, such as the I/O cost, the unpruned area, and the cost of its verification phase. The experimental study validates our theoretical analysis.
very large data bases | 2017
Shiyu Yang; Muhammad Aamir Cheema; Xuemin Lin; Ying Zhang; Wenjie Zhang
Given a set of facilities and a set of users, a reverse k nearest neighbors (RkNN) query q returns every user for which the query facility is one of the k closest facilities. Almost all of the existing techniques to answer RkNN queries adopt a pruning-and-verification framework. Regions-based pruning and half-space pruning are the two most notable pruning strategies. The half-space-based approach prunes a larger area and is generally believed to be superior. Influenced by this perception, almost all existing RkNN algorithms utilize and improve the half-space pruning strategy. We observe the weaknesses and strengths of both strategies and discover that the regions-based pruning has certain strengths that have not been exploited in the past. Motivated by this, we present a new regions-based pruning algorithm called Slice that utilizes the strength of regions-based pruning and overcomes its limitations. We also study spatial reverse top-k (SRTk) queries that return every user u for which the query facility is one of the top-k facilities according to a given linear scoring function. We first extend half-space-based pruning to answer SRTk queries. Then, we propose a novel regions-based pruning algorithm following the Slice framework to solve the problem. Our extensive experimental study on synthetic and real data sets demonstrates that Slice is significantly more efficient than all existing RkNN and SRTk algorithms.
very large data bases | 2016
Longbin Lai; Lu Qin; Xuemin Lin; Ying Zhang; Lijun Chang; Shiyu Yang
Subgraph enumeration aims to find all the subgraphs of a large data graph that are isomorphic to a given pattern graph. As the subgraph isomorphism operation is computationally intensive, researchers have recently focused on solving this problem in distributed environments, such as MapReduce and Pregel. Among them, the state-of-the-art algorithm, TwinTwigJoin, is proven to be instance optimal based on a left-deep join framework. However, it is still not scalable to large graphs because of the constraints in the left-deep join framework and the requirement that each decomposed component (join unit) be a star. In this paper, we propose SEED, a scalable subgraph enumeration approach in the distributed environment. Compared to TwinTwigJoin, SEED returns an optimal solution in a generalized join framework without the constraints in TwinTwigJoin. We use both star and clique as the join units, and design an effective distributed graph storage mechanism to support such an extension. We develop a comprehensive cost model that estimates the number of matches of any given pattern graph by considering the power-law degree distribution in the data graph. We then generalize the left-deep join framework and develop a dynamic-programming algorithm to compute an optimal bushy join plan. We also consider overlaps among the join units. Finally, we propose clique compression to further improve the algorithm by reducing the number of intermediate results. Extensive performance studies are conducted on several real graphs, one containing billions of edges. The results demonstrate that our algorithm outperforms all other state-of-the-art algorithms by more than one order of magnitude.
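To illustrate what the subgraph enumeration problem above asks for, here is a single-machine brute-force sketch. It is not SEED or TwinTwigJoin and uses no join plan or distributed storage; it simply tries every injective vertex mapping. The function name `enumerate_subgraphs` and the edge-list input format are hypothetical, and both graphs are assumed to be undirected, simple, and tiny.

```python
from itertools import permutations


def enumerate_subgraphs(data_edges, pattern_edges):
    """Return the distinct subgraphs of the data graph that match the pattern.

    Each match is a frozenset of undirected edges (frozensets of two
    vertices). Matches that differ only by an automorphism of the pattern
    map to the same edge set and are therefore reported once.
    """
    data = {frozenset(e) for e in data_edges}
    data_nodes = sorted({v for e in data_edges for v in e})
    pattern_nodes = sorted({v for e in pattern_edges for v in e})

    matches = set()
    # Try every injective assignment of pattern vertices to data vertices.
    for image in permutations(data_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, image))
        mapped = frozenset(
            frozenset((mapping[a], mapping[b])) for a, b in pattern_edges
        )
        # Keep the assignment if every pattern edge lands on a data edge.
        if mapped <= data:
            matches.add(mapped)
    return matches


if __name__ == "__main__":
    graph = [(1, 2), (2, 3), (1, 3), (3, 4)]
    triangle = [("a", "b"), ("b", "c"), ("a", "c")]
    print(len(enumerate_subgraphs(graph, triangle)))
```

The number of candidate mappings grows factorially with the pattern size, which is why practical systems decompose the pattern into join units and join their partial matches instead.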
international conference on data engineering | 2017
Jianye Yang; Wenjie Zhang; Shiyu Yang; Ying Zhang; Xuemin Lin
In this paper, we study the problem of set containment join. Given two collections $R$ and $S$ of records, the set containment join $R \bowtie_{\subseteq} S$ retrieves all record pairs $\{(r, s)\} \in R \times S$ such that $r \subseteq s$. This problem has been extensively studied in the literature and has many important applications in commercial and scientific fields. Recent research focuses on in-memory set containment join algorithms, and several techniques have been developed following intersection-oriented or union-oriented computing paradigms. Nevertheless, we observe that both computing paradigms have their limitations due to the nature of the intersection and union operators. In particular, the intersection-oriented method relies on the intersection of the relevant inverted lists built on the elements of $S$. A nice property of the intersection-oriented method is that the join computation is verification free. However, the number of records explored during the join process may be large because there are multiple replicas for each record in $S$. On the other hand, the union-oriented method generates a signature for each record in $R$, and the candidate pairs are obtained by the union of the inverted lists of the relevant signatures. The candidate size of the union-oriented method is usually small because each record contributes only one replica in the index. Unfortunately, the union-oriented method needs to verify the candidate pairs, which may be expensive, especially when the join result size is large. As a matter of fact, the state-of-the-art union-oriented solution is not competitive compared to the intersection-oriented ones. In this paper, we propose a new union-oriented method, namely TT-Join, which not only enhances the advantage of the previous union-oriented methods but also integrates the strengths of intersection-oriented methods by imposing a variant of the prefix tree structure. We conduct extensive experiments on 20 real-life datasets by comparing our method with 7 existing methods. The experimental results demonstrate that TT-Join significantly outperforms the existing algorithms on most of the datasets, and can achieve up to two orders of magnitude speedup.
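The intersection-oriented paradigm described in the abstract above can be sketched in a few lines. This is not TT-Join, and it omits the signatures and prefix tree the abstract mentions; it only shows why intersecting inverted lists built on the elements of S needs no verification step. The function name `containment_join` and the use of Python sets as records are assumptions for illustration.

```python
from collections import defaultdict


def containment_join(R, S):
    """Return all (rid, sid) pairs such that R[rid] is a subset of S[sid].

    R, S: lists of sets. An inverted index maps each element to the ids of
    the records in S that contain it; a record s is replicated once per
    element, which is the replication cost the abstract refers to.
    """
    index = defaultdict(set)
    for sid, s in enumerate(S):
        for element in s:
            index[element].add(sid)

    pairs = []
    for rid, r in enumerate(R):
        if not r:
            # The empty set is contained in every record of S.
            pairs.extend((rid, sid) for sid in range(len(S)))
            continue
        # Intersect the shortest lists first so candidates shrink quickly.
        lists = sorted((index.get(e, set()) for e in r), key=len)
        candidates = set(lists[0])
        for posting in lists[1:]:
            candidates &= posting
            if not candidates:
                break
        # Every surviving sid contains all elements of r: no verification.
        pairs.extend((rid, sid) for sid in sorted(candidates))
    return pairs


if __name__ == "__main__":
    R = [{1, 2}, {3}, {5}]
    S = [{1, 2, 3}, {2, 3}, {1, 2}]
    print(containment_join(R, S))
```

A union-oriented method would instead index one signature per record and verify each candidate pair afterwards, trading a smaller index for the verification cost.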
very large data bases | 2018
Jianye Yang; Wenjie Zhang; Shiyu Yang; Ying Zhang; Xuemin Lin; Long Yuan
In this paper, we study the problem of set containment join. Given two collections
australasian database conference | 2017
Deming Chu; Zhitao Shen; Yu Zhang; Shiyu Yang; Xuemin Lin
australasian database conference | 2016
Xiang Wang; Shiyu Yang; Ying Zhang
australasian database conference | 2016
Peihao Tong; Junjie Yao; Liping Wang; Shiyu Yang
The Computer Journal | 2015
Shiyu Yang; Muhammad Aamir Cheema; Xuemin Lin