Xixian Han
Harbin Institute of Technology
Publications
Featured research published by Xixian Han.
IEEE Transactions on Knowledge and Data Engineering | 2013
Xixian Han; Donghua Yang; Jinbao Wang
Skyline is an important operation in many applications to return a set of interesting points from a potentially huge data space. Given a table, the operation finds all tuples that are not dominated by any other tuple. Existing algorithms cannot process skyline queries on big data efficiently. This paper presents SSPL, a novel skyline algorithm for big data. SSPL utilizes sorted positional index lists, which require low space overhead, to reduce I/O cost significantly. A sorted positional index list Lj is constructed for each attribute Aj and is arranged in ascending order of Aj. SSPL consists of two phases. In phase 1, SSPL computes the scan depth of the involved sorted positional index lists. While retrieving the lists in a round-robin fashion, SSPL prunes any candidate positional index whose corresponding tuple is not a skyline result. Phase 1 ends when a candidate positional index has been seen in all of the involved lists. In phase 2, SSPL uses the obtained candidate positional indexes to get the skyline results by a selective and sequential scan on the table. Experimental results on synthetic and real data sets show that SSPL has a significant advantage over the existing skyline algorithms.
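The two-phase idea in this abstract can be illustrated with a small in-memory sketch. It is not the authors' SSPL implementation: it assumes that smaller attribute values are preferred, that no two tuples tie on any attribute, and it leaves out the phase 1 pruning step and all I/O concerns. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]

def dominates(a: Row, b: Row) -> bool:
    """a dominates b: no worse on every attribute, strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def phase1_candidates(table: List[Row]) -> Tuple[List[int], int]:
    """Round-robin over per-attribute sorted positional index lists.

    Stops as soon as one positional index has been seen in every list and
    returns (sorted candidate positions, scan depth). Assumes a non-empty
    table with distinct values per attribute.
    """
    m = len(table[0])
    # One list per attribute: tuple positions in ascending order of that attribute.
    lists = [sorted(range(len(table)), key=lambda i: table[i][j]) for j in range(m)]
    seen_count = {}
    for depth in range(len(table)):
        for lst in lists:
            pos = lst[depth]
            seen_count[pos] = seen_count.get(pos, 0) + 1
            if seen_count[pos] == m:
                return sorted(seen_count), depth + 1
    return sorted(seen_count), len(table)

def phase2_skyline(table: List[Row], candidates: List[int]) -> List[int]:
    """Selective sequential scan: read only candidate rows, keep the non-dominated ones."""
    wanted = set(candidates)
    rows = [(i, row) for i, row in enumerate(table) if i in wanted]
    return [i for i, r in rows if not any(dominates(o, r) for _, o in rows)]

table = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
candidates, depth = phase1_candidates(table)
skyline_positions = phase2_skyline(table, candidates)  # positions [0, 1, 3]
```

Under the no-ties assumption, every tuple not yet seen when phase 1 stops is dominated by the tuple that appeared in all lists, so restricting phase 2 to the candidates loses no skyline tuple.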
Knowledge and Information Systems | 2015
Xixian Han; Hong Gao
In many applications including multicriteria decision making, top-k dominating query is a practically useful tool to return k tuples with the highest domination scores in a potentially huge data space. The existing algorithms, either requiring indexes built on the specific attribute subset, or incurring high I/O cost or memory cost, cannot process top-k dominating query on massive data efficiently. In this paper, a novel algorithm TDEP is proposed to utilize sorted lists built for each attribute with low cost to return top-k dominating result on massive data efficiently. Through analysis, it is found that TDEP can be divided into two phases: growing phase and shrinking phase. In each phase, TDEP retrieves the sorted lists in round-robin fashion and maintains the candidates until the stop condition is satisfied. The theoretical analysis is provided for the execution behavior in two phases. An efficient method is developed to compute the domination scores of tuples with the obtained candidates only. Besides, TDEP adopts early pruning to reduce the number of candidate tuples maintained significantly. The extensive experimental results, conducted on synthetic and real-life data sets, show the significant performance advantage of TDEP over the existing algorithms.
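To make the query itself concrete, here is a naive quadratic baseline, assuming the usual definition of a domination score (the number of other tuples a tuple dominates) and that smaller attribute values are preferred. It is a sketch of the query semantics only, not of TDEP; the function names are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple

Row = Sequence[float]

def dominates(a: Row, b: Row) -> bool:
    """a dominates b: no worse on every attribute, strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def top_k_dominating(table: List[Row], k: int) -> List[Tuple[int, int]]:
    """Return (position, domination score) for the k highest-scoring tuples.

    Compares every pair of tuples, so it costs O(n^2) dominance checks.
    Ties on score are broken in favour of the smaller position.
    """
    scored = [(i, sum(dominates(t, o) for o in table)) for i, t in enumerate(table)]
    return heapq.nlargest(k, scored, key=lambda s: (s[1], -s[0]))

table = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
best = top_k_dominating(table, 2)  # [(1, 2), (2, 1)]
```

The quadratic cost of this baseline is the kind of overhead that makes the problem hard on massive data.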
Future Generation Computer Systems | 2013
Donghua Yang; Yuqiang Feng; Ye Yuan; Xixian Han; Jinbao Wang
Ad-hoc aggregate queries are extremely important for query-intensive applications in cloud computing, which extract valuable summary information from massive data sets to help decision-makers make the right decisions. Current data storage schemes (row-store and column-store) cannot efficiently answer ad-hoc aggregate queries on massive data sets in cloud computing. A new data storage structure (bit vector storage structure, bit-store for short) is proposed in this paper. The paper focuses on ad-hoc aggregate query algorithms based on bit-store. Firstly, the storage model of bit-store, including its attribute encoding schemes and bit file organization, is introduced. Secondly, different aggregate operations for query processing are presented based on different encoding schemes. Thirdly, cost analysis for the different aggregate operations is presented. Finally, the effectiveness and efficiency of the proposed algorithms are shown by analytical and experimental results.
Knowledge and Information Systems | 2012
Xixian Han; Donghua Yang
The ratio of disk capacity to disk transfer rate typically increases by 10× per decade. As a result, disk is becoming slower from the view of applications because of the much larger data volume that they need to store and process. In database systems, the less data volume is involved in query processing, the better the performance. Disk-based join is a common but time-consuming database operation, especially in an environment of massive data in which I/O cost dominates the execution time. However, current join algorithms are only suitable for moderate or small data volumes. They incur high I/O cost on massive data because of multi-pass I/O operations on the joined tables and their insensitivity to join selectivity. This paper proposes PI-Join, a novel disk-based join algorithm that can efficiently process join queries involving massive data. PI-Join consists of two stages: JPIPT construction stage (JCS) and result output stage (ROS). JCS performs a cache-conscious construction algorithm on the join attributes, which are kept in a column-oriented model, to obtain the join positional index pair table (JPIPT) of the join results faster. The obtained JPIPT is used in ROS to retrieve results in a one-pass sequential selective scan on each table. We provide the correctness proof and cost analysis of PI-Join. Our experimental results indicate that PI-Join has a significant advantage over the existing join algorithms.
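The two stages can be sketched in memory as follows. This is only an illustration of the data flow: an ordinary hash-based match stands in for the cache-conscious construction algorithm, lists stand in for disk-resident tables, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple

def build_jpipt(col_r: Sequence[Any], col_s: Sequence[Any]) -> List[Tuple[int, int]]:
    """Stage 1: from the two join columns only, compute (position in R, position in S) pairs."""
    positions: Dict[Any, List[int]] = defaultdict(list)
    for j, value in enumerate(col_s):
        positions[value].append(j)
    return [(i, j) for i, value in enumerate(col_r) for j in positions.get(value, [])]

def fetch_join_results(table_r: List[tuple], table_s: List[tuple],
                       jpipt: List[Tuple[int, int]]) -> List[tuple]:
    """Stage 2: one sequential pass per table, reading only rows referenced by the pairs."""
    need_r = {i for i, _ in jpipt}
    need_s = {j for _, j in jpipt}
    rows_r = {i: row for i, row in enumerate(table_r) if i in need_r}
    rows_s = {j: row for j, row in enumerate(table_s) if j in need_s}
    return [rows_r[i] + rows_s[j] for i, j in jpipt]

table_r = [("a", 1), ("b", 2), ("c", 3)]
table_s = [(2, "x"), (3, "y"), (2, "z")]
# Column-oriented copies of the join attributes (second field of R, first field of S).
pairs = build_jpipt([r[1] for r in table_r], [s[0] for s in table_s])
joined = fetch_join_results(table_r, table_s, pairs)
```

Because stage 1 touches only the join columns, rows that do not participate in the join are never read in full, which is why the approach is sensitive to join selectivity.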
Information Sciences | 2014
Xixian Han; Hong Gao
Join aggregation is an important operation in database systems to return aggregate information on the join of two or several tables. Compared with an exact query, it is often a better choice to return an approximate result that satisfies a user-specified confidence interval in a much shorter response time. None of the previous works can efficiently process approximate join aggregation on massive data with arbitrary accuracy. This paper proposes a novel algorithm, pe-AJA ((p, e)-Approximate Join Aggregation), to obtain approximate join aggregate results with an arbitrary confidence interval efficiently. Two data structures of low space overhead, JRS and JPIPT, are presented in this paper. pe-AJA first makes use of JRS to return a quick response. If the approximate result computed by JRS does not satisfy the given confidence interval, JPIPT is exploited to obtain enough random join tuples. This paper presents a novel sampling algorithm to acquire random JPIPT tuples of a specified size and provides its correctness proof. A tuple fetching method is proposed to retrieve join tuples by the sampled JPIPT tuples in a one-pass sequential scan on the joined tables. The construction and maintenance algorithms of JPIPT and JRS are also provided in this paper. The experimental results show that pe-AJA obtains 3 times to 2 orders of magnitude speedup over the existing algorithms and runs 1 to 4 orders of magnitude faster than exact query.
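The general idea of sampling until a confidence interval is tight enough can be sketched as below. The abstract does not spell out the meaning of p and e, so this sketch assumes p is a confidence level and e a relative error bound, and it uses a textbook normal-approximation interval for an average over values that are already available in memory. It does not model JRS, JPIPT, or the paper's sampling algorithm; the function name and parameters are hypothetical.

```python
import random
import statistics
from statistics import NormalDist
from typing import List, Tuple

def approximate_avg(values: List[float], p: float, e: float,
                    batch: int = 100, seed: int = 0) -> Tuple[float, float, int]:
    """Estimate the average of `values` from a growing random sample.

    Stops once the half-width of the confidence interval at level p is at
    most e * |estimate|, or when all values have been read.
    Returns (estimate, half-width, sample size).
    """
    if len(values) < 2 or batch < 2:
        raise ValueError("need at least two values and batch >= 2")
    z = NormalDist().inv_cdf(0.5 + p / 2)  # two-sided normal quantile
    order = list(range(len(values)))
    random.Random(seed).shuffle(order)     # sampling without replacement
    sample: List[float] = []
    for start in range(0, len(order), batch):
        sample.extend(values[i] for i in order[start:start + batch])
        mean = statistics.fmean(sample)
        half = z * statistics.stdev(sample) / len(sample) ** 0.5
        if half <= e * abs(mean):
            break
    return mean, half, len(sample)

values = [float(x % 10) for x in range(10000)]
estimate, half_width, n = approximate_avg(values, p=0.95, e=0.05)
```

The sample grows in batches so that the stop condition is re-evaluated only occasionally, which mirrors the pattern of returning a quick first answer and refining it only when the requested accuracy is not yet met.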
IEEE Transactions on Knowledge and Data Engineering | 2015
Xixian Han; Hong Gao
In many applications, top-k query is an important operation to return a set of interesting points in a potentially huge data space. This paper shows that the existing algorithms cannot process top-k queries on massive data efficiently. It proposes a novel table-scan-based algorithm, T2S, to efficiently compute top-k results on massive data. T2S first constructs the presorted table, whose tuples are arranged in the order of the round-robin retrieval on the sorted lists. T2S maintains only a fixed number of tuples to compute results. The early termination checking for T2S is presented in this paper, along with the analysis of scan depth. Selective retrieval is devised to skip the tuples in the presorted table that are not top-k results. The theoretical analysis proves that selective retrieval can reduce the number of retrieved tuples significantly. The construction and incremental-update/batch-processing methods for the used structures are proposed in this paper. The extensive experimental results, conducted on synthetic and real-life data sets, show that T2S has a significant advantage over the existing algorithms.
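Round-robin retrieval on sorted lists with early termination can be illustrated with a threshold-style top-k scan in the spirit of the classic threshold algorithm. This is not T2S: there is no presorted table or selective retrieval here. The sketch assumes the ranking function is the sum of the attributes, that smaller scores are better, and that only k tuples are kept at any time; the function name is hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple

def top_k_sorted_lists(table: List[Sequence[float]], k: int) -> List[Tuple[float, int]]:
    """Return the k smallest (score, position) pairs, with score = sum of attributes."""
    m = len(table[0])
    # One sorted list per attribute: positions in ascending order of that attribute.
    lists = [sorted(range(len(table)), key=lambda i: table[i][j]) for j in range(m)]
    seen = set()
    heap: List[Tuple[float, int]] = []  # min-heap on negated score: heap[0] is the worst kept tuple
    for depth in range(len(table)):
        for lst in lists:
            pos = lst[depth]
            if pos in seen:
                continue
            seen.add(pos)
            item = (-sum(table[pos]), pos)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
        # No unseen tuple can score below the sum of the values at the current depth.
        threshold = sum(table[lst[depth]][j] for j, lst in enumerate(lists))
        if len(heap) == k and -heap[0][0] <= threshold:
            break
    return sorted((-neg, pos) for neg, pos in heap)

table = [(1, 9), (2, 2), (3, 4), (8, 1), (6, 6)]
best = top_k_sorted_lists(table, 2)  # [(4, 1), (7, 2)]
```

On this input the scan stops at depth 3 of 5, because the worst kept score (7) no longer exceeds the threshold (3 + 4).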
Information Sciences | 2013
Xixian Han; Jinbao Wang; Donghua Yang
In many applications, top-k join is an important operation to return the k most important join tuples among the potentially huge answer space according to a given ranking function. PBRJ is an algorithm template that generalizes previous top-k join algorithms. In this paper, our analysis shows that PBRJ needs to maintain a large quantity of candidate tuples on massive data. Based on the analysis, this paper proposes a novel top-k join algorithm TJJE which is suitable for handling massive data. By some pre-computed information, TJJE first estimates an upper-bound on scan depth of each joined table. Then it determines the file that contains the join positional index pairs of the top-k join results. A novel algorithm is proposed to retrieve the required join tuples by a single sequential and selective scan on the joined tables. Finally, the top-k join results are obtained by a single scan on the retrieved join tuples. The correctness proof and cost analysis of TJJE are presented in this paper. Extensive experiments show that TJJE maintains up to three orders of magnitude fewer candidate tuples and obtains up to one order of magnitude speedup compared to PBRJ.
Knowledge and Information Systems | 2016
Xixian Han; Xianmin Liu; Hong Gao
In many applications, top-k query is an important operation to return a set of interesting points in a potentially huge data space. The existing algorithms, either maintaining too many candidates, or requiring assistant structures built on the specific attribute subset, or returning results with probabilistic guarantee, cannot process top-k query on massive data efficiently. This paper proposes a sorted-list-based TKAP algorithm, which utilizes some data structures of low space overhead, to efficiently compute top-k results on massive data. In round-robin retrieval on sorted lists, TKAP performs adaptive pruning operation and maintains the required candidates until the stop condition is satisfied. The adaptive pruning operation can be adjusted by the information obtained in round-robin retrieval to achieve a better pruning effect. The adaptive pruning rule is developed in this paper, along with its theoretical analysis. The extensive experimental results, conducted on synthetic and real-life data sets, show the significant advantage of TKAP over the existing algorithms.
Knowledge and Information Systems | 2015
Xixian Han; Hong Gao; Chengyu Yang
Skyline join is an important operation in many applications to return all join tuples that are not dominated by any other join tuple. Existing algorithms cannot process skyline join on massive data efficiently. This paper presents SEPT, a novel skyline join algorithm for massive data. SEPT utilizes sorted positional index lists with join information, which require low space overhead, to reduce I/O cost significantly. A sorted positional index list is constructed for each potential skyline attribute in the joined tables and is arranged in ascending order of the attribute. SEPT consists of two phases. In phase one, SEPT obtains candidate join positional index pairs of skyline join results. While retrieving the sorted positional index lists, SEPT prunes the candidate join positional index pairs whose corresponding join tuples are not skyline join results. In phase two, SEPT uses the obtained candidate join positional index pairs to get skyline join results by a selective and sequential scan on the tables. The experimental results on synthetic and real data sets show that SEPT has a significant advantage over the existing skyline join algorithms.
Conference on Information and Knowledge Management | 2017
Kaiqi Zhang; Hong Gao; Xixian Han; Zhipeng Cai
The skyline query is important in the database community. In recent years, research on incomplete data has received increasing attention, especially for the skyline query. However, the existing skyline definition on incomplete data cannot provide users with valuable references. In this paper, we propose a novel skyline definition that uses a probabilistic model on incomplete data, where each point has a probability of being in the skyline. In particular, it returns the K points with the highest skyline probabilities. Computing the probabilistic skyline on incomplete data is a big challenge. We propose an efficient algorithm, PISkyline, which utilizes two pruning strategies to reduce the number of points and adopts two optimizations to accelerate probability computation for each point. Nevertheless, PISkyline is sensitive to the order of the input data, and there is still a great deal of room for optimization. We develop a point-level sorting technique that adjusts the order of accessing points to further improve the efficiency of PISkyline. Our experimental results demonstrate that our algorithms are tens of times faster than the naive algorithm on both synthetic and real datasets.