
Jizhou Luo

Harbin Institute of Technology


Publication

Featured research published by Jizhou Luo.

Very Large Data Bases | 2008

Hash-base subgraph query processing method for graph-structured XML documents

Hongzhi Wang; Jizhou Luo; Hong Gao

When XML documents are modeled as graphs, many research issues arise. In particular, query processing on graph-structured XML documents poses new challenges because traditional query processing techniques for tree-structured XML documents cannot be applied directly. This paper studies the problem of structural queries on graph-structured XML documents. A hash-based structural join algorithm, HGJoin, is first proposed to handle reachability queries on graph-structured XML documents. It is then extended to algorithms that process structural queries in the form of bipartite graphs. Finally, based on these algorithms, a strategy to process subgraph queries in the form of general DAGs is proposed. Analysis and experiments show that all the algorithms have high performance. Notably, all of these algorithms can be slightly modified to process structural queries in the form of general graphs.
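The idea of answering a reachability query with a hash-based join can be illustrated with a minimal sketch. This is not HGJoin itself: the adjacency-dictionary graph, the breadth-first traversal, and the function names `reachable_from` and `hash_reachability_join` are all assumptions made for illustration. The sketch only shows how a hash table of candidate descendants lets each ancestor's reachable set be probed in constant time per node.

```python
from collections import deque


def reachable_from(graph, start):
    """Return every node reachable from `start` by following one or more edges."""
    seen = set()
    queue = deque(graph.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, ()))
    return seen


def hash_reachability_join(graph, ancestors, descendants):
    """Pair each candidate ancestor with the candidate descendants it can reach."""
    descendant_set = set(descendants)  # hash table used for membership probes
    result = []
    for a in ancestors:
        # Set intersection probes the hash table once per reachable node.
        for d in sorted(reachable_from(graph, a) & descendant_set):
            result.append((a, d))
    return result


# Hypothetical graph-structured document: the cycle a2 -> b2 -> a2 is allowed.
graph = {"a1": ["b1", "c1"], "c1": ["b2"], "a2": ["b2"], "b2": ["a2"]}
pairs = hash_reachability_join(graph, ["a1", "a2"], ["b1", "b2"])
```

A production algorithm would avoid recomputing reachability per ancestor, for example by using a precomputed reachability labeling; the traversal here is only the simplest correct stand-in.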

International Conference on Information Technology Coding and Computing | 2004

XCpaqs: compression of XML document with XPath query support

Hongzhi Wang; Jizhou Luo; Zhenying He

Information in XML format has obvious redundancy that wastes disk space, bandwidth, and disk I/O when XML data is queried. To store and query XML efficiently, it is necessary to compress XML data. This paper presents XCpaqs, a compression technique for XML. XCpaqs separates an XML document into structure and context information while keeping a homomorphism between the compressed and the original XML document. XCpaqs encodes tags and paths separately, which allows parts of an XPath query to be processed in main memory. XCpaqs recognizes data types and uses a different encoding strategy for each type. This feature lets the technique support XML documents without schema information. XCpaqs is therefore suitable for XML warehouses, which store XML documents with various schemas gathered from the Internet. Techniques for executing queries on XML data compressed by XCpaqs are also presented.
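The separation of structure from values can be sketched as follows. This is an illustrative assumption, not the XCpaqs encoding: the function name `split_structure_and_content`, the integer tag codes, and the per-path value containers are hypothetical, and the sketch ignores attributes and tail text. It shows only why grouping values by encoded path keeps values of the same kind together, which is what makes type-specific encoding possible.

```python
import xml.etree.ElementTree as ET


def split_structure_and_content(xml_text):
    """Split a document into a tag dictionary, a structure stream, and value containers."""
    root = ET.fromstring(xml_text)
    tag_codes = {}    # tag name -> small integer code
    structure = []    # sequence of ("open" | "close", tag code) events
    containers = {}   # encoded path (tuple of tag codes) -> list of text values

    def visit(elem, path):
        code = tag_codes.setdefault(elem.tag, len(tag_codes))
        path = path + (code,)
        structure.append(("open", code))
        text = (elem.text or "").strip()
        if text:
            # Values reached by the same path share one container.
            containers.setdefault(path, []).append(text)
        for child in elem:
            visit(child, path)
        structure.append(("close", code))

    visit(root, ())
    return tag_codes, structure, containers


doc = "<lib><book><year>2004</year></book><book><year>2008</year></book></lib>"
tag_codes, structure, containers = split_structure_and_content(doc)
```

In this example both year values land in the container for the path lib/book/year, so a numeric encoder could be applied to that container as a whole.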

International Conference for Young Computer Scientists | 2008

Efficient Top-k Keyword Search on XML Streams

Lingli Li; Hongzhi Wang; Jizhou Luo

Keywords can be used to query XML data without schema information. This paper proposes a novel kind of query, top-k keyword search over XML streams. Given a set of keywords and the number of results k, such a query retrieves the k XML data fragments most relevant to the keyword set. A novel ranking strategy for search results is proposed to capture the relevance of XML fragments to the query. To process top-k keyword queries on XML streams efficiently and effectively, a stack-based algorithm built on this ranking strategy dynamically maintains the k highest-ranked results at any time, with a filtering method that deletes redundant elements. Extensive experiments verify the effectiveness and efficiency of the algorithms presented in this paper.
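Keeping the k best results available at any point in a stream is commonly done with a bounded min-heap, sketched below. This covers only the result maintenance step; the stack-based matching, the ranking function, and the redundancy filter from the paper are not reproduced. The class name `TopK` and the precomputed scores are assumptions for illustration.

```python
import heapq


class TopK:
    """Maintain the k highest-scoring fragments seen so far in a stream."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self._heap = []   # min-heap of (score, arrival order, fragment)
        self._count = 0

    def offer(self, fragment, score):
        """Consider one scored fragment; keep it only if it belongs in the top k."""
        entry = (score, self._count, fragment)
        self._count += 1
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            # The new fragment beats the weakest kept result, so replace it.
            heapq.heapreplace(self._heap, entry)

    def current(self):
        """Return the kept fragments, best first, as (fragment, score) pairs."""
        return [(frag, score) for score, _, frag in sorted(self._heap, reverse=True)]


# Hypothetical fragments with scores assumed to come from some ranking function.
tracker = TopK(2)
for fragment, score in [("<a>", 0.2), ("<b>", 0.9), ("<c>", 0.5), ("<d>", 0.7)]:
    tracker.offer(fragment, score)
```

Each `offer` costs O(log k), and `current` can be called between arrivals, which matches the requirement that results be obtainable at any time.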

International Conference on Management of Data | 2007

InfiniteDB: a pc-cluster based parallel massive database management system

Hong Gao; Jizhou Luo; Shengfei Shi; Wei Zhang

This paper describes InfiniteDB, a PC-cluster based parallel DBMS developed by the authors. InfiniteDB aims to store and process massive databases efficiently, in response to the rapid growth in database size and the need for high-performance analysis of massive databases. It supports intra-query, inter-query, intra-operation, inter-operation, and pipelining parallelism. It provides effective strategies for processing massive databases, including multiple data declustering methods, declustering-aware algorithms for executing relational and other database operations, and an adaptive query optimization method. It also provides parallel data warehousing and data mining functions, a coordinator-wrapper mechanism to support the integration of heterogeneous information resources on the Internet, and fault-tolerant and resilient infrastructure. It has been used in many applications and has proved quite effective for storing and processing massive databases in practice.

Journal of Zhejiang University Science C | 2017

FrepJoin: an efficient partition-based algorithm for edit similarity join

Jizhou Luo; Sheng-fei Shi; Hongzhi Wang

String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. Existing algorithms usually adopt the filter-and-refine framework. They cannot capture the dissimilarity between string subsets and do not fully exploit statistics such as character frequencies. We develop a partition-based algorithm that uses such statistics. Frequency vectors are used to partition datasets into data chunks so that the dissimilarity between chunks is easy to detect. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter leverages the statistics to avoid computing edit distances for a noticeable proportion of the candidate pairs that survive the existing filters. Our algorithm notably outperforms alternative methods on real datasets.
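Character frequencies give a cheap lower bound on edit distance, which the following sketch uses as a filter before verification. This is a generic frequency-distance filter, not the FrepJoin partitioning scheme or its new filter; the function names and the nested-loop join are assumptions for illustration. The bound holds because a single insertion, deletion, or substitution can reduce the surplus of characters on either side by at most one.

```python
from collections import Counter


def frequency_lower_bound(s, t):
    """Lower bound on edit distance derived from character frequency vectors."""
    cs, ct = Counter(s), Counter(t)
    surplus_s = sum((cs - ct).values())  # characters s has in excess of t
    surplus_t = sum((ct - cs).values())  # characters t has in excess of s
    return max(surplus_s, surplus_t)


def edit_distance(s, t):
    """Classic dynamic-programming edit distance using two rows."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def similarity_join(strings, tau):
    """Return pairs within edit distance tau and the number of pairs verified."""
    results, verified = [], 0
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            s, t = strings[i], strings[j]
            if frequency_lower_bound(s, t) > tau:
                continue  # pruned without computing the edit distance
            verified += 1
            if edit_distance(s, t) <= tau:
                results.append((s, t))
    return results, verified


# Hypothetical data: "banana" is pruned by the frequency filter in every pair.
matches, verified = similarity_join(["kitten", "sitten", "mitten", "banana"], 1)
```

Of the six candidate pairs in this example, only three reach the quadratic-time edit distance computation.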

International Journal of Intelligent Information and Database Systems | 2008

Data sources selection for XML data sources

Hongzhi Wang; Jizhou Luo

In information integration systems, XML has become an important format for representing and exchanging information. Selecting useful data sources for a query is crucial for efficient query processing in such a system. This paper focuses on data source selection for XML data sources in an information integration system. For a query with both structural and value constraints, two kinds of indices, a constraint index and a structural index, are presented for data source selection. The former is grouped by values and captures the structure related to each value in a group. The latter summarises all the paths in the XML data sources. To reduce index size, index compacting and node selection strategies are presented. Based on these structures, efficient data source selection methods are designed. Extensive experiments demonstrate the efficiency and effectiveness of the structures and the data source selection strategies presented in this paper.

Archive | 2018

A Method to Identify Spark Important Parameters Based on Machine Learning

Tianyu Li; Shengfei Shi; Jizhou Luo; Hongzhi Wang

Apache Spark is the most popular open-source framework today; it uses an in-memory abstraction, the Resilient Distributed Dataset (RDD), to process large-scale data. Recently, research on performance prediction and optimization for the Spark platform has increased rapidly. However, in most work the selection of important configuration parameters still depends on the experience of domain experts. Therefore, selecting configuration parameters with machine learning algorithms is a non-trivial research issue. This paper proposes ISIP, a machine-learning-based method to identify important Spark parameters. By providing a relatively important subset of configuration parameters, ISIP reduces the parameter space for performance tuning on Spark, thereby saving the time and effort of users and researchers. ISIP uses the Mean-shift algorithm to cluster applications from Spark MLlib based on their workload characteristics. The relationship between performance and the configuration parameters is then modeled by a regression algorithm. In addition, a list of parameters ranked by importance is provided for each type of application. The subset of most important configuration parameters consists of the parameters at the front of the list. The experimental results show that tuning the subset of relatively important configuration parameters provided by ISIP achieves almost the same effect as tuning the complete parameter set.

Archive | 2018

An Anomaly Detection Method Based on Learning of “Scores Sequence”

Dongsheng Li; Shengfei Shi; Yan Zhang; Hongzhi Wang; Jizhou Luo

Anomaly detection is very important in the field of operation and maintenance (O&M). However, in O&M, we find that direct use of the existing anomaly detection algorithms often causes a large number of false positives, and the detection results are not stable. Noting a characteristic of O&M data, namely that many anomalies are anomalous time periods formed by consecutive anomaly points, we propose a novel concept, the “Scores Sequence”, and a method based on learning Scores Sequences. Our method produces fewer false positives, detects anomalies in a timely manner, and yields very stable detection results. Comparative experiments with many algorithms and a practical industrial application show that our method performs well and is well suited to anomaly detection in O&M.

Journal of Computer Science and Technology | 2018

O2iJoin: An Efficient Index-Based Algorithm for Overlap Interval Join

Jizhou Luo; Shengfei Shi; Guang Yang; Hongzhi Wang

Time intervals are often associated with tuples to represent their valid time in temporal relations, where overlap join is crucial for various kinds of queries. Many existing overlap join algorithms use indices based on tree structures such as the quad-tree, B+-tree, and interval tree. These algorithms usually have high CPU cost because deep path traversals are unavoidable, which makes them less competitive than data-partition or plane-sweep based algorithms. This paper proposes an efficient overlap join algorithm based on a new two-layer flat index named the Overlap Interval Inverted Index (O2i Index). The first layer uses an array to record the end points of intervals and approximates the nesting structures of intervals via two functions; the second layer uses inverted lists to trace all intervals satisfying the approximated nesting structures. With the help of the new index, the join algorithm visits only the must-be-scanned lists and skips all others. Analyses and experiments on both real and synthetic datasets show that the proposed algorithm is as competitive as the state-of-the-art algorithms.
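For readers unfamiliar with the overlap join operation itself, the following minimal sketch shows a plane-sweep baseline of the kind the abstract compares against. It is not the O2i Index or O2iJoin. It assumes closed intervals given as `(start, end)` tuples with `start <= end`, and the function name `overlap_join` is hypothetical.

```python
def overlap_join(left, right):
    """Return index pairs (i, j) where closed intervals left[i] and right[j] overlap."""
    # Merge both relations into one event list ordered by start point.
    events = sorted(
        [(s, e, 0, i) for i, (s, e) in enumerate(left)]
        + [(s, e, 1, j) for j, (s, e) in enumerate(right)]
    )
    active = ([], [])  # per relation: (end, index) of intervals that may still overlap
    pairs = []
    for start, end, side, idx in events:
        other = active[1 - side]
        # Intervals that ended before this start cannot overlap it or any later one.
        other[:] = [(e, k) for e, k in other if e >= start]
        for _, k in other:
            pairs.append((idx, k) if side == 0 else (k, idx))
        active[side].append((end, idx))
    return sorted(pairs)


# Hypothetical temporal relations with valid-time intervals.
left = [(1, 5), (6, 9)]
right = [(4, 7), (10, 12)]
pairs = overlap_join(left, right)
```

Each overlapping pair is reported exactly once, at the moment the later-starting interval of the pair is swept.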

Future Generation Computer Systems | 2018

A gray-box performance model for Apache Spark

Zemin Chao; Shengfei Shi; Hong Gao; Jizhou Luo; Hongzhi Wang

Apache Spark is a powerful open-source data processing platform. It is becoming increasingly popular with the growing need to process massive amounts of data. A performance prediction model not only helps administrators better understand system behavior but is also useful in performance tuning. However, given the complex application processing mechanism of Spark, it is not easy to model the relationship between system performance and configuration settings. In this paper, we present a gray-box performance model for Spark applications based on machine learning algorithms. Given a specific Spark application, the size of its input data, and some key system parameters, the model forecasts the application's execution time from historical information. To achieve better accuracy, our model takes basic hardware information and the resource allocation strategy of Spark into consideration. Our experimental results show that the gray-box model outperforms typical black-box approaches in most cases. We believe this model will be helpful for further research on Apache Spark.
