Zhenyu Dai
Guizhou University
Publications
Featured research published by Zhenyu Dai.
International Conference on Algorithms and Architectures for Parallel Processing | 2015
Hui Li; Nengjun Qiu; Mei Chen; Hongyuan Li; Zhenyu Dai; Ming Zhu; Menglin Huang
With the development of science and technology, the size and complexity of scientific data have increased rapidly, making efficient storage and parallel analysis of scientific data a major challenge. Previous techniques that combine a traditional relational database with analysis software often cannot meet the performance requirements of large-scale scientific data analysis. In this paper, we present FASTDB, a distributed array database system optimized for massive scientific data management that provides shared-nothing, parallel array processing. To demonstrate the intrinsic performance characteristics of FASTDB, we applied it to the interactive analysis of data from astronomical surveys and designed a series of experiments with scientific analysis tasks. The experimental results show that FASTDB can be significantly faster than the traditional database-backed SkyServer in many typical analytical scenarios.
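The paper does not publish FASTDB's API, but the shared-nothing idea it describes can be illustrated in a few lines: partition an array into chunks, aggregate each chunk independently in its own worker process, and let a coordinator merge the partial results. Everything below (the data, the chunk count, the aggregate) is an invented stand-in, not FASTDB code.

```python
# Illustrative shared-nothing aggregation: each worker owns one array
# chunk and aggregates it locally; the coordinator merges partial sums.
import numpy as np
from multiprocessing import Pool

def chunk_sum(chunk: np.ndarray) -> float:
    # Local aggregation executed independently on each "node" (process).
    return float(chunk.sum())

if __name__ == "__main__":
    data = np.random.rand(1_000_000)      # stand-in for survey measurements
    chunks = np.array_split(data, 4)      # shared-nothing partitioning

    with Pool(processes=4) as pool:
        partials = pool.map(chunk_sum, chunks)

    print("total:", sum(partials))        # coordinator combines the results
```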
International Conference on Algorithms and Architectures for Parallel Processing | 2015
Hui Li; Xiaohuan Hou; Mei Chen; Zhenyu Dai; Ming Zhu; Menglin Huang
To store and process data at large scale, distributed databases partition data and process it in parallel on distributed nodes in a cluster. When a database executes heterogeneous query workloads concurrently, performance prediction is needed. However, running queries in a distributed database incurs significant network overhead due to data transmission between cluster nodes. Hence, in this work we take network latency into account when predicting concurrent query performance. We propose a linear regression model to estimate query interactions when executing concurrent analytical workloads in a distributed database system. Since network latency and local processing overhead are the two most significant factors in query execution, we analyze query behavior with multivariate regression on both of them at different degrees of concurrency. In addition, we use sampling techniques to obtain various query mixes as the concurrency level increases. We evaluated our prediction model over a PostgreSQL database cluster with representative analytical workloads from TPC-H; the experimental results demonstrate that the model's query latency predictions keep the relative error within 14% on average.
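A minimal sketch of the regression idea, assuming synthetic measurements: predict concurrent query latency from network latency, local processing cost, and the concurrency level with ordinary least squares. The coefficients and noise model below are invented for illustration, not the paper's measured data.

```python
# Fit latency = w0 + w1*network + w2*local + w3*concurrency by least
# squares, then report the mean relative prediction error.
import numpy as np

rng = np.random.default_rng(0)
n = 200
net_ms   = rng.uniform(1, 50, n)      # network latency per query (ms)
local_ms = rng.uniform(10, 200, n)    # local processing cost (ms)
mpl      = rng.integers(1, 10, n)     # multiprogramming (concurrency) level

# Invented ground truth with noise, standing in for measured latencies.
latency = 1.5 * net_ms + 1.1 * local_ms + 4.0 * mpl + rng.normal(0, 5, n)

X = np.column_stack([np.ones(n), net_ms, local_ms, mpl])
w, *_ = np.linalg.lstsq(X, latency, rcond=None)

rel_err = np.abs(X @ w - latency) / latency
print(f"mean relative error: {100 * rel_err.mean():.2f}%")
```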
Archive | 2019
Yang Chen; Hui Li; Mei Chen; Zhenyu Dai; Huanjun Li; Ming Zhu
Feature selection is an important data analysis technique used to reduce the redundancy of features and exploit hidden information in high-dimensional data. In this paper we propose a similarity-metric-based feature selection method named Fesim. We use the Euclidean distance to measure the similarity among all features, and then apply the density-based DBSCAN algorithm to cluster features that are relevant to each other. Moreover, we present a strategy that accurately chooses representative features from each cluster. We conducted comprehensive experiments to evaluate the proposed approach, and the results on different datasets demonstrate its superiority.
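A minimal sketch of the Fesim pipeline using scikit-learn's DBSCAN: treat each feature (column) as a vector, cluster the features by Euclidean distance, and keep one representative per cluster. The representative rule used here (the member closest to the cluster centroid) and all parameter values are assumptions; the paper defines its own selection strategy.

```python
# Cluster features (not samples) with DBSCAN over Euclidean distance,
# then keep one representative feature per cluster plus all outliers.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(500, 20)                  # 500 samples, 20 features
feats = X.T                                  # one row per feature vector

labels = DBSCAN(eps=8.5, min_samples=2, metric="euclidean").fit_predict(feats)

selected = [i for i, lbl in enumerate(labels) if lbl == -1]   # noise: keep
for lbl in set(labels) - {-1}:
    members = np.where(labels == lbl)[0]
    centroid = feats[members].mean(axis=0)
    dists = np.linalg.norm(feats[members] - centroid, axis=1)
    selected.append(int(members[np.argmin(dists)]))           # representative

print("selected feature indices:", sorted(selected))
```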
Computer Science On-line Conference | 2018
Junpeng Zhu; Hui Li; Mei Chen; Zhenyu Dai; Ming Zhu
Sampling has become one of the recent research focuses in graph-related fields. Most existing graph sampling algorithms tend to favor high-degree or low-degree nodes in complex networks because such networks are scale-free, i.e., node degrees follow a power-law distribution, so the degrees of the sampled nodes differ significantly. In this paper, we propose the idea of an approximate degree distribution and devise a stratification strategy based on it for complex networks. We also develop two graph sampling algorithms that combine a node selection method with the stratification strategy. The experimental results show that our sampling algorithms preserve several properties of different graphs and are more accurate than other algorithms. Further, we show that the proposed algorithms are superior to off-the-shelf algorithms in terms of degree unbiasedness and more efficient than the state-of-the-art FFS and ES-i algorithms.
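A minimal sketch of degree-stratified node sampling on a scale-free graph, using networkx. The log-scale degree buckets and the uniform per-stratum rate are illustrative assumptions; the paper's algorithms derive their strata from an approximate degree distribution and combine them with a node selection method.

```python
# Stratify nodes by degree bucket, then sample the same fraction from
# every stratum so high- and low-degree nodes are both represented.
import random
import networkx as nx

G = nx.barabasi_albert_graph(n=10_000, m=3, seed=42)   # scale-free graph
rate = 0.1                                             # overall sample rate

strata = {}
for node, deg in G.degree():
    bucket = deg.bit_length()          # log-scale buckets: 1-1, 2-3, 4-7, ...
    strata.setdefault(bucket, []).append(node)

sampled = []
for nodes in strata.values():
    k = max(1, round(rate * len(nodes)))
    sampled.extend(random.sample(nodes, k))

S = G.subgraph(sampled)
print(S.number_of_nodes(), "nodes,", S.number_of_edges(), "edges sampled")
```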
Computer Science On-line Conference | 2018
Jianfeng Zhang; Hui Li; Mei Chen; Zhenyu Dai; Ming Zhu
To reduce the large network overhead and the heavy cost of cross-matching astronomical catalogs in a database cluster, we propose a novel cross-match method based on Roaring Bitmaps. First, we store the astronomical catalog data in column-oriented storage with compression enabled to reduce the I/O overhead of field access in the parallel database system. Second, we create a spatial index that maps 2D coordinates to integers, and then use Roaring Bitmaps to convert the spatial index into a bitmap index. Finally, the spatial range searches received for cross-matching are translated into bitmap operations to achieve batch processing. Experiments over real large-scale astronomical data show that the proposed method is 4 to 10 times faster than the traditional method while consuming less than 10% of the memory.
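A minimal sketch of the bitmap cross-match idea: grid the sky, encode each cell as one integer, store each catalog's occupied cells in a Roaring Bitmap, and reduce the range search to a bitmap intersection. The pyroaring library and the simple row-major cell index below are our stand-ins; the paper builds its own spatial index over the parallel database.

```python
# Candidate cross-match pairs live in grid cells occupied by both
# catalogs; only those survivors need an exact per-object distance check.
from pyroaring import BitMap

GRID = 3600  # cells per axis; a finer grid gives tighter candidate sets

def cell_id(ra: float, dec: float) -> int:
    # Map 2D sky coordinates (degrees) to a single integer cell index.
    x = int(ra / 360.0 * GRID) % GRID
    y = int((dec + 90.0) / 180.0 * GRID) % GRID
    return y * GRID + x

catalog_a = [(10.684, 41.269), (83.822, -5.391), (266.417, -29.008)]
catalog_b = [(10.684, 41.269), (201.365, -43.019)]

bitmap_a = BitMap(cell_id(ra, dec) for ra, dec in catalog_a)
bitmap_b = BitMap(cell_id(ra, dec) for ra, dec in catalog_b)

candidates = bitmap_a & bitmap_b       # range search as bitmap intersection
print("cells with potential matches:", list(candidates))
```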
Computer Science On-line Conference | 2018
Jianping Zhang; Hui Li; Xiaoping Zhang; Mei Chen; Zhenyu Dai; Ming Zhu
In heterogeneous data processing, varied data models often make it hard for analytic tasks to achieve optimal performance, so it is necessary to unify heterogeneous data under the same data model. How to determine the proper intermediate data model and unify the heterogeneous data models involved in an analytical task is an urgent problem. In this paper, we propose a model determination method based on cost estimation. It evaluates the execution cost of query tasks on different data models, takes that cost as the criterion for comparing the models, and chooses the data model with the least cost as the intermediate representation during data processing. Experimental results on BigBench datasets show that the proposed cost-estimation-based method can appropriately determine the data model, making heterogeneous data processing efficient.
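A minimal sketch of cost-based model determination: estimate the execution cost of a query task under each candidate data model and pick the cheapest as the intermediate representation. The model names and cost formulas below are invented placeholders, not the paper's cost model.

```python
# Pick the intermediate data model with the least estimated cost.
from typing import Callable, Dict

# Hypothetical per-model cost estimators over a simple task description.
estimators: Dict[str, Callable[[dict], float]] = {
    "relational": lambda q: 1.0 * q["joins"] + 0.2 * q["scans"],
    "document":   lambda q: 2.5 * q["joins"] + 0.1 * q["scans"],
    "graph":      lambda q: 0.3 * q["joins"] + 0.8 * q["scans"],
}

def choose_model(query: dict) -> str:
    return min(estimators, key=lambda m: estimators[m](query))

task = {"joins": 4, "scans": 10}
print("chosen intermediate model:", choose_model(task))
```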
Computer Science On-line Conference | 2018
Qingnan Zhao; Hui Li; Mei Chen; Zhenyu Dai; Ming Zhu
Data exploration has proved to be an efficient way to learn interesting new insights from a dataset intuitively. Typically, discovering interesting patterns and objects in a high-dimensional dataset is difficult because of the large search space. In this paper, we develop a data exploration method named Decision Analysis of Cross Clustering (DACC) based on subspace clustering. It characterizes data objects as decision trees over the partitioned clustering subspaces, which helps users quickly understand the patterns in the data and makes interactive exploration easier. We conducted a series of experiments on real-world datasets; the results show that DACC is superior to representative data exploration approaches in terms of efficiency and accuracy, and that it is applicable to interactive exploratory analysis of high-dimensional datasets.
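A minimal sketch of the cluster-then-describe idea behind DACC: cluster the data, then fit a shallow decision tree to the cluster labels so each cluster is summarized by a few readable split rules. KMeans stands in for the paper's subspace clustering step, which is not reproduced here.

```python
# Describe clusters with a depth-limited decision tree so the rules stay
# short enough for interactive exploration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(4, 1, (200, 5))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

tree = DecisionTreeClassifier(max_depth=2).fit(X, labels)
print(export_text(tree, feature_names=[f"f{i}" for i in range(5)]))
```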
Computer Science On-line Conference | 2018
Ruping Wang; Hui Li; Mei Chen; Zhenyu Dai; Ming Zhu
Clustering algorithms often use a distance measure to quantify the similarity between point pairs, which makes it difficult for them to cope with the curse of dimensionality in high-dimensional space. To address this common issue, we propose to replace the distance measure in the k-means clustering algorithm with the maximal information coefficient (MIC), and we implement the resulting MIC-kmeans algorithm for high-dimensional clustering. MIC-kmeans clusters data by correlation, avoiding the failure of distance measures in high-dimensional space. Experimental results on synthetic and real datasets show that MIC-kmeans is superior to the distance-based k-means clustering algorithm.
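A minimal sketch of a MIC-based k-means loop, assuming the minepy library for computing MIC (the paper does not name its implementation): points are assigned to the centroid with the highest MIC rather than the lowest distance, while centroids are still updated as cluster means.

```python
# k-means with MIC-based assignment: MIC between a point and a centroid
# treats the feature dimensions as paired samples, so it only makes
# sense when the dimensionality is reasonably high.
import numpy as np
from minepy import MINE

def mic(u: np.ndarray, v: np.ndarray) -> float:
    m = MINE()
    m.compute_score(u, v)
    return m.mic()

def mic_kmeans(X: np.ndarray, k: int, iters: int = 10) -> np.ndarray:
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each point to the most MIC-similar centroid.
        labels = np.array([np.argmax([mic(x, c) for c in centroids])
                           for x in X])
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

X = np.random.rand(100, 50)   # 100 points in a 50-dimensional space
print(np.bincount(mic_kmeans(X, k=3)))
```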
Archive | 2018
Li Tang; Hui Li; Mei Chen; Zhenyu Dai; Ming Zhu
In representative ETL software such as Informatica and DataStage, the ETL task schedulers support only timed scheduling; moreover, they neither take resource consumption into consideration nor make it easy for users to configure a resource utilization strategy, which often makes concurrent task scheduling inefficient. In this paper, we propose a concurrent ETL task scheduling approach named Aetsa, based on a long-term altruism strategy, to solve this problem. To ensure that critical jobs have the resources they need to execute efficiently, Aetsa can pause certain jobs temporarily and schedule other jobs at a higher priority; once appropriate resources become available for the paused jobs, Aetsa resumes them. We evaluated the efficiency of Aetsa in real medical data integration scenarios involving medical datasets from more than 1600 primary health care institutions in Guizhou province, China. Our experimental results show that Aetsa's average waiting time is very close to that of the well-known SJF scheduling solution, while compared with FCFS it achieves satisfactory improvements in both average response time and efficiency.
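A minimal sketch of the long-term altruism idea: when a critical job cannot get the resources it needs, running non-critical jobs are paused, and they are resumed once capacity frees up. The Job fields, the single-resource model, and the job names are illustrative; Aetsa's real policy is richer than this.

```python
# Pause-and-resume scheduling: non-critical jobs yield resources to a
# critical job, then come back when capacity is freed.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    need: int                 # resource units required
    critical: bool = False

class AltruisticScheduler:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.running: list[Job] = []
        self.paused: list[Job] = []

    def free(self) -> int:
        return self.capacity - sum(j.need for j in self.running)

    def submit(self, job: Job) -> None:
        if job.critical:
            # Altruism: pause non-critical jobs until the critical one fits.
            for victim in [j for j in self.running if not j.critical]:
                if self.free() >= job.need:
                    break
                self.running.remove(victim)
                self.paused.append(victim)
                print(f"paused {victim.name} for {job.name}")
        if self.free() >= job.need:
            self.running.append(job)
            print(f"running {job.name}")

    def finish(self, job: Job) -> None:
        self.running.remove(job)
        # Resume paused jobs that now fit into the freed capacity.
        for j in list(self.paused):
            if self.free() >= j.need:
                self.paused.remove(j)
                self.running.append(j)
                print(f"resumed {j.name}")

sched = AltruisticScheduler(capacity=10)
etl = Job("nightly-etl", need=8)
report = Job("patient-report", need=6, critical=True)
sched.submit(etl)       # runs
sched.submit(report)    # pauses the ETL job to make room
sched.finish(report)    # the ETL job is resumed
```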
International Conference on Security, Privacy and Anonymity in Computation, Communication and Storage | 2016
Shengtian Min; Hui Li; Mei Chen; Zhenyu Dai; Ming Zhu
Highly concurrent analytic applications such as SaaS-based BI services face the problem of meeting performance SLAs (Service Level Agreements) as the number of users and the degree of concurrency increase. To reduce task processing overhead and service response time, analytic applications tend to rely heavily on various main-memory data management and caching techniques. In this paper, we design a cost-conscious cache replacement approach named CRSR (Cost-conscious Result Sets Replacement), which takes task result sets as the essential data unit and replaces existing result sets according to a specialized cost estimation strategy. We conducted a series of evaluations comparing the proposed CRSR approach with representative cache management methods; the experiments show that in most cases the CRSR algorithm can efficiently reduce the response time of highly concurrent analysis services and outperform its competitors.
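A minimal sketch of cost-conscious result-set replacement: when the cache is full, evict the cached result set with the smallest estimated benefit, here taken as recomputation cost per byte. That benefit formula is our assumption; CRSR uses its own specialized cost estimation strategy.

```python
# Result-set cache that evicts by estimated benefit rather than recency.
from dataclasses import dataclass

@dataclass
class CachedResult:
    key: str
    size: int          # bytes occupied by the result set
    compute_ms: float  # cost to recompute the result set

class CostConsciousCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items: dict[str, CachedResult] = {}

    def _used(self) -> int:
        return sum(r.size for r in self.items.values())

    def put(self, result: CachedResult) -> None:
        # Evict the lowest-benefit results until the new one fits.
        while self._used() + result.size > self.capacity and self.items:
            victim = min(self.items.values(),
                         key=lambda r: r.compute_ms / r.size)
            del self.items[victim.key]
            print("evicted", victim.key)
        if result.size <= self.capacity:
            self.items[result.key] = result

cache = CostConsciousCache(capacity=1000)
cache.put(CachedResult("q1", size=600, compute_ms=50.0))
cache.put(CachedResult("q2", size=500, compute_ms=900.0))  # evicts q1
print("cached:", list(cache.items))
```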