Network


Latest external collaboration at the country level.

Hotspot


Dive into the research topics where Gongqing Wu is active.

Publication


Featured research published by Gongqing Wu.


IEEE Transactions on Knowledge and Data Engineering | 2014

Data mining with big data

Xindong Wu; Xingquan Zhu; Gongqing Wu; Wei Ding

Big Data concerns large-volume, complex, growing data sets with multiple, autonomous sources. With the rapid development of networking, data storage, and data collection capacity, Big Data is now expanding quickly in all science and engineering domains, including the physical, biological, and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution and proposes a Big Data processing model from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and in the Big Data revolution.


Hawaii International Conference on System Sciences | 2011

K-Means Clustering with Bagging and MapReduce

Haiguang Li; Gongqing Wu; Xuegang Hu; Jing Zhang; Lian Li; Xindong Wu

Clustering is one of the most widely used techniques for exploratory data analysis. Across all disciplines, from the social sciences and biology to computer science, people try to get a first intuition about their data by identifying meaningful groups among the data objects. K-means is one of the most famous clustering algorithms; its simplicity and speed allow it to run on large data sets. However, it also has several drawbacks: the algorithm is unstable and sensitive to outliers, and it becomes inefficient when dealing with large data sets. In this paper, a method is proposed to solve these problems, which uses bagging, an ensemble learning method, to overcome the instability and sensitivity to outliers, and the distributed computing framework MapReduce to address the inefficiency of clustering on large data sets. Extensive experiments show that our approach is efficient.
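
As a rough illustration of the bagging idea described above (the MapReduce distribution of the bootstrap runs is omitted), the sketch below trains K-means on several bootstrap samples and merges the resulting centroids by reclustering them. The use of scikit-learn and the merging-by-reclustering step are assumptions for illustration, not the paper's exact procedure.

    import numpy as np
    from sklearn.cluster import KMeans

    def bagged_kmeans(X, k, n_bags=10, seed=0):
        """Run K-means on bootstrap samples and combine the centroids.

        Each bootstrap run would correspond to one map task in a MapReduce
        setting; the combination step plays the reducer role."""
        rng = np.random.default_rng(seed)
        all_centroids = []
        for _ in range(n_bags):
            sample = X[rng.integers(0, len(X), size=len(X))]   # bootstrap sample
            km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(sample)
            all_centroids.append(km.cluster_centers_)
        # Combine: cluster the pooled centroids down to k final centers,
        # which damps the effect of outliers seen by any single run.
        pooled = np.vstack(all_centroids)
        return KMeans(n_clusters=k, n_init=5, random_state=0).fit(pooled).cluster_centers_

    # Example usage on synthetic data with three well-separated groups
    X = np.vstack([np.random.randn(200, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
    print(bagged_kmeans(X, k=3))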


Grid Computing | 2012

A Distributed Cache for Hadoop Distributed File System in Real-Time Cloud Services

Jing Zhang; Gongqing Wu; Xuegang Hu; Xindong Wu

Improving file access performance is a great challenge in real-time cloud services. In this paper, we analyze the preconditions for dealing with this problem, considering requirements, hardware, software, and network environments in the cloud. We then describe the design and implementation of a novel distributed layered cache system built on top of the Hadoop Distributed File System, named the HDFS-based Distributed Cache System (HDCache). The cache system consists of a client library and multiple cache services. The cache services are designed with three access layers: an in-memory cache, a snapshot of the local disk, and the actual disk view provided by HDFS. Files loaded from HDFS are cached in shared memory that can be accessed directly by the client library, and multiple applications integrating the client library can access a cache service simultaneously. Cache services are organized in a peer-to-peer style using a distributed hash table, and every cached file has three replicas on different cache service nodes to improve robustness and spread the workload. Experimental results show that the cache system can store files across a wide range of sizes and delivers millisecond-level access performance in highly concurrent environments.
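
One ingredient described above is placing each cached file on three cache-service nodes chosen from a distributed hash table. The toy consistent-hash ring below illustrates that placement step only; the node names are hypothetical, and the shared-memory layers and HDFS integration of the actual HDCache system are not modeled here.

    import hashlib
    from bisect import bisect

    class CacheRing:
        """Toy consistent-hash ring that maps a file path to 3 replica nodes."""

        def __init__(self, nodes, replicas=3):
            self.replicas = replicas
            self.ring = sorted((self._h(n), n) for n in nodes)   # (hash, node) pairs

        @staticmethod
        def _h(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def nodes_for(self, path):
            """Walk the ring clockwise from the file's hash and pick distinct nodes."""
            keys = [k for k, _ in self.ring]
            start = bisect(keys, self._h(path)) % len(self.ring)
            picked, offset = [], 0
            while len(picked) < min(self.replicas, len(self.ring)):
                node = self.ring[(start + offset) % len(self.ring)][1]
                offset += 1
                if node not in picked:
                    picked.append(node)
            return picked

    ring = CacheRing(["cache-1", "cache-2", "cache-3", "cache-4", "cache-5"])
    print(ring.nodes_for("/hdfs/data/logs/part-00042"))   # three distinct cache nodes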


ChinaGrid Annual Conference | 2009

MReC4.5: C4.5 Ensemble Classification with MapReduce

Gongqing Wu; Haiguang Li; Xuegang Hu; Yuanjun Bi; Jing Zhang; Xindong Wu

Classification is a significant technique in data mining research and applications. C4.5 is a widely used classification method, and ensemble learning adopts a parallel and distributed computing model for classification. Based on analyses of the MapReduce computing paradigm and the process of ensemble learning, we find that the parallel and distributed computing model in MapReduce is well suited to implementing ensemble learning. This paper takes advantage of C4.5, ensemble learning, and the MapReduce computing model, and proposes a new method, MReC4.5, for parallel and distributed ensemble classification. Our experimental results show that increasing the number of nodes benefits the effectiveness of classification modeling, and that serialization operations at the model level make the MReC4.5 classifier “construct once, use anywhere”.
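
To make the map/reduce split concrete, here is a single-machine sketch of the ensemble idea: each "map" call builds one decision tree on its data split and the "reduce" step collects the serialized models for majority voting. scikit-learn's CART implementation is used as a stand-in for C4.5, and the in-process functions only mimic the Hadoop programming model; this is an assumed illustration, not the MReC4.5 code.

    import pickle
    import numpy as np
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier

    def map_build_tree(split):
        """Map phase: train one tree on a data split, emit the serialized model."""
        X, y = split
        tree = DecisionTreeClassifier()      # CART stand-in for C4.5
        tree.fit(X, y)
        return pickle.dumps(tree)            # model-level serialization: build once, use anywhere

    def reduce_collect(models):
        """Reduce phase: gather all serialized trees into one ensemble."""
        return [pickle.loads(m) for m in models]

    def ensemble_predict(trees, X):
        """Majority vote over the trees' predictions."""
        votes = np.array([t.predict(X) for t in trees])
        return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

    # Example: pretend each chunk lives on a different node
    X = np.random.rand(300, 4); y = (X[:, 0] + X[:, 1] > 1).astype(int)
    splits = list(zip(np.array_split(X, 3), np.array_split(y, 3)))
    trees = reduce_collect([map_build_tree(s) for s in splits])
    print(ensemble_predict(trees, X[:5]), y[:5])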


IEEE Intelligent Systems | 2010

News Filtering and Summarization on the Web

Xindong Wu; Gongqing Wu; Fei Xie; Zhu Zhu; Xuegang Hu; Hao Lu; Huiqian Li

The news filtering and summarization (NFAS) system can automatically recognize Web news pages, retrieve each news page's title and content, and extract key phrases. This extraction method substantially outperforms methods based on term frequency and lexical chains.
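
The paper's own extraction method is not detailed in this abstract. Purely as a point of reference, the sketch below shows the simple term-frequency baseline that such key-phrase extractors are compared against; the stopword list and scoring are assumptions for illustration.

    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "on", "for", "is", "that"}

    def tf_keyphrases(text, top_k=5):
        """Term-frequency baseline: return the most frequent non-stopword words."""
        words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
        return [w for w, _ in Counter(words).most_common(top_k)]

    print(tf_keyphrases("Web news mining and news summarization on the Web for news readers"))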


IEEE Intelligent Systems | 2015

Knowledge Engineering with Big Data

Xindong Wu; Huanhuan Chen; Gongqing Wu; Jun Liu; Qinghua Zheng; Xiaofeng He; Aoying Zhou; Zhong-Qiu Zhao; Bifang Wei; Yang Li; Qiping Zhang; Shichao Zhang

In the era of big data, knowledge engineering faces fundamental challenges induced by fragmented knowledge from heterogeneous, autonomous sources with complex and evolving relationships. The knowledge representation, acquisition, and inference techniques developed in the 1970s and 1980s, driven by research and development of expert systems, must be updated to cope with both fragmented knowledge from multiple sources in the big data revolution and in-depth knowledge from domain experts. This article presents BigKE, a knowledge engineering framework that handles fragmented knowledge modeling and online learning from multiple information sources, nonlinear fusion on fragmented knowledge, and automated demand-driven knowledge navigation.


Journal of Computers | 2013

A Parallel Clustering Algorithm with MPI - MKmeans

Jing Zhang; Gongqing Wu; Xuegang Hu; Shiying Li; Shuilong Hao

Clustering is one of the most popular methods for exploratory data analysis and is prevalent in many disciplines, such as image segmentation, bioinformatics, pattern recognition, and statistics. The most famous clustering algorithm is K-means, owing to its easy implementation, simplicity, efficiency, and empirical success. However, real-world applications produce huge volumes of data, so efficiently handling these data in mining tasks has become a challenging and significant issue. In addition, MPI (Message Passing Interface), a message-passing programming model, offers high performance, scalability, and portability. Motivated by this, a parallel K-means clustering algorithm with MPI, called MKmeans, is proposed in this paper. The algorithm enables clustering to be applied effectively in a parallel environment. An experimental study demonstrates that MKmeans is relatively stable and portable, and that it runs with low time overhead on large data sets.

Index Terms: clustering, K-means algorithm, MPI, parallel computing
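
A minimal mpi4py sketch of the parallel structure described above: each rank assigns its local data block to the nearest centroids, and the partial sums and counts are combined with an all-reduce before updating the centroids. The variable names, initialization, and fixed iteration count are assumptions for illustration, not the MKmeans implementation.

    # Run with e.g.: mpiexec -n 4 python mkmeans_sketch.py
    import numpy as np
    from mpi4py import MPI

    def parallel_kmeans(local_X, k, iters=20):
        comm = MPI.COMM_WORLD
        # Rank 0 picks initial centroids and broadcasts them to every rank.
        centroids = local_X[:k].copy() if comm.rank == 0 else np.empty((k, local_X.shape[1]))
        comm.Bcast(centroids, root=0)
        for _ in range(iters):
            # Assign each local point to its nearest centroid.
            d = np.linalg.norm(local_X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Local partial sums and counts, then a global all-reduce.
            sums = np.zeros_like(centroids)
            counts = np.zeros(k)
            for j in range(k):
                sums[j] = local_X[labels == j].sum(axis=0)
                counts[j] = (labels == j).sum()
            comm.Allreduce(MPI.IN_PLACE, sums, op=MPI.SUM)
            comm.Allreduce(MPI.IN_PLACE, counts, op=MPI.SUM)
            centroids = sums / np.maximum(counts, 1)[:, None]
        return centroids

    if __name__ == "__main__":
        rng = np.random.default_rng(MPI.COMM_WORLD.rank)
        print(parallel_kmeans(rng.random((1000, 2)), k=3))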


International Symposium on Parallel Architectures, Algorithms and Programming | 2011

A Parallel K-Means Clustering Algorithm with MPI

Jing Zhang; Gongqing Wu; Xuegang Hu; Shiying Li; Shuilong Hao

Clustering is one of the most popular methods for data analysis and is prevalent in many disciplines, such as image segmentation, bioinformatics, pattern recognition, and statistics. The most popular and simplest clustering algorithm is K-means, owing to its easy implementation, simplicity, efficiency, and empirical success. However, real-world applications produce huge volumes of data, so efficiently handling these data in mining tasks has become a challenging and significant issue. In addition, MPI (Message Passing Interface), a message-passing programming model, offers high performance, scalability, and portability. Motivated by this, a parallel K-means clustering algorithm with MPI, called MKmeans, is proposed in this paper. The algorithm enables clustering to be applied effectively in a parallel environment. An experimental study demonstrates that MKmeans is relatively stable and portable, and that it runs with low time overhead on large data sets.


Journal of Computer Science and Technology | 2007

A semi-random multiple decision-tree algorithm for mining data streams

Xuegang Hu; Peipei Li; Xindong Wu; Gongqing Wu

Mining streaming data is a hot topic in data mining. When performing classification on data streams, traditional classification algorithms based on decision trees, such as ID3 and C4.5, have relatively poor time and space efficiency due to the characteristics of streaming data, whereas random decision trees offer advantages in both. This paper proposes SRMTDS (Semi-Random Multiple decision Trees for Data Streams), an incremental algorithm for mining data streams based on random decision trees. SRMTDS uses the Hoeffding bound inequality to choose the minimum number of split examples, a heuristic method to compute the information gain for obtaining the split thresholds of numerical attributes, and a Naïve Bayes classifier to estimate the class labels of tree leaves. Our extensive experimental study shows that SRMTDS improves on VFDTc, a state-of-the-art decision-tree algorithm for classifying data streams, in time, space, accuracy, and noise tolerance.
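
The Hoeffding bound mentioned above states that, with probability 1 - delta, the observed mean of a variable with range R over n examples lies within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of its true mean. The sketch below evaluates the bound and solves it for the minimum number of examples needed before a split decision; the concrete numbers are illustrative only and are not taken from the paper.

    import math

    def hoeffding_epsilon(R, delta, n):
        """Hoeffding bound: error margin after n samples of a variable with range R."""
        return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

    def min_split_examples(R, delta, epsilon):
        """Smallest n such that hoeffding_epsilon(R, delta, n) <= epsilon."""
        return math.ceil(R * R * math.log(1.0 / delta) / (2.0 * epsilon ** 2))

    # Illustrative numbers: information gain in [0, 1], 99% confidence,
    # tolerate a 0.05 gap between the best and second-best split attribute.
    print(hoeffding_epsilon(R=1.0, delta=0.01, n=200))
    print(min_split_examples(R=1.0, delta=0.01, epsilon=0.05))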


ChinaGrid Annual Conference | 2010

A 2-Tier Clustering Algorithm with Map-Reduce

Jing Zhang; Gongqing Wu; Haiguang Li; Xuegang Hu; Xindong Wu

Clustering is one of the important methods in data mining. K-Means is a typical distance-based clustering algorithm, and 2-tier clustering achieves scalable clustering by dividing, sampling, and integrating knowledge. Among distributed processing tools, Map-Reduce has been widely embraced by both academia and industry, and Hadoop is an open-source parallel and distributed programming framework that implements the Map-Reduce computing model. From our analysis of the Map-Reduce computing paradigm, we find that Hadoop's parallel and distributed computing model is well suited to implementing a scalable clustering algorithm. This paper takes advantage of K-Means, the 2-tier clustering mechanism, and the Map-Reduce computing model, and proposes a new method for parallel and distributed clustering to explore the distributed clustering problem based on Map-Reduce. The method aims to apply the clustering algorithm effectively in a distributed environment. Extensive studies demonstrate that the proposed algorithm is scalable, that its time performance is stable, and that adding cluster nodes improves the time performance of clustering.
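
A single-machine sketch of the 2-tier idea described above: tier 1 plays the map role, clustering each data partition locally; tier 2 plays the reduce role, clustering the local centroids, weighted by how many points they represent, into the final k centers. scikit-learn is used for both tiers and the Hadoop plumbing is omitted, so this is an assumption-laden illustration rather than the paper's implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def tier1_map(partition, k_local):
        """Tier 1 (map side): cluster one data partition locally and emit its
        centroids together with the number of points each one represents."""
        km = KMeans(n_clusters=k_local, n_init=5, random_state=0).fit(partition)
        counts = np.bincount(km.labels_, minlength=k_local)
        return km.cluster_centers_, counts

    def tier2_reduce(local_results, k_final):
        """Tier 2 (reduce side): cluster the weighted local centroids."""
        centers = np.vstack([c for c, _ in local_results])
        weights = np.concatenate([w for _, w in local_results])
        km = KMeans(n_clusters=k_final, n_init=5, random_state=0)
        km.fit(centers, sample_weight=weights)
        return km.cluster_centers_

    # Example: three partitions standing in for three map tasks
    X = np.random.rand(3000, 2)
    partitions = np.array_split(X, 3)
    local_results = [tier1_map(p, k_local=10) for p in partitions]
    print(tier2_reduce(local_results, k_final=5))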

Collaboration


Dive into Gongqing Wu's collaboration.

Top Co-Authors

Xindong Wu (University of Louisiana at Lafayette)
Xuegang Hu (Hefei University of Technology)
Jing Zhang (Nanjing University of Science and Technology)
Fei Xie (Hefei University of Technology)
Jun Gao (Hefei University of Technology)
Shiying Li (Hefei University of Technology)
Xiao-Li Hong (Hefei University of Technology)
Yingling Liu (Hefei University of Technology)
Wei Ding (University of Massachusetts Boston)