Yunfa Hu | Researchain

Publication

Featured researches published by Yunfa Hu.

pacific asia conference on knowledge discovery and data mining | 2000

Combining Sampling Technique with DBSCAN Algorithm for Clustering Large Spatial Databases

Shuigeng Zhou; Aoying Zhou; Jing Cao; Jin Wen; Ye Fan; Yunfa Hu

In this paper, we combine sampling techniques with the DBSCAN algorithm to cluster large spatial databases, and develop two sampling-based DBSCAN (SDBSCAN) algorithms. One algorithm introduces sampling inside DBSCAN; the other applies a sampling procedure outside DBSCAN. Experimental results demonstrate that our algorithms are effective and efficient in clustering large-scale spatial databases.
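The idea of sampling outside DBSCAN can be illustrated with a minimal sketch. This is not the authors' SDBSCAN: it assumes a plain O(n^2) DBSCAN run on a uniform random sample, followed by assigning every point to the cluster of its nearest clustered sample point within `eps`. The function names, parameters, and toy data are all hypothetical.

```python
import math
import random

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN; returns one label per point, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: expand
                queue.extend(j_neighbors)
        cluster += 1
    return labels

def sampled_dbscan(points, eps, min_pts, sample_size, seed=0):
    """Assumed scheme: cluster a random sample, then label all points
    by their nearest clustered sample point within eps."""
    rng = random.Random(seed)
    sample = [points[i] for i in rng.sample(range(len(points)), sample_size)]
    sample_labels = dbscan(sample, eps, min_pts)
    labels = []
    for p in points:
        best_dist, best_label = eps, -1
        for q, lab in zip(sample, sample_labels):
            d = math.dist(p, q)
            if lab != -1 and d <= best_dist:
                best_dist, best_label = d, lab
        labels.append(best_label)
    return labels

# Toy data: two well-separated square blobs of 100 points each.
gen = random.Random(1)
blob_a = [(gen.uniform(-0.5, 0.5), gen.uniform(-0.5, 0.5)) for _ in range(100)]
blob_b = [(10 + gen.uniform(-0.5, 0.5), 10 + gen.uniform(-0.5, 0.5)) for _ in range(100)]
points = blob_a + blob_b
labels = sampled_dbscan(points, eps=1.5, min_pts=3, sample_size=50)
```

Because only the sample goes through the quadratic clustering step, the expensive part scales with the sample size rather than the full data set. In practice `min_pts` would likely need to be scaled down with the sampling rate, which this sketch leaves to the caller.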

software engineering, artificial intelligence, networking and parallel/distributed computing | 2007

An Effective Method To Improve kNN Text Classifier

Xiulan Hao; Xiaopeng Tao; Chenghong Zhang; Yunfa Hu

Many standard classification algorithms assume that the training examples are evenly distributed among the classes. However, unbalanced data sets appear in many applications. As a simple, effective categorization method, kNN is widely used, but it too suffers from biased data sets. In developing the Prototype of Internet Information Security for Shanghai Council of Information and Security, we found that when the training data set is biased, almost all test documents of some rare categories are classified into common ones. To alleviate this problem, we propose a novel concept, the critical point (CP), and adapt traditional kNN by integrating the CP's approximate value (LB or UB) and the number of training examples into the decision rules. Extensive experiments show that the adapted kNN achieves significant classification performance improvement on biased corpora.

fuzzy systems and knowledge discovery | 2007

A Machine Learning Approach Classification of Deep Web Sources

HeXiang Xu; Chenghong Zhang; Xiulan Hao; Yunfa Hu

The classification of deep Web sources is an important area in large-scale deep Web integration, which is still at an early stage. Many deep Web sources are structured, providing structured query interfaces and results. Classifying such structured sources into domains is one of the critical steps toward the integration of heterogeneous Web sources. To date, existing classification work has mainly focused on texts or Web documents, and little of it addresses the deep Web. In this paper, we present a deep Web model and a machine-learning-based classification model. The experimental results show that we can achieve good performance with a small number of training samples for each domain, and that performance remains stable as the number of training samples increases.

annual acis international conference on computer and information science | 2008

Subspace Clustering of High Dimensional Data Streams

Shuyun Wang; Yingjie Fan; Chenghong Zhang; HeXiang Xu; Xiulan Hao; Yunfa Hu

In this paper, we present SOStream, a novel subspace-based algorithm for clustering high-dimensional online data streams. SOStream partitions the data space into grids and maintains a superset of all dense units in an online way. Deterministic lower and upper bounds on the selectivity of each maintained unit are also given. With the maintained potential dense units, SOStream is capable of discovering clusters of arbitrary shape in different subspaces of a high-dimensional data stream. The experimental results on real and synthetic datasets demonstrate the effectiveness of the approach.
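As a rough illustration of what a dense unit in a subspace means, the sketch below counts points per grid cell in a chosen subset of dimensions and keeps the cells that reach a count threshold. It is a batch computation under assumed parameters (`cell_width`, `min_count`), not SOStream's online maintenance or its selectivity bounds; the function name and toy data are hypothetical.

```python
from collections import Counter

def dense_units(points, dims, cell_width, min_count):
    """Return grid cells, in the subspace spanned by dims,
    that contain at least min_count points."""
    counts = Counter(
        # Map each point to the integer index of its cell in every chosen dimension.
        tuple(int(p[d] // cell_width) for d in dims)
        for p in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_count}

# Toy data: three points are close in dimensions 0 and 1 but spread out in dimension 2.
points = [
    (0.1, 0.2, 5.0),
    (0.3, 0.1, 9.0),
    (0.2, 0.4, 1.0),
    (3.5, 3.5, 2.0),
]
units_xy = dense_units(points, dims=(0, 1), cell_width=1.0, min_count=3)
units_z = dense_units(points, dims=(2,), cell_width=1.0, min_count=3)
```

Here the group is dense in the subspace of dimensions 0 and 1 but not along dimension 2 alone, which is the situation subspace clustering is meant to detect.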

international conference on machine learning and cybernetics | 2007

A Method of Deep Web Classification

HeXiang Xu; Xiulan Hao; Shuyun Wang; Yunfa Hu

Research on deep Web classification is an important area in large-scale deep Web integration, which is still at an early stage. Many deep Web sources are structured, providing structured query interfaces and results. Classifying such structured sources into domains is one of the critical steps toward the integration of heterogeneous Web sources. In this paper, we present an ontology-based deep Web classification method, which includes a category ontology model and a deep Web vector space model (VSM). The experimental results show that we can achieve good performance, with an average precision of 91.6% and an average recall of 92.4%.

annual acis international conference on computer and information science | 2008

An Improved Condensing Algorithm

Xiulan Hao; Chenghong Zhang; HeXiang Xu; Xiaopeng Tao; Shuyun Wang; Yunfa Hu

The kNN classifier is widely used in text categorization; however, kNN has large computational and storage requirements, and its performance also suffers from uneven distribution of the training data. Condensing techniques are commonly used to reduce noise in the training data and to decrease the cost in time and space. The traditional condensing technique picks samples at random during initialization. Though random sampling is one means of reducing outliers, excessive randomness can sometimes lead to poor performance; that is, the advantages of sampling may be suppressed. To avoid this, we propose a variation of the traditional condensing technique. Experimental results show that this strategy solves the above problems effectively.

annual acis international conference on computer and information science | 2008

Entropy Based Clustering of Data Streams with Mixed Numeric and Categorical Values

Shuyun Wang; Yingjie Fan; Chenghong Zhang; HeXiang Xu; Xiulan Hao; Yunfa Hu

In this paper, a novel algorithm for clustering data streams with mixed numeric and categorical attributes (CNC-Stream) is proposed. A new entropy-based similarity measure, which determines the similarity between objects (data points in the stream or micro-clusters in memory), is also presented; CNC-Stream relies on this measure. Experiments conducted on real and synthetic data sets show that the proposed method produces high-quality clusterings.

ieee international conference on high performance computing data and analytics | 2000

Hierarchically distributed data warehouse

Shuigeng Zhou; Aoying Zhou; Xiaopeng Tao; Yunfa Hu

Data warehousing (DW) is a rapidly developing field of both application and research. Up to now, three types of data warehouse have been proposed: centralized DW, data mart, and distributed DW. In this paper, a new type of data warehouse, called hierarchically distributed data warehouse (HDDW), is developed on the basis of a DW building case. HDDW-oriented OLAP is also studied and a C/S architecture for OLAP of HDDW is given.

international conference on emerging technologies | 2007

Finding frequent items in data streams using ESBF

Shuyun Wang; Xiulan Hao; HeXiang Xu; Yunfa Hu

In this paper, we introduce a novel data structure, ESBF (Extensible and Scalable Bloom Filter), and the algorithm FI-ESBF (Finding frequent Items using ESBF) for estimating the frequent items in data streams. FI-ESBF achieves high precision while using much less memory than the best reported algorithm, given the large number of distinct items in the stream. ESBF is an extension of the counting Bloom filter (CBF). It allows the amount of memory used to be adjusted dynamically according to the data distribution and the number of distinct items in the data streams; therefore, prior knowledge of the data distribution of the streams and of the number of distinct elements to be stored is not required.
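For readers unfamiliar with the counting Bloom filter that ESBF extends, here is a minimal sketch of a fixed-size CBF used to flag frequent items. It does not implement ESBF's dynamic resizing or the FI-ESBF algorithm; the class, the SHA-256-based hashing scheme, and the default sizes are assumptions made for illustration.

```python
import hashlib

class CountingBloomFilter:
    """Fixed-size counting Bloom filter (illustrative, not ESBF)."""

    def __init__(self, num_counters: int = 1024, num_hashes: int = 3):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _positions(self, item: str):
        # Derive num_hashes counter positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.counters)

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def estimate(self, item: str) -> int:
        # The minimum counter never underestimates the true count,
        # but hash collisions can make it overestimate.
        return min(self.counters[pos] for pos in self._positions(item))

def frequent_items(stream, threshold, cbf=None):
    """Return items whose estimated count reaches threshold."""
    if cbf is None:
        cbf = CountingBloomFilter()
    frequent = set()
    for item in stream:
        cbf.add(item)
        if cbf.estimate(item) >= threshold:
            frequent.add(item)
    return frequent

stream = ["a", "b", "a", "c", "a", "b"]
result = frequent_items(stream, threshold=3)
```

Because estimates can only err upward, every truly frequent item is reported, while a few infrequent items may be reported by mistake. A fixed `num_counters` must be chosen in advance, which is the limitation the abstract says ESBF removes.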

web age information management | 2000

Hierachically Classifying Chinese Web Documents without Dictionary Support and Segmentation Procedure

Shuigeng Zhou; Ye Fan; Jiangtao Hu; Fang Yu; Yunfa Hu

This paper reports a system that hierarchically classifies Chinese Web documents without dictionary support or a segmentation procedure. In our classifier, Web documents are represented by N-grams (N≤4) that are easy to extract. A boosting machine learning approach is applied to classifying Chinese Web documents that share a topic hierarchy. The open and modularized system architecture makes our classifier extensible. Experimental results show that our system can effectively and efficiently classify Chinese Web documents.
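Representing a document by N-grams with N≤4 can be sketched as counting every run of one to four consecutive characters, which needs neither a dictionary nor word segmentation. The function below is an assumed, minimal version of that representation, not the system's actual feature extractor; dropping whitespace before extraction is also an assumption.

```python
from collections import Counter

def char_ngrams(text: str, max_n: int = 4) -> Counter:
    """Count all character n-grams of length 1..max_n in text."""
    # Assumption: whitespace is dropped so n-grams do not span formatting gaps.
    chars = [c for c in text if not c.isspace()]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(chars) - n + 1):
            counts["".join(chars[i:i + n])] += 1
    return counts

features = char_ngrams("数据挖掘数据")
```

For this six-character string, the bigram "数据" is counted twice, and the resulting counts could serve as the feature vector passed to a classifier.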
