Xiulan Hao
Fudan University
Publication
Featured research published by Xiulan Hao.
software engineering, artificial intelligence, networking and parallel/distributed computing | 2007
Xiulan Hao; Xiaopeng Tao; Chenghong Zhang; Yunfa Hu
Many standard classification algorithms assume that training examples are evenly distributed among classes. However, unbalanced data sets appear in many applications. kNN is a simple, effective categorization method and is widely used, but it too suffers on biased data sets. While developing the Prototype of Internet Information Security for the Shanghai Council of Information and Security, we found that when the training set is biased, almost all test documents of some rare categories are classified into common ones. To alleviate this problem, we propose a novel concept, the critical point (CP), and adapt traditional kNN by integrating the CP's approximate value (LB or UB) and the number of training samples into the decision rules. Extensive experiments show that the adapted kNN significantly improves classification performance on biased corpora.
fuzzy systems and knowledge discovery | 2007
HeXiang Xu; Chenghong Zhang; Xiulan Hao; Yunfa Hu
The classification of deep Web sources is an important area in large-scale deep Web integration, which is still at an early stage. Many deep Web sources are structured, providing structured query interfaces and results. Classifying such structured sources into domains is one of the critical steps toward the integration of heterogeneous Web sources. To date, existing classification work has mainly focused on texts or Web documents, with little attention paid to the deep Web. In this paper, we present a deep Web model and a machine-learning-based classification model. The experimental results show that good performance can be achieved with a small number of training samples per domain, and that performance remains stable as the number of training samples increases.
annual acis international conference on computer and information science | 2008
Shuyun Wang; Yingjie Fan; Chenghong Zhang; HeXiang Xu; Xiulan Hao; Yunfa Hu
In this paper, we present SOStream, a novel subspace-based algorithm for clustering high-dimensional online data streams. SOStream partitions the data space into grids and maintains a superset of all dense units in an online way. Deterministic lower and upper bounds on the selectivity of each maintained unit are also given. With the maintained potential dense units, SOStream can discover clusters of arbitrary shape in different subspaces of a high-dimensional data stream. Experimental results on real and synthetic data sets demonstrate the effectiveness of the approach.
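The grid partitioning and dense-unit idea in the abstract above can be illustrated with a minimal batch sketch. This is not SOStream itself: the function name `dense_units` and the parameters `cell_width` and `min_points` are assumptions made for illustration, and the online maintenance and selectivity bounds described in the abstract are not modeled.

```python
from collections import Counter

def dense_units(points, subspace, cell_width, min_points):
    """Return grid cells in the chosen subspace holding at least min_points points.

    Hypothetical helper for illustration; it counts in one batch pass,
    whereas the abstract describes online maintenance.
    """
    counts = Counter()
    for p in points:
        # Project the point onto the subspace dimensions and map it to a grid cell.
        cell = tuple(int(p[d] // cell_width) for d in subspace)
        counts[cell] += 1
    return {cell: n for cell, n in counts.items() if n >= min_points}

points = [(0.1, 0.2, 5.0), (0.15, 0.9, 5.1), (0.12, 0.4, 9.7), (0.8, 0.3, 5.2)]
units_dim0 = dense_units(points, subspace=(0,), cell_width=0.25, min_points=3)
units_dim02 = dense_units(points, subspace=(0, 2), cell_width=0.25, min_points=2)
```

Three of the four points fall into the same cell along dimension 0, so that cell is dense in the one-dimensional subspace even though the points are spread out in the full space.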
international conference on machine learning and cybernetics | 2007
HeXiang Xu; Xiulan Hao; Shuyun Wang; Yunfa Hu
Research on deep Web classification is an important area in large-scale deep Web integration, which is still at an early stage. Many deep Web sources are structured, providing structured query interfaces and results. Classifying such structured sources into domains is one of the critical steps toward the integration of heterogeneous Web sources. In this paper, we present an ontology-based deep Web classification method, which includes a category ontology model and a deep Web vector space model (VSM). The experimental results show good performance, with an average precision of 91.6% and an average recall of 92.4%.
annual acis international conference on computer and information science | 2008
Xiulan Hao; Chenghong Zhang; HeXiang Xu; Xiaopeng Tao; Shuyun Wang; Yunfa Hu
The kNN classifier is widely used in text categorization. However, kNN has large computational and storage requirements, and its performance also suffers from unevenly distributed training data. Condensing techniques are commonly used to reduce noise in the training data and to cut time and space costs. The traditional condensing technique picks samples at random during initialization. Although random sampling is one way to reduce outliers, its highly stochastic nature can sometimes lead to poor performance, suppressing the advantages of sampling. To avoid this, we propose a variation of the traditional condensing technique. Experimental results show that this strategy solves the above problems effectively.
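For readers unfamiliar with condensing, the following is a minimal sketch of the traditional technique with random initialization, in the style of Hart's condensed nearest neighbor rule. It is not the variation proposed in the paper. The function names and the `seed` parameter are assumptions for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_label(store, x):
    # 1-NN prediction using only the retained samples.
    return min(store, key=lambda item: squared_distance(item[0], x))[1]

def condense(samples, seed=0):
    """Keep only the samples that the current store misclassifies with 1-NN."""
    rng = random.Random(seed)
    pool = list(samples)
    rng.shuffle(pool)      # random initialization order
    store = [pool[0]]      # start from one randomly chosen sample
    changed = True
    while changed:         # repeat passes until no sample is added
        changed = False
        for x, y in pool:
            if (x, y) in store:
                continue
            if nearest_label(store, x) != y:
                store.append((x, y))
                changed = True
    return store

samples = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b"),
]
store = condense(samples)
```

The retained set depends on the shuffle order, which is the source of the run-to-run variability the abstract refers to.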
annual acis international conference on computer and information science | 2008
Shuyun Wang; Yingjie Fan; Chenghong Zhang; HeXiang Xu; Xiulan Hao; Yunfa Hu
In this paper, a novel algorithm for clustering data streams with mixed numeric and categorical attributes (CNC-Stream) is proposed. A new entropy-based similarity measure, which determines the similarity between objects (data points in the stream or micro-clusters in memory), is also presented and is what makes CNC-Stream work. Experiments conducted on real and synthetic data sets show that the proposed method is of high quality.
international conference on emerging technologies | 2007
Shuyun Wang; Xiulan Hao; HeXiang Xu; Yunfa Hu
In this paper, we introduce a novel data structure, ESBF (Extensible and Scalable Bloom Filter), and the algorithm FI-ESBF (Finding frequent Items using ESBF) for estimating the frequent items in data streams. FI-ESBF works with high precision while using much less memory than the best reported algorithm, given the large number of distinct items in the stream. ESBF extends the counting Bloom filter (CBF). It allows the amount of memory used to be adjusted dynamically according to the data distribution and the number of distinct items in the data streams, so prior knowledge of the data distribution and of the number of distinct elements to be stored is not required.
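As background for the abstract above, here is a minimal sketch of a plain counting Bloom filter (CBF), the structure that ESBF is said to extend. It is fixed-size and does not implement ESBF's dynamic resizing. The class name, the default sizes, and the use of SHA-256 to derive positions are assumptions for illustration.

```python
import hashlib

class CountingBloomFilter:
    """Fixed-size counting Bloom filter; an illustrative sketch, not ESBF."""

    def __init__(self, num_counters=1024, num_hashes=4):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _positions(self, item):
        # Derive num_hashes counter positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.counters)

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def estimate(self, item):
        # The minimum counter never underestimates the true count.
        return min(self.counters[pos] for pos in self._positions(item))

cbf = CountingBloomFilter()
for item in ["a"] * 5 + ["b"] * 2:
    cbf.add(item)
frequent = [item for item in ("a", "b") if cbf.estimate(item) >= 4]
```

Because hash collisions can only inflate counters, the estimate is an upper bound on the true frequency, so thresholding it may report false positives but never misses a frequent item.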
fuzzy systems and knowledge discovery | 2007
Shuyun Wang; Xiulan Hao; HeXiang Xu; Yunfa Hu
This paper introduces the algorithm MIBFD (mining frequent items using Bloom filter based on damped model) for mining recent frequent items in data streams. Based on an efficient data structure named the extensible and scalable Bloom filter (ESBF), MIBFD is able to adjust the amount of memory used dynamically. Theoretical analysis and experiments show that MIBFD is efficient in both processing time and memory usage.
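The damped model mentioned above weights recent arrivals more heavily than old ones. A minimal sketch, assuming an exponential decay per arrival and an exact dictionary in place of the Bloom-filter-based structure (so it does not reproduce MIBFD's memory behavior); the class name and the `decay` parameter are hypothetical:

```python
class DampedCounter:
    """Exponentially decayed item weights; illustrative only, not MIBFD."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.time = 0
        self.state = {}  # item -> (weight at last update, time of last update)

    def add(self, item):
        self.time += 1
        weight, last = self.state.get(item, (0.0, self.time))
        # Decay the stored weight lazily for the elapsed steps, then count the arrival.
        self.state[item] = (weight * self.decay ** (self.time - last) + 1.0, self.time)

    def weight(self, item):
        weight, last = self.state.get(item, (0.0, self.time))
        return weight * self.decay ** (self.time - last)

    def frequent(self, threshold):
        return {item for item in self.state if self.weight(item) >= threshold}

counter = DampedCounter(decay=0.5)
for item in ["a", "a", "b"]:
    counter.add(item)
```

Here "a" arrived twice but earlier, so its decayed weight falls below that of the single, most recent "b".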
international conference on machine learning and cybernetics | 2007
Xiulan Hao; Chenghong Zhang; Shuyun Wang; Xiaopeng Tao; Yunfa Hu
As a simple and effective classification approach, KNN is widely used in text categorization. However, the KNN classifier not only has large computational and storage requirements, but its classification performance also deteriorates when training data are unevenly distributed. In this paper, we present a technique that combines multi-edit nearest neighbor and condensing to reduce noise in the training data and to cut time and space costs. Our experimental results show that this strategy solves the above problems effectively.
fuzzy systems and knowledge discovery | 2007
Xiulan Hao; Chenghong Zhang; Xiaopeng Tao; Shuyun Wang; Yunfa Hu
Text classification is one means of understanding text content. It is widely used in information retrieval, spam filtering, monitoring malicious gossip, and blocking pornographic and malicious messages. kNN is widely used in text categorization, but it suffers from biased training data sets. While developing the Prototype of Internet Information Security for the Shanghai Council of Information and Security, we found that when the training set is biased, almost all test documents of some rare (smaller) categories are classified into common (larger) ones by the traditional kNN classifier. In this case, text classification performance cannot satisfy users' requirements. To alleviate this problem, we adopt two measures to boost the kNN classifier. First, we optimize features by removing some candidate features. Second, we modify the traditional decision rules by integrating the number of training samples of each category into them. Extensive experiments show that the adapted kNN significantly improves classification performance on biased corpora.
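The abstract does not state the modified decision rule, so the following is only a sketch of one simple way to bring per-category training counts into a kNN vote: dividing each category's vote by its training set size. The function name and the `adjust_for_class_size` flag are assumptions for illustration, not the paper's method.

```python
from collections import Counter

def knn_predict(train, x, k=3, adjust_for_class_size=True):
    """train is a list of (vector, label) pairs; returns the predicted label."""
    class_sizes = Counter(label for _, label in train)
    neighbors = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    if adjust_for_class_size:
        # Assumed rule: normalize each vote count by the category's training size.
        scores = {label: n / class_sizes[label] for label, n in votes.items()}
    else:
        scores = dict(votes)
    return max(scores, key=scores.get)

train = [
    ((0, 0), "common"), ((1, 0), "common"), ((0, 1), "common"),
    ((1, 1), "common"), ((2, 0), "common"), ((2, 1), "common"),
    ((3, 0), "rare"), ((3, 1), "rare"),
]
query = (2.4, 0.5)
plain = knn_predict(train, query, adjust_for_class_size=False)
adjusted = knn_predict(train, query, adjust_for_class_size=True)
```

In this toy set the three nearest neighbors are two "common" points and one "rare" point, so the plain majority vote picks the larger category while the size-normalized vote picks the smaller one.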