Haiguang Li
University of Vermont
Publication
Featured research published by Haiguang Li.
Hawaii International Conference on System Sciences | 2011
Haiguang Li; Gongqing Wu; Xuegang Hu; Jing Zhang; Lian Li; Xindong Wu
Clustering is one of the most widely used techniques for exploratory data analysis. Across all disciplines, from the social sciences through biology to computer science, people try to get a first intuition about their data by identifying meaningful groups among the data objects. K-means is one of the most famous clustering algorithms. Its simplicity and speed allow it to run on large data sets. However, it also has several drawbacks. First, the algorithm is unstable and sensitive to outliers. Second, it becomes inefficient when dealing with large data sets. In this paper, a method is proposed to solve those problems: it uses the ensemble learning method bagging to overcome the instability and sensitivity to outliers, and the distributed computing framework MapReduce to solve the inefficiency problem in clustering large data sets. Extensive experiments have been performed to show that our approach is efficient.
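The combination of bagging and K-means described in this abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the function names (`kmeans`, `bagged_kmeans`), the parameters `n_bags` and `seed`, and in particular the merge step (re-clustering the pooled per-bag centers) are assumptions made for illustration. In a MapReduce deployment, each bootstrap sample would plausibly be clustered by a separate map task, with the merge done in a reduce step.

```python
import math
import random


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd iterations; points are tuples of floats."""
    # Seed from distinct points so that no two initial centers coincide.
    centers = rng.sample(sorted(set(points)), k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
    return centers


def bagged_kmeans(points, k, n_bags=5, seed=0):
    """Cluster several bootstrap samples, then merge the resulting centers."""
    rng = random.Random(seed)
    pooled = []
    # "Map" step: cluster each bootstrap sample independently.
    for _ in range(n_bags):
        bag = [rng.choice(points) for _ in points]  # sample with replacement
        pooled.extend(kmeans(bag, k, rng))
    # "Reduce" step (assumed merge rule): cluster the pooled centers.
    return sorted(kmeans(pooled, k, rng))
```

Calling `bagged_kmeans(points, k=2)` on a list of 2-D tuples returns two merged centers in sorted order. Averaging over bootstrap samples is what is meant to damp the run-to-run instability of a single K-means run.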
ChinaGrid Annual Conference | 2009
Gongqing Wu; Haiguang Li; Xuegang Hu; Yuanjun Bi; Jing Zhang; Xindong Wu
Classification is a significant technique in data mining research and applications. C4.5 is a widely used classification method, and ensemble learning adopts a parallel and distributed computing model for classification. Based on analyses of the MapReduce computing paradigm and the process of ensemble learning, we find that the parallel and distributed computing model in MapReduce is appropriate for implementing ensemble learning. This paper combines the advantages of C4.5, ensemble learning and the MapReduce computing model, and proposes a new method, MReC4.5, for parallel and distributed ensemble classification. Our experimental results show that increasing the number of nodes benefits the effectiveness of classification modeling, and that serialization operations at the model level make the MReC4.5 classifier “construct once, use anywhere”.
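The train-per-partition, vote, and serialize pattern described in this abstract might look roughly like the sketch below. It is an illustration under stated assumptions, not MReC4.5 itself: a one-feature decision stump stands in for C4.5, the map and reduce phases are ordinary Python functions rather than Hadoop tasks, and all function names are hypothetical. Serializing the whole ensemble with `pickle` mirrors the "construct once, use anywhere" idea.

```python
import pickle
from collections import Counter


def train_stump(rows):
    """Fit a threshold classifier on (x, label) rows; a stand-in for C4.5."""
    best = None
    for t in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    return best[1:]  # (threshold, left_label, right_label)


def predict_stump(model, x):
    threshold, left_label, right_label = model
    return left_label if x <= threshold else right_label


def map_phase(partitions):
    """One base classifier per data partition."""
    return [train_stump(rows) for rows in partitions]


def reduce_phase(models):
    """Bundle the base classifiers into one serialized ensemble."""
    return pickle.dumps(models)


def classify(serialized_ensemble, x):
    """Deserialize the ensemble and take a majority vote."""
    models = pickle.loads(serialized_ensemble)
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]
```

A typical use would split the training rows into partitions, call `reduce_phase(map_phase(partitions))` once, store the resulting bytes, and later pass them to `classify` on any machine that has the same code available.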
ChinaGrid Annual Conference | 2010
Jing Zhang; Gongqing Wu; Haiguang Li; Xuegang Hu; Xindong Wu
Clustering is one of the important methods in data mining. K-Means is a typical distance-based clustering algorithm, and 2-tier clustering implements scalable clustering by means of dividing, sampling and knowledge integration. Among the tools for distributed processing, Map-Reduce has been widely embraced by both academia and industry. Hadoop is an open-source parallel and distributed programming framework that implements the Map-Reduce computing model. From an analysis of the Map-Reduce computing paradigm, we find that the Hadoop parallel and distributed computing model is appropriate for implementing scalable clustering algorithms. This paper takes advantage of K-Means, the 2-tier clustering mechanism and the Map-Reduce computing model, and proposes a new method for parallel and distributed clustering based on Map-Reduce. The method aims to apply the clustering algorithm effectively in a distributed environment. Extensive studies demonstrate that the proposed algorithm is scalable and that its time performance is stable. Moreover, adding cluster nodes improves the time performance of clustering.
International Conference on Data Mining | 2013
Haiguang Li; Xindong Wu; Zhao Li; Wei Ding
Group feature selection makes use of structural information among features to discover a meaningful subset of features. Existing group feature selection algorithms deal only with pre-given candidate feature sets and are incapable of handling streaming features. On the other hand, feature selection algorithms designed for streaming features operate only at the individual feature level, without considering the intrinsic group structures of the features. In this paper, we perform group feature selection with streaming features. We propose to perform feature selection at the group and individual feature levels simultaneously, working from a feature stream rather than a pre-given candidate feature set. In our approach, the group structures are fully utilized to reduce the cost of evaluating streaming features. We have extensively evaluated the proposed method. Experimental results demonstrate that our proposed algorithms statistically outperform state-of-the-art feature selection methods in terms of classification accuracy.
ACM Symposium on Applied Computing | 2014
Haiguang Li; Xindong Wu; Zhao Li
Mobile devices built with powerful embedded sensors create new opportunities for data mining applications such as monitoring user activity. In this paper, we target user recognition based on remote-control sensor data, in which activity recognition determines a user's action, which in turn helps collect each individual's sensor data to identify different users. This new problem faces two challenges: first, sensor data is sensitive and constantly changing, which makes it difficult to obtain meaningful features; second, streaming sensor data for online learning is usually imbalanced, and traditional classifiers do not perform well on it. To address these challenges, we introduce an efficient activity recognition algorithm that explores the physical appearance of sensor data, followed by an online incremental classifier that deals with imbalanced data streams by adaptively generating training data. Extensive online and offline experiments demonstrate that our proposed method outperforms state-of-the-art algorithms in terms of accuracy.
Fuzzy Systems and Knowledge Discovery | 2009
Gongqing Wu; Xindong Wu; Xuegang Hu; Haiguang Li; Ying Liu; Ren-Gan Xu
Many Web news sites have similar structures and layout styles. Our extensive case studies indicate that there is a potential relationship between Web content layouts and path patterns. Compared with the delimiting features of Web content, path patterns have many advantages, such as high positioning accuracy, ease of use and strong pervasive performance. Consequently, this paper proposes a Web information extraction model with path patterns constructed by a path pattern mining algorithm. Our experimental data set is obtained by randomly selecting news Web pages from the CNN website. With a reasonable tolerance threshold, the experimental results show that the average precision is above 99% and the average recall is 100% when we integrate Web information extraction with our path pattern mining algorithm. The path patterns produced by the mining algorithm perform much better than a priori extraction rules configured from domain knowledge.
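As a rough illustration of what extraction by path pattern can mean, the sketch below records the tag path of every text node, picks the path that carries the most text across a set of training pages, and then extracts the text found at that path in a new page. The scoring rule, the class and function names, and the handling of void tags are all assumptions made for this example; the paper's actual mining algorithm and tolerance threshold are not described in the abstract and are not reproduced here.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they are left out of the path.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Collects (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(("/".join(self.stack), text))


def text_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def mine_pattern(training_pages):
    """Assumed scoring rule: the path carrying the most text characters."""
    weight = Counter()
    for page in training_pages:
        for path, text in text_paths(page):
            weight[path] += len(text)
    return weight.most_common(1)[0][0]


def extract(html, pattern):
    """Return the text nodes of a page whose tag path equals the pattern."""
    return [text for path, text in text_paths(html) if path == pattern]
```

The pattern is mined once from sample pages of a site and then reused on unseen pages with the same layout, which is why similar structure across news pages matters for this kind of approach.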
Applied Intelligence | 2013
Haiguang Li; Zhao Li; Robert T. White; Xindong Wu
In recent years, the use of advanced technologies such as wireless communication and sensors in intelligent transportation systems has significantly increased the amount of traffic data available. With this data, traffic prediction can improve traffic conditions and reduce travel delays by facilitating better utilization of available capacity. This paper presents a real-time transportation prediction system named VTraffic for Vermont Agencies of Transportation, which integrates traffic flow theory, advanced sensors, data gathering, data integration, data mining and visualization technologies to estimate and visualize current and future traffic. In the VTraffic system, acoustic sensors were installed to monitor and collect real-time data. Reliable predictions can be obtained from historical data and then verified and refined with current and near-future real-time data.
Granular Computing | 2009
Haiguang Li; Gongqing Wu; Xuegang Hu; Xindong Wu; Yuanjun Bi; Peipei Li
Named entity relations are a foundation of semantic networks, ontologies and the semantic Web, and are widely used in information retrieval and machine translation, as well as in automatic question answering systems. Relation feature selection and extraction are two key issues. Location features possess excellent computability and operability, while semantic features have strong intelligibility and reality. Currently, relation extraction for Chinese named entities mainly adopts the Vector Space Model (VSM) or a traditional semantic computing method; these two methods use either the location features or the semantic features only, resulting in unsatisfactory extraction. To improve the extraction results, we propose LSE, a relation extraction method for Chinese named entities that combines the information gain of word positions with semantic computing based on HowNet, and that is scalable, semi-supervised and domain independent. Extensive experiments show that our approach is superior, with an F-score of 0.881, which is at least 0.115 better than existing extraction methods that use either the location features or the semantic features alone.
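The information gain mentioned in this abstract is the standard entropy-based measure, which can be computed as in the sketch below. How LSE maps word positions to feature values is not stated in the abstract, so the function names and the idea of treating a position as a categorical feature value are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(
        (len(group) / total) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional
```

A feature whose values separate the relation labels perfectly has a gain equal to the label entropy, while a feature that is independent of the labels has a gain of zero, so ranking position features by this score favors the positions that are most informative about the relation type.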
National Conference on Artificial Intelligence | 2013
Haiguang Li; Xindong Wu; Zhao Li; Wei Ding