Jiang Su
University of New Brunswick
Publications
Featured research published by Jiang Su.
International Conference on Machine Learning | 2008
Jiang Su; Harry Zhang; Charles X. Ling; Stan Matwin
Bayesian network classifiers have been widely used for classification problems. Given a fixed Bayesian network structure, parameter learning can take two different approaches: generative and discriminative learning. While generative parameter learning is more efficient, discriminative parameter learning is more effective. In this paper, we propose a simple, efficient, and effective discriminative parameter learning method, called Discriminative Frequency Estimate (DFE), which learns parameters by discriminatively computing frequencies from data. Empirical studies show that the DFE algorithm integrates the advantages of both generative and discriminative learning: it performs as well as the state-of-the-art discriminative parameter learning method ELR in accuracy, but is significantly more efficient.
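The abstract does not spell out the frequency update, so here is a minimal sketch of one plausible reading of the DFE idea for a naive Bayes structure: instead of incrementing each count by 1 (the generative frequency estimate), the count is incremented by the prediction error 1 − P(true class | x) under the current parameters. The number of passes, the smoothing, and the update rule itself are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def dfe_naive_bayes(X, y, n_values, n_classes, passes=5, smoothing=1.0):
    """Discriminative-frequency-style parameter learning for naive Bayes (sketch).

    X: integer-coded attributes, shape (n_samples, n_attrs); y: class indices 0..n_classes-1.
    Counts grow by the prediction error 1 - P(true class | x) rather than by 1."""
    class_counts = np.zeros(n_classes)
    attr_counts = [np.zeros((n_classes, v)) for v in n_values]

    def predict_proba(x):
        log_p = np.log(class_counts + smoothing)
        for i, xi in enumerate(x):
            cond = attr_counts[i] + smoothing          # Laplace-smoothed counts
            log_p += np.log(cond[:, xi] / cond.sum(axis=1))
        p = np.exp(log_p - log_p.max())
        return p / p.sum()

    for _ in range(passes):
        for x, c in zip(X, y):
            loss = 1.0 - predict_proba(x)[c]           # discriminative weight for this instance
            class_counts[c] += loss
            for i, xi in enumerate(x):
                attr_counts[i][c, xi] += loss
    return predict_proba                               # closure over the learned counts
```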
European Conference on Machine Learning | 2004
Harry Zhang; Jiang Su
It is well-known that naive Bayes performs surprisingly well in classification, but its probability estimation is poor. In many applications, however, a ranking based on class probabilities is desired. For example, a ranking of customers in terms of the likelihood that they buy one's products is useful in direct marketing. What is the general performance of naive Bayes in ranking? In this paper, we study it by both empirical experiments and theoretical analysis. Our experiments show that naive Bayes outperforms C4.4, a state-of-the-art decision-tree algorithm for ranking. We study two example problems that have been used in analyzing the performance of naive Bayes in classification [3]. Surprisingly, naive Bayes performs perfectly on them in ranking, even though it does not in classification. Finally, we present and prove a sufficient condition for the optimality of naive Bayes in ranking.
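To make the ranking setting concrete, the short example below (illustrative, not from the paper) fits a naive Bayes model with scikit-learn on synthetic binary data and ranks instances by P(y = 1 | x); classification thresholds the same probabilities at 0.5, whereas ranking only uses their order.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical direct-marketing-style data: binary attributes, binary class.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))
y = (X[:, 0] & X[:, 1]) | X[:, 2]            # a simple concept over the attributes

nb = BernoulliNB().fit(X, y)

labels = nb.predict(X)                        # classification: threshold P(y=1|x) at 0.5
scores = nb.predict_proba(X)[:, 1]            # ranking: only the order of the scores matters
ranking = np.argsort(-scores)                 # instances most likely to be positive first
```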
International Conference on Machine Learning | 2006
Jiang Su; Harry Zhang
The structure of a Bayesian network (BN) encodes variable independence. Learning the structure of a BN, however, is typically of high computational complexity. In this paper, we explore and represent variable independence in learning conditional probability tables (CPTs), instead of in learning structure. A full Bayesian network is used as the structure and a decision tree is learned for each CPT. The resulting model is called a full Bayesian network classifier (FBC). In learning an FBC, learning the decision trees for the CPTs essentially captures both variable independence and context-specific independence. We present a novel, efficient decision tree learning algorithm, which is also effective in the context of FBC learning. In our experiments, the FBC learning algorithm demonstrates better performance in both classification and ranking compared with other state-of-the-art learning algorithms. In addition, its reduced effort on structure learning makes its time complexity quite low as well.
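A rough way to picture "a full structure whose CPTs are decision trees" is to fix an arbitrary attribute ordering and learn one tree per attribute for P(x_i | class, x_1, ..., x_{i-1}), then multiply the tree-estimated conditionals at classification time. The sketch below does exactly that with scikit-learn trees; the paper's own tree learner and structural details differ, so treat this as an assumption-laden illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class FullBNTreeSketch:
    """Full structure over integer-coded attributes; each CPT
    P(x_i | class, x_1..x_{i-1}) is approximated by a decision tree."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.prior_ = np.array([(y == c).mean() for c in self.classes_])
        self.trees_ = []
        for i in range(X.shape[1]):
            parents = np.column_stack([y, X[:, :i]])   # class plus preceding attributes
            self.trees_.append(
                DecisionTreeClassifier(min_samples_leaf=5).fit(parents, X[:, i]))
        return self

    def predict_proba(self, X):
        out = []
        for x in X:
            scores = np.log(self.prior_)               # log P(c)
            for ci, c in enumerate(self.classes_):
                for i, tree in enumerate(self.trees_):
                    parents = np.concatenate(([c], x[:i])).reshape(1, -1)
                    probs = tree.predict_proba(parents)[0]
                    hit = np.where(tree.classes_ == x[i])[0]
                    p = probs[hit[0]] if hit.size else 1e-6   # floor for unseen values
                    scores[ci] += np.log(max(p, 1e-6))
            p = np.exp(scores - scores.max())
            out.append(p / p.sum())
        return np.array(out)
```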
Journal of Experimental and Theoretical Artificial Intelligence | 2008
Harry Zhang; Jiang Su
It is well known that naive Bayes performs surprisingly well in classification, but its probability estimation is poor. AUC (the area under the receiver operating characteristic curve) is a measure, different from classification accuracy and probability estimation, that is often used to assess the quality of rankings. Indeed, an accurate ranking of examples is often more desirable than a mere classification. What is the general performance of naive Bayes in yielding optimal ranking, measured by AUC? In this paper, we study it systematically by both empirical experiments and theoretical analysis. In our experiments, we compare naive Bayes with a state-of-the-art decision-tree learning algorithm C4.4 for ranking, and some popular extensions of naive Bayes which achieve a significant improvement over naive Bayes in classification, such as the selective Bayesian classifier (SBC) and tree-augmented naive Bayes (TAN). Our experimental results show that naive Bayes performs significantly better than C4.4 and comparably with TAN. This provides empirical evidence that naive Bayes performs well in ranking. Then we analyse theoretically the optimality of naive Bayes in ranking. We study two example problems: conjunctive concepts and m-of-n concepts, which have been used in analysing the performance of naive Bayes in classification. Surprisingly, naive Bayes performs optimally on them in ranking, even though it does not in classification. We present and prove a sufficient condition for the optimality of naive Bayes in ranking. From both empirical and theoretical studies, we believe that naive Bayes is a competitive model for ranking. A preliminary version of this paper appeared in ECML 2004.
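AUC itself is just the Mann-Whitney rank-sum statistic computed on the model's scores; for readers who want to reproduce the ranking comparisons, a small self-contained helper (not part of the paper) is:

```python
import numpy as np

def auc(scores, labels):
    """AUC via the Mann-Whitney formulation: the probability that a randomly
    chosen positive example is ranked above a randomly chosen negative one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):                 # average the ranks of tied scores
        tied = scores == s
        ranks[tied] = ranks[tied].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```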
International Conference on Machine Learning | 2005
Harry Zhang; Liangxiao Jiang; Jiang Su
Naive Bayes is an effective and efficient learning algorithm in classification. In many applications, however, an accurate ranking of instances based on the class probability is more desirable. Unfortunately, naive Bayes has been found to produce poor probability estimates. Numerous techniques have been proposed to extend naive Bayes for better classification accuracy, of which selective Bayesian classifiers (SBC) (Langley & Sage, 1994), tree-augmented naive Bayes (TAN) (Friedman et al., 1997), NBTree (Kohavi, 1996), boosted naive Bayes (Elkan, 1997), and AODE (Webb et al., 2005) achieve remarkable improvement over naive Bayes in terms of classification accuracy. An interesting question is: Do these techniques also produce accurate ranking? In this paper, we first conduct a systematic experimental study on their efficacy for ranking. Then, we propose a new approach to augmenting naive Bayes for generating accurate ranking, called hidden naive Bayes (HNB). In an HNB, a hidden parent is created for each attribute to represent the influences from all other attributes, and thus a more accurate ranking is expected. HNB inherits the structural simplicity of naive Bayes and can be easily learned without structure learning. Our experiments show that HNB outperforms naive Bayes, SBC, boosted naive Bayes, NBTree, and TAN significantly, and performs slightly better than AODE in ranking.
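One way to read "a hidden parent that represents the influences from all other attributes" is as a mixture of the pairwise conditionals P(a_i | a_j, c) weighted by conditional mutual information; the sketch below implements that reading, with the weighting scheme and Laplace smoothing as illustrative assumptions rather than the paper's exact estimator.

```python
import numpy as np
from itertools import product

def conditional_mutual_info(xi, xj, y):
    """Empirical estimate of I(A_i; A_j | C)."""
    cmi = 0.0
    for c, a, b in product(np.unique(y), np.unique(xi), np.unique(xj)):
        p_abc = np.mean((xi == a) & (xj == b) & (y == c))
        if p_abc == 0:
            continue
        p_c = np.mean(y == c)
        p_ac = np.mean((xi == a) & (y == c))
        p_bc = np.mean((xj == b) & (y == c))
        cmi += p_abc * np.log(p_abc * p_c / (p_ac * p_bc))
    return cmi

def hnb_predict_proba(X, y, x_test, alpha=1.0):
    """Hidden-parent classification (sketch):
    P(c | x) proportional to P(c) * prod_i sum_j W[i, j] * P(x_i | x_j, c)."""
    classes = np.unique(y)
    m = X.shape[1]
    W = np.zeros((m, m))                       # mixture weights, assumed proportional to CMI
    for i in range(m):
        for j in range(m):
            if i != j:
                W[i, j] = conditional_mutual_info(X[:, i], X[:, j], y)
        W[i] /= W[i].sum() if W[i].sum() > 0 else 1.0
    probs = []
    for c in classes:
        mask = y == c
        p = mask.mean()                        # class prior
        for i in range(m):
            mix = 0.0
            for j in range(m):
                if i == j:
                    continue
                sel = mask & (X[:, j] == x_test[j])
                num = ((X[:, i] == x_test[i]) & sel).sum() + alpha
                den = sel.sum() + alpha * len(np.unique(X[:, i]))
                mix += W[i, j] * num / den     # smoothed P(x_i | x_j, c)
            p *= mix
        probs.append(p)
    probs = np.array(probs)
    return classes, probs / probs.sum()
```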
Database Systems for Advanced Applications | 2005
Liangxiao Jiang; Harry Zhang; Zhihua Cai; Jiang Su
Naive Bayes has been widely used in data mining as a simple and effective classification algorithm. Since its conditional independence assumption is rarely true, numerous algorithms have been proposed to improve naive Bayes, among which tree augmented naive Bayes (TAN) [3] achieves a significant improvement in terms of classification accuracy, while maintaining efficiency and model simplicity. In many real-world data mining applications, however, an accurate ranking is more desirable than a classification. Thus it is interesting to ask whether TAN also achieves a significant improvement in terms of ranking, measured by AUC (the area under the Receiver Operating Characteristics curve) [8,1]. Unfortunately, our experiments show that TAN performs even worse than naive Bayes in ranking. Responding to this fact, we present a novel learning algorithm, called forest augmented naive Bayes (FAN), by modifying the traditional TAN learning algorithm. We experimentally test our algorithm on all the 36 data sets recommended by Weka [12], and compare it to naive Bayes, SBC [6], TAN [3], and C4.4 [10], in terms of AUC. The experimental results show that our algorithm outperforms all the other algorithms significantly in yielding accurate rankings. Our work provides an effective and efficient data mining algorithm for applications in which an accurate ranking is required.
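The abstract does not say how TAN's spanning tree becomes a forest, so the following is a hedged sketch of one natural realization: run a Kruskal-style maximum-weighted-spanning construction over the conditional mutual information between attribute pairs, but drop edges whose weight falls below a threshold, so that some attributes keep the class as their only parent. The paper's actual pruning rule may differ.

```python
import numpy as np

def maximum_spanning_forest(cmi, threshold):
    """Kruskal-style maximum weighted spanning forest over attribute pairs.

    cmi[i, j] holds an estimate of I(A_i; A_j | C); edges below `threshold`
    are discarded, turning TAN's spanning tree into a forest."""
    m = cmi.shape[0]
    parent = list(range(m))

    def find(a):                                # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = sorted(((cmi[i, j], i, j) for i in range(m) for j in range(i + 1, m)),
                   reverse=True)
    forest = []
    for w, i, j in edges:
        if w < threshold:                       # weak dependence: leave attributes unlinked
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.append((i, j, w))
    return forest   # undirected augmenting edges; orient each component from an arbitrary root
```

A simple (assumed) choice of threshold is the mean conditional mutual information over all attribute pairs.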
Pattern Recognition Letters | 2006
Harry Zhang; Jiang Su
Accurate ranking, measured by AUC (the area under the ROC curve), is crucial in many real-world applications. Most traditional learning algorithms, however, aim only at high classification accuracy. It has been observed that traditional decision trees produce good classification accuracy but poor probability estimates. Since the ranking generated by a decision tree is based on the class probabilities, a probability estimation tree (PET) with accurate probability estimates is desired in order to yield high AUC. Some researchers ascribe the poor probability estimates of decision trees to the decision tree learning algorithms. In our observation, however, the representation also plays an important role. In this paper, we propose to extend decision trees to represent a joint distribution and conditional independence, called conditional independence trees (CITrees), which is a more suitable model for yielding high AUC. We propose a novel AUC-based algorithm for learning CITrees, and our experiments show that the CITree algorithm outperforms the state-of-the-art decision tree learning algorithm C4.4 (a variant of C4.5), naive Bayes, and NBTree in AUC. Our work provides an effective model and algorithm for applications in which an accurate ranking is required.
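To give a feel for what an AUC-based growing criterion might look like, the sketch below scores a candidate split by fitting a local naive Bayes in each branch and measuring the AUC of the pooled probability estimates. This is an illustrative stand-in rather than the paper's learning algorithm, and it assumes a binary 0/1 class and integer-coded attributes.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import CategoricalNB

def auc_of_candidate_split(X, y, attr):
    """Score splitting on `attr`: fit a local naive Bayes over the remaining
    attributes in each branch and compute the AUC of the pooled estimates."""
    rest = [j for j in range(X.shape[1]) if j != attr]
    scores = np.zeros(len(y), dtype=float)
    for v in np.unique(X[:, attr]):
        mask = X[:, attr] == v
        if len(np.unique(y[mask])) < 2:
            scores[mask] = float(y[mask].mean())   # pure branch: constant estimate
            continue
        nb = CategoricalNB().fit(X[mask][:, rest], y[mask])
        scores[mask] = nb.predict_proba(X[mask][:, rest])[:, 1]
    return roc_auc_score(y, scores)

# The attribute with the highest split AUC would become the test at the root;
# the same procedure would then recurse in each branch until AUC stops improving.
```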
Canadian Conference on Artificial Intelligence | 2005
Liangxiao Jiang; Harry Zhang; Jiang Su
The instance-based k-nearest neighbor algorithm (KNN) [1] is an effective classification model. Its classification is simply based on a vote within the neighborhood, consisting of the k nearest neighbors of the test instance. Recently, researchers have been interested in deploying a more sophisticated local model, such as naive Bayes, within the neighborhood. It is expected that there are no strong dependences within the neighborhood of the test instance, thus alleviating the conditional independence assumption of naive Bayes. Generally, the smaller the size of the neighborhood (the value of k), the less the chance of encountering strong dependences. When k is small, however, the training data for the local naive Bayes is small and its classification would be inaccurate. In currently existing models, such as LWNB [3], a relatively large k is chosen; the consequence is that strong dependences seem unavoidable. In our opinion, a small k should be preferred in order to avoid strong dependences. We propose to deal with the resulting lack of local training data using sampling (cloning). Given a test instance, clones of each instance in the neighborhood are generated in proportion to its similarity to the test instance and added to the local training data; then, the local naive Bayes is trained from the expanded training data. Since a relatively small k is chosen, the chance of encountering strong dependences within the neighborhood is small, and thus the classification of the resulting local naive Bayes would be more accurate. We experimentally compare our new algorithm with KNN and its improved variants in terms of classification accuracy, using the 36 UCI datasets recommended by Weka [8], and the experimental results show that our algorithm outperforms all those algorithms significantly and consistently at various k values.
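The cloning step translates into a few lines of scikit-learn; the similarity measure, the clone counts, and the Gaussian naive Bayes below are illustrative assumptions (the paper works with discrete attributes and its own similarity function).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestNeighbors

def clone_and_classify(X, y, x_test, k=10, max_clones=10):
    """Local naive Bayes over a cloned neighborhood (sketch): each of the k
    nearest neighbors is replicated in proportion to its similarity to the
    test instance, then a naive Bayes is trained on the expanded data."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    dist, idx = nn.kneighbors(x_test.reshape(1, -1))
    dist, idx = dist[0], idx[0]
    sim = 1.0 / (1.0 + dist)                              # assumed similarity measure
    counts = np.maximum(1, np.round(max_clones * sim / sim.max()).astype(int))
    X_local = np.repeat(X[idx], counts, axis=0)           # "cloning" the neighbors
    y_local = np.repeat(y[idx], counts)
    if len(np.unique(y_local)) == 1:                      # pure neighborhood
        return y_local[0]
    return GaussianNB().fit(X_local, y_local).predict(x_test.reshape(1, -1))[0]
```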
Advanced Data Mining and Applications | 2005
Liangxiao Jiang; Harry Zhang; Jiang Su
Accurate probability-based ranking of instances is crucial in many real-world data mining applications. KNN (k-nearest neighbor) [1] has been intensively studied as an effective classification model for decades. However, its performance in ranking is unknown. In this paper, we conduct a systematic study on the ranking performance of KNN. At first, we compare KNN and KNNDW (KNN with distance weighting) to decision trees and naive Bayes in ranking, measured by AUC (the area under the Receiver Operating Characteristics curve). Then, we propose to improve the ranking performance of KNN by combining KNN with naive Bayes. The idea is that a naive Bayes is learned using the k nearest neighbors of the test instance as the training data and used to classify the test instance. A critical problem in combining KNN with naive Bayes is the lack of training data when k is small. We propose to deal with it using sampling to expand the training data. That is, each of the k nearest neighbors is “cloned” and the clones are added to the training data. We call our new model instance cloning local naive Bayes (ICLNB for short). We conduct an extensive empirical comparison of the related algorithms in two groups in terms of AUC, using the 36 UCI datasets recommended by Weka [2]. In the first group, we compare ICLNB with other types of algorithms: C4.4 [3], naive Bayes, and NBTree [4]. In the second group, we compare ICLNB with KNN, KNNDW, and LWNB [5]. Our experimental results show that ICLNB outperforms all those algorithms significantly. From our study, we draw two conclusions. First, KNN-related algorithms perform well in ranking. Second, our new algorithm ICLNB performs best among the algorithms compared in this paper and could be used in applications in which an accurate ranking is desired.
European Conference on Machine Learning | 2004
Harry Zhang; Jiang Su
It has been observed that traditional decision trees produce poor probability estimates. In many applications, however, a probability estimation tree (PET) with accurate probability estimates is desirable. Some researchers ascribe the poor probability estimates of decision trees to the decision tree learning algorithms. In our observation, however, the representation also plays an important role. Indeed, the representation of decision trees is fully expressive theoretically, but it is often impractical to learn such a representation with accurate probability estimates from limited training data. In this paper, we extend decision trees to represent a joint distribution and conditional independence, called conditional independence trees (CITrees), which is a more suitable model for PETs. We propose a novel algorithm for learning CITrees, and our experiments show that the CITree algorithm outperforms C4.5 and naive Bayes significantly in classification accuracy.