Harry Zhang
University of New Brunswick
Publications
Featured research published by Harry Zhang.
Canadian Conference on Artificial Intelligence | 2003
Charles X. Ling; Jin Huang; Harry Zhang
Predictive accuracy has been widely used as the main criterion for comparing the predictive ability of classification systems (such as C4.5, neural networks, and Naive Bayes). Most of these classifiers also produce probability estimates of the classification, but these are completely ignored by the accuracy measure. This is often taken for granted because both training and testing sets only provide class labels. In this paper we establish rigorously that, even in this setting, the area under the ROC (Receiver Operating Characteristic) curve, or simply AUC, provides a better measure than accuracy. Our result is quite significant for three reasons. First, we establish, for the first time, rigorous criteria for comparing evaluation measures for learning algorithms. Second, it suggests that AUC should replace accuracy when measuring and comparing classification systems. Third, our result also prompts us to reevaluate many well-established conclusions based on accuracy in machine learning. For example, it is well accepted in the machine learning community that, in terms of predictive accuracy, Naive Bayes and decision trees are very similar. Using AUC, however, we show experimentally that Naive Bayes is significantly better than the decision-tree learning algorithms.
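As a quick illustration of the contrast the paper draws (a toy example of mine, not taken from the paper), the sketch below computes accuracy and AUC from the same probability estimates, using the Mann-Whitney rank formulation of AUC. Two classifiers that make identical thresholded decisions can still differ in AUC, because AUC is sensitive to how the examples are ranked.

```python
# Toy comparison of accuracy vs. AUC on the same probability estimates.
# Illustrative sketch only, not code from the paper.

def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(labels, probs):
    """AUC as the Mann-Whitney statistic: the probability that a random
    positive is scored above a random negative (ties count half)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
clf_a = [0.9, 0.8, 0.30, 0.60, 0.2, 0.1]  # one error each way, mild
clf_b = [0.9, 0.8, 0.45, 0.95, 0.2, 0.1]  # same decisions, worse ranking

print(accuracy(labels, clf_a), auc(labels, clf_a))  # 0.667  0.889
print(accuracy(labels, clf_b), auc(labels, clf_b))  # 0.667  0.667
```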
IEEE Transactions on Knowledge and Data Engineering | 2009
Liangxiao Jiang; Harry Zhang; Zhihua Cai
Because learning an optimal Bayesian network classifier is an NP-hard problem, learning improved naive Bayes models has attracted much attention from researchers. In this paper, we summarize the existing improved algorithms and propose a novel Bayes model: hidden naive Bayes (HNB). In HNB, a hidden parent is created for each attribute, combining the influences from all other attributes. We experimentally test HNB in terms of classification accuracy, using the 36 UCI data sets selected by Weka, and compare it to naive Bayes (NB), selective Bayesian classifiers (SBC), naive Bayes tree (NBTree), tree-augmented naive Bayes (TAN), and averaged one-dependence estimators (AODE). The experimental results show that HNB significantly outperforms NB, SBC, NBTree, TAN, and AODE. In many data mining applications, accurate class probability estimation and ranking are also desirable. We study the class probability estimation and ranking performance, measured by conditional log likelihood (CLL) and the area under the ROC curve (AUC), respectively, of naive Bayes and its improved models, such as SBC, NBTree, TAN, and AODE, and then compare HNB to them in terms of CLL and AUC. Our experiments show that HNB also significantly outperforms all of them.
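If I read the HNB construction correctly (notation mine; hedge against the paper's exact definitions), each attribute's hidden parent mixes the pairwise conditionals contributed by all the other attributes:

```latex
P(c \mid a_1, \dots, a_n) \propto P(c) \prod_{i=1}^{n} P(a_i \mid \mathrm{hp}_i, c),
\qquad
P(a_i \mid \mathrm{hp}_i, c) = \sum_{j \neq i} W_{ij}\, P(a_i \mid a_j, c),
```

where the weights W_ij are nonnegative and sum to one over j; to my recollection the paper derives them from the conditional mutual information between attribute pairs, so the model needs no structure learning at all.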
International Conference on Machine Learning | 2008
Jiang Su; Harry Zhang; Charles X. Ling; Stan Matwin
Bayesian network classifiers have been widely used for classification problems. Given a fixed Bayesian network structure, parameter learning can take two different approaches: generative and discriminative learning. While generative parameter learning is more efficient, discriminative parameter learning is more effective. In this paper, we propose a simple, efficient, and effective discriminative parameter learning method, called Discriminative Frequency Estimate (DFE), which learns parameters by discriminatively computing frequencies from data. Empirical studies show that the DFE algorithm integrates the advantages of both generative and discriminative learning: it performs as well as the state-of-the-art discriminative parameter learning method ELR in accuracy, but is significantly more efficient.
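The following is a loose sketch of how loss-driven frequency updates could look for a naive Bayes structure. It is my reading of the abstract, not the authors' implementation; the update rule (add the current prediction loss to the counts rather than 1) and the Laplace smoothing are assumptions.

```python
# Loose sketch of Discriminative Frequency Estimate for a naive Bayes
# structure -- my reading of the abstract, not the authors' code.
from collections import defaultdict

def dfe_train(data, labels, classes, n_iters=5, alpha=1.0):
    counts = defaultdict(float)        # (attr_index, value, class) -> count
    totals = defaultdict(float)        # (attr_index, class) -> total count
    class_counts = defaultdict(float)  # class -> count
    domains = [set(x[i] for x in data) for i in range(len(data[0]))]

    def predict_proba(x):
        """Naive Bayes posterior from the current (smoothed) counts."""
        z = sum(class_counts[c] + alpha for c in classes)
        scores = {}
        for c in classes:
            s = (class_counts[c] + alpha) / z
            for i, v in enumerate(x):
                s *= (counts[(i, v, c)] + alpha) / \
                     (totals[(i, c)] + alpha * len(domains[i]))
            scores[c] = s
        norm = sum(scores.values())
        return {c: s / norm for c, s in scores.items()}

    # Key idea: instead of adding 1 per instance (the generative frequency
    # estimate), add the current prediction loss, so poorly predicted
    # instances push their counts harder -- a discriminative update.
    for _ in range(n_iters):
        for x, y in zip(data, labels):
            loss = 1.0 - predict_proba(x)[y]
            class_counts[y] += loss
            for i, v in enumerate(x):
                counts[(i, v, y)] += loss
                totals[(i, y)] += loss
    return predict_proba
```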
International Conference on Data Mining | 2004
Harry Zhang; Shengli Sheng
Naive Bayes is one of the most effective classification algorithms. In many applications, however, a ranking of examples is more desirable than a bare classification. How to extend naive Bayes to improve its ranking performance is an interesting and practically useful question. Weighted naive Bayes is an extension of naive Bayes in which attributes have different weights. This paper investigates how to learn a weighted naive Bayes with accurate ranking from data, or more precisely, how to learn the weights of a weighted naive Bayes to produce accurate ranking. We explore various methods: the gain ratio method, the hill climbing method, the Markov chain Monte Carlo method, the hill climbing method combined with the gain ratio method, and the Markov chain Monte Carlo method combined with the gain ratio method. Our experiments show that a weighted naive Bayes trained to produce accurate ranking outperforms naive Bayes.
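A weighted naive Bayes usually scores an example as P(c) multiplied by each conditional raised to its attribute weight. The sketch below (mine; prior, cond, and auc_of are hypothetical inputs the caller supplies) shows that scoring rule and a bare-bones hill-climbing loop over the weights, one of the search strategies the abstract lists.

```python
# Sketch of the weighted naive Bayes scoring rule plus a simple hill-climbing
# search over attribute weights -- illustrative, not the paper's exact
# procedure; prior, cond, and auc_of are supplied by the caller.
import random

def wnb_score(x, c, prior, cond, weights):
    """Weighted naive Bayes: P(c) * prod_i P(a_i | c) ** w_i."""
    s = prior[c]
    for i, v in enumerate(x):
        s *= cond[(i, v, c)] ** weights[i]
    return s

def hill_climb_weights(n_attrs, auc_of, steps=200, delta=0.1, seed=0):
    """Greedily perturb one weight at a time, keeping changes that improve
    the ranking quality (e.g. held-out AUC) reported by auc_of(weights)."""
    rng = random.Random(seed)
    w = [1.0] * n_attrs          # weights of 1.0 recover plain naive Bayes
    best = auc_of(w)
    for _ in range(steps):
        cand = list(w)
        i = rng.randrange(n_attrs)
        cand[i] = max(0.0, cand[i] + rng.choice([-delta, delta]))
        score = auc_of(cand)
        if score > best:         # keep only improving moves
            w, best = cand, score
    return w, best
```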
Applied Soft Computing | 2011
Zhihua Cai; Wenyin Gong; Charles X. Ling; Harry Zhang
Hybridization with other algorithms is an interesting direction for the improvement of differential evolution (DE). In this paper, a hybrid DE based on one-step k-means clustering, called clustering-based DE (CDE), is presented for unconstrained global optimization problems. The one-step k-means clustering acts as several multi-parent crossover operators that use the information in the population efficiently, and hence it can enhance the performance of DE. To validate the performance of our approach, 30 benchmark functions with a wide range of dimensions and diverse complexities are employed. Experimental results indicate that our approach is effective and efficient. Compared with other state-of-the-art DE approaches, our approach performs better, or at least comparably, in terms of the quality of the final solutions and the reduction of the number of fitness function evaluations (NFFEs).
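For orientation, here is a rough Python sketch of the idea as I understand it from the abstract: standard DE/rand/1/bin plus a one-step k-means pass whose cluster centroids act as multi-parent trial points. The schedule and replacement rule are my guesses, not the paper's.

```python
import numpy as np

def cde_minimize(f, bounds, pop_size=30, iters=200, F=0.5, CR=0.9, k=5, rng=None):
    """Rough sketch of clustering-based DE (details are assumptions)."""
    rng = rng or np.random.default_rng(0)
    lo, hi = np.asarray(bounds, dtype=float).T
    pop = rng.uniform(lo, hi, (pop_size, len(lo)))
    fit = np.apply_along_axis(f, 1, pop)

    for _ in range(iters):
        # Standard DE/rand/1/bin generation step.
        for i in range(pop_size):
            a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)
            cross = rng.random(len(lo)) < CR
            trial = np.where(cross, mutant, pop[i])
            ft = f(trial)
            if ft <= fit[i]:
                pop[i], fit[i] = trial, ft
        # One-step k-means: assign once, average each cluster once, and let
        # the centroids act as multi-parent offspring replacing the worst.
        centers = pop[rng.choice(pop_size, k, replace=False)]
        labels = np.argmin(((pop[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = pop[labels == j]
            if len(members) > 1:
                centroid = members.mean(axis=0)   # multi-parent combination
                fc = f(centroid)
                worst = np.argmax(fit)
                if fc < fit[worst]:
                    pop[worst], fit[worst] = centroid, fc
    best = np.argmin(fit)
    return pop[best], fit[best]
```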
International Journal of Pattern Recognition and Artificial Intelligence | 2005
Harry Zhang
Naive Bayes is one of the most efficient and effective inductive learning algorithms for machine learning and data mining. Its competitive performance in classification is surprising, because the conditional independence assumption on which it is based is rarely true in real-world applications. An open question is: what is the true reason for the surprisingly good performance of Naive Bayes in classification? In this paper, we propose a novel explanation for the good classification performance of Naive Bayes. We show that, essentially, dependence distribution plays a crucial role. Here dependence distribution means how the local dependence of an attribute distributes in each class, evenly or unevenly, and how the local dependences of all attributes work together, consistently (supporting a certain classification) or inconsistently (canceling each other out). Specifically, we show that no matter how strong the dependences among attributes are, Naive Bayes can still be optimal if the dependences distribute evenly in classes, or if the dependences cancel each other out. We propose and prove a necessary and sufficient condition for the optimality of Naive Bayes. Further, we investigate the optimality of Naive Bayes under the Gaussian distribution. We present and prove a sufficient condition for the optimality of Naive Bayes in which dependences among attributes do exist. This provides evidence that dependences may cancel each other out. Our theoretical analysis can be used in designing learning algorithms. In fact, a major class of learning algorithms for Bayesian networks is conditional independence based (or CI-based), which is essentially based on dependence. We design a dependence distribution-based algorithm by extending the Chow-Liu algorithm, a widely used CI-based algorithm. Our experiments show that the new algorithm outperforms the Chow-Liu algorithm, which also provides empirical evidence supporting our new explanation.
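Reconstructing the setup from the abstract (approximate notation, mine): naive Bayes classifies by its estimated posterior odds, and optimality on an example E means its odds estimate lands on the same side of 1 as the true odds, regardless of how badly the probabilities themselves are estimated:

```latex
f_b(E) = \frac{P(+ \mid E)}{P(- \mid E)}, \qquad
f_{nb}(E) = \frac{P(+)}{P(-)} \prod_{i=1}^{n} \frac{P(a_i \mid +)}{P(a_i \mid -)},
\qquad
\text{Naive Bayes is optimal on } E
\;\iff\;
\bigl( f_b(E) \ge 1 \Leftrightarrow f_{nb}(E) \ge 1 \bigr).
```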
European Conference on Machine Learning | 2004
Harry Zhang; Jiang Su
It is well known that naive Bayes performs surprisingly well in classification, but its probability estimation is poor. In many applications, however, a ranking based on class probabilities is desired. For example, a ranking of customers in terms of the likelihood that they will buy one's products is useful in direct marketing. What is the general performance of naive Bayes in ranking? In this paper, we study this question through both empirical experiments and theoretical analysis. Our experiments show that naive Bayes outperforms C4.4, a state-of-the-art decision-tree algorithm for ranking. We study two example problems that have been used in analyzing the performance of naive Bayes in classification [3]. Surprisingly, naive Bayes performs perfectly on them in ranking, even though it does not in classification. Finally, we present and prove a sufficient condition for the optimality of naive Bayes in ranking.
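A toy illustration of the phenomenon (mine, not from the paper): probability estimates can be uniformly too low, so every thresholded decision on the positives is wrong, while the induced ranking is still perfect.

```python
# Toy example: probabilities shifted too low misclassify every positive at
# the 0.5 threshold, yet the ranking they induce is perfect (AUC = 1).

labels = [1, 1, 0, 0]
probs = [0.45, 0.40, 0.35, 0.30]   # every example predicted negative

acc = sum((p >= 0.5) == y for p, y in zip(probs, labels)) / len(labels)
print(acc)                          # 0.5: both positives misclassified

ranked = [y for _, y in sorted(zip(probs, labels), reverse=True)]
print(ranked)                       # [1, 1, 0, 0]: perfect ranking
```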
Theory and Applications of Satisfiability Testing | 2007
Chu Min Li; Wanxia Wei; Harry Zhang
The adaptive noise mechanism was introduced in Novelty+ to automatically adapt noise settings during the search [4]. The local search algorithm G2WSAT deterministically exploits promising decreasing variables to reduce randomness and, consequently, the dependence on noise parameters. In this paper, we first integrate the adaptive noise mechanism into G2WSAT to obtain an algorithm adaptG2WSAT, whose performance suggests that the deterministic exploitation of promising decreasing variables cooperates well with this mechanism. Then, we propose an approach that uses look-ahead for promising decreasing variables to further reinforce this cooperation. We implement this approach in adaptG2WSAT, resulting in a new local search algorithm called adaptG2WSATP. Without any manual noise or other parameter tuning, adaptG2WSATP generally performs as well as G2WSAT with approximately optimal static noise settings, and is sometimes even better. In addition, adaptG2WSATP compares favorably with state-of-the-art local search algorithms such as R+adaptNovelty+ and VW.
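For orientation, the adaptive noise rule this line of work builds on (constants as best I recall from Hoos's adaptive noise mechanism; G2WSAT's promising decreasing variables and the paper's look-ahead are omitted) looks roughly like this:

```python
# Sketch of the adaptive noise rule (after Hoos's adaptive noise mechanism,
# constants from memory and possibly inexact).

def adapt_noise(noise, flips_since_improve, num_clauses,
                improved, theta=1.0 / 6, phi=0.2):
    """Raise the noise level when the search stagnates; cut it on improvement."""
    if improved:
        return noise - noise * phi / 2           # progress: exploit more
    if flips_since_improve > theta * num_clauses:
        return noise + (1.0 - noise) * phi       # stagnation: diversify
    return noise
```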
International Conference on Machine Learning | 2006
Jiang Su; Harry Zhang
The structure of a Bayesian network (BN) encodes variable independence. Learning the structure of a BN, however, is typically of high computational complexity. In this paper, we explore and represent variable independence in learning conditional probability tables (CPTs), instead of in learning structure. A full Bayesian network is used as the structure, and a decision tree is learned for each CPT. The resulting model is called a full Bayesian network classifier (FBC). In learning an FBC, learning the decision trees for CPTs essentially captures both variable independence and context-specific independence. We present a novel, efficient decision tree learning algorithm, which is also effective in the context of FBC learning. In our experiments, the FBC learning algorithm demonstrates better performance in both classification and ranking compared with other state-of-the-art learning algorithms. In addition, its reduced effort on structure learning makes its time complexity quite low as well.
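A minimal sketch (mine, not the paper's data structures) of a CPT stored as a decision tree: leaves hold a distribution over the child variable, internal nodes test one parent, and any parent never tested on a path is contextually irrelevant there, which is exactly the context-specific independence the abstract mentions.

```python
# CPT as a decision tree: internal nodes test one parent variable, leaves
# hold a distribution over the child (hypothetical variables and numbers).

class CPTNode:
    def __init__(self, attr=None, children=None, dist=None):
        self.attr = attr                  # parent attribute tested here
        self.children = children or {}    # parent value -> subtree
        self.dist = dist                  # at a leaf: {child_value: prob}

    def lookup(self, assignment):
        node = self
        while node.dist is None:
            node = node.children[assignment[node.attr]]
        return node.dist

# P(Play | Outlook, Windy): once Outlook is "overcast", Windy is irrelevant,
# so a single leaf covers that whole context.
cpt = CPTNode(attr="Outlook", children={
    "overcast": CPTNode(dist={"yes": 0.9, "no": 0.1}),
    "sunny": CPTNode(attr="Windy", children={
        True: CPTNode(dist={"yes": 0.4, "no": 0.6}),
        False: CPTNode(dist={"yes": 0.7, "no": 0.3}),
    }),
})
print(cpt.lookup({"Outlook": "overcast", "Windy": True}))  # Windy never read
```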
Knowledge Based Systems | 2012
Liangxiao Jiang; Zhihua Cai; Dianhong Wang; Harry Zhang
Numerous algorithms have been proposed to improve Naive Bayes (NB) by weakening its conditional attribute independence assumption, among which Tree Augmented Naive Bayes (TAN) has demonstrated remarkable classification performance in terms of classification accuracy or error rate, while maintaining efficiency and simplicity. In many real-world applications, however, classification accuracy or error rate is not enough. For example, in direct marketing, we often need to deploy different promotion strategies to customers with different likelihoods (class probabilities) of buying some products. Thus, accurate class probability estimation is often required to make optimal decisions. In this paper, we investigate the class probability estimation performance of TAN in terms of conditional log likelihood (CLL) and present a new algorithm that improves its class probability estimation performance by averaging the spanning TAN classifiers. We call our improved algorithm Averaged Tree Augmented Naive Bayes (ATAN). The experimental results on a large number of UCI datasets published on the main web site of the Weka platform show that ATAN significantly outperforms TAN and all the other comparison algorithms in terms of CLL.
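Schematically, and judging only from the abstract and the name Averaged TAN, the prediction step might look like the sketch below; build_tan and predict_proba are hypothetical placeholders for a TAN learner and its class-probability output.

```python
# Schematic sketch of ATAN's averaging step, inferred from the abstract;
# build_tan(root) and its predict_proba are hypothetical placeholders.

def atan_predict_proba(x, attributes, build_tan, classes):
    """Average P(c | x) over the TAN classifiers spanned by each root choice."""
    models = [build_tan(root) for root in attributes]   # one TAN per root
    avg = {c: 0.0 for c in classes}
    for m in models:
        proba = m.predict_proba(x)                      # {class: probability}
        for c in classes:
            avg[c] += proba[c] / len(models)
    return avg
```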