Xiuzhen Zhang | Researchain

Publication

Featured research published by Xiuzhen Zhang.

Discovery Science | 1999

CAEP: Classification by Aggregating Emerging Patterns

Guozhu Dong; Xiuzhen Zhang; Limsoon Wong; Jinyan Li

Emerging patterns (EPs) are itemsets whose supports change significantly from one dataset to another; they were recently proposed to capture multi-attribute contrasts between data classes, or trends over time. In this paper we propose a new classifier, CAEP, using the following main ideas based on EPs: (i) Each EP can sharply differentiate the class membership of a (possibly small) fraction of instances containing the EP, due to the big difference between its supports in the opposing classes; we define the differentiating power of the EP in terms of the supports and their ratio, on instances containing the EP. (ii) For each instance t, by aggregating the differentiating power of a fixed, automatically selected set of EPs, a score is obtained for each class. The scores for all classes are normalized and the largest score determines t's class. CAEP is suitable for many applications, even those with large volumes of high-dimensional (e.g. 45-dimensional) data; it does not depend on dimension reduction on data; and it is usually equally accurate on all classes even if their populations are unbalanced. Experiments show that CAEP has consistently good predictive accuracy, and it almost always outperforms C4.5 and CBA. By using efficient, border-based algorithms (developed elsewhere) to discover EPs, CAEP scales up on data volume and dimensionality. Observing that accuracy on the whole dataset is too coarse a description of classifiers, we also used more accurate measures, sensitivity and precision, to better characterize the performance of classifiers. CAEP is also very good under these measures.
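The aggregation idea in this abstract can be illustrated with a minimal sketch. It assumes instances are sets of items and weights each matching pattern by growth/(growth + 1) times its support in the target class; that weighting, the function names, and the toy data are assumptions made here for illustration, and the score normalization mentioned in the abstract is omitted.

```python
def support(pattern, dataset):
    """Fraction of instances (frozensets of items) that contain the whole pattern."""
    if not dataset:
        return 0.0
    return sum(1 for inst in dataset if pattern <= inst) / len(dataset)


def growth_rate(pattern, target, other):
    """Ratio of the pattern's support in `target` to its support in `other`."""
    s_target, s_other = support(pattern, target), support(pattern, other)
    if s_other == 0:
        return float("inf") if s_target > 0 else 0.0
    return s_target / s_other


def aggregate_score(instance, patterns, target, other):
    """Sum the (assumed) differentiating power of every pattern contained in `instance`."""
    score = 0.0
    for p in patterns:
        if not p <= instance:
            continue
        gr = growth_rate(p, target, other)
        # Assumed weighting: gr / (gr + 1), which tends to 1 as gr grows without bound.
        weight = 1.0 if gr == float("inf") else gr / (gr + 1.0)
        score += weight * support(p, target)
    return score


# Hypothetical toy data: two classes of itemset instances.
class_a = [frozenset("xy"), frozenset("xy"), frozenset("x"), frozenset("z")]
class_b = [frozenset("x"), frozenset("z"), frozenset("z"), frozenset("yz")]
patterns = [frozenset("xy"), frozenset("z")]

t = frozenset("xy")
score_a = aggregate_score(t, patterns, class_a, class_b)
score_b = aggregate_score(t, patterns, class_b, class_a)
predicted = "a" if score_a > score_b else "b"
```

Here the pattern {x, y} never occurs in class b, so it contributes strongly to the class a score for any instance containing it.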

European Conference on Principles of Data Mining and Knowledge Discovery | 1999

Efficient Mining of High Confidence Association Rules without Support Thresholds

Jinyan Li; Xiuzhen Zhang; Guozhu Dong; Kotagiri Ramamohanarao; Qun Sun

Association rules describe the degree of dependence between items in transactional datasets by their confidences. In this paper, we first introduce the problem of mining top rules, namely those association rules with 100% confidence. Traditional approaches to this problem need a minimum support (minsup) threshold and then can discover the top rules with supports ≥ minsup; such approaches, however, rely on minsup to help avoid examining too many candidates and they miss those top rules whose supports are below minsup. The low support top rules (e.g. some unusual combinations of some factors that have always caused some disease) may be very interesting. Fundamentally different from previous work, our proposed method uses a dataset partitioning technique and two border-based algorithms to efficiently discover all top rules with a given consequent, without the constraint of support threshold. Importantly, we use borders to concisely represent all top rules, instead of enumerating them individually. We also discuss how to discover all zero-confidence rules and some very high (say 90%) confidence rules using approaches similar to mining top rules. Experimental results using the Mushroom, the Cleveland heart disease, and the Boston housing datasets are reported to evaluate the efficiency of the proposed approach.
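To make the notion of a low-support top rule concrete, here is a small sketch that computes the support and confidence of a single rule by brute force. It does not implement the partitioning or border-based algorithms the abstract describes; the function name and the toy transactions are hypothetical.

```python
def rule_stats(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    covered = [t for t in transactions if antecedent <= t]
    if not covered:
        return 0.0, None  # confidence is undefined when nothing matches
    hits = sum(1 for t in covered if consequent <= t)
    return hits / len(transactions), hits / len(covered)


# Hypothetical toy transactions.
transactions = [
    {"a", "b", "sick"},
    {"c", "sick"},
    {"c"},
    {"d"},
    {"c", "d"},
]

rare_top_rule = rule_stats({"a", "b"}, {"sick"}, transactions)
common_rule = rule_stats({"c"}, {"sick"}, transactions)
```

The rule {a, b} -> {sick} has 100% confidence but only 20% support, so a miner with a minimum support of, say, 40% would never report it.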

BMC Bioinformatics | 2009

Large-scale prediction of long disordered regions in proteins using random forests

Pengfei Han; Xiuzhen Zhang; Raymond S. Norton; Zhi-Ping Feng

Background: Many proteins contain disordered regions that lack fixed three-dimensional (3D) structure under physiological conditions but have important biological functions. Prediction of disordered regions in protein sequences is important for understanding protein function and in high-throughput determination of protein structures. Machine learning techniques, including neural networks and support vector machines, have been widely used in such predictions. Predictors designed for long disordered regions are usually less successful in predicting short disordered regions. Combining prediction of short and long disordered regions will dramatically increase the complexity of the prediction algorithm and make the predictor unsuitable for large-scale applications. Efficient batch prediction of long disordered regions alone is of greater interest in large-scale proteome studies. Results: A new algorithm, IUPforest-L, for predicting long disordered regions using the random forest learning model is proposed in this paper. IUPforest-L is based on the Moreau-Broto auto-correlation function of amino acid indices (AAIs) and other physicochemical features of the primary sequences. In 10-fold cross validation tests, IUPforest-L can achieve an area of 89.5% under the receiver operating characteristic (ROC) curve. Compared with existing disorder predictors, IUPforest-L has high prediction accuracy and is efficient for predicting long disordered regions in large-scale proteomes. Conclusion: The random forest model based on the auto-correlation functions of the AAIs within a protein fragment and other physicochemical features could effectively detect long disordered regions in proteins. A new predictor, IUPforest-L, was developed to batch predict long disordered regions in proteins, and the server can be accessed from http://dmg.cs.rmit.edu.au/IUPforest/IUPforest-L.php

Intelligent Data Engineering and Automated Learning | 2000

Information-Based Classification by Aggregating Emerging Patterns

Xiuzhen Zhang; Guozhu Dong; Kotagiri Ramamohanarao

Emerging patterns (EPs) are knowledge patterns capturing contrasts between data classes. In this paper, we propose an information-based approach for classification by aggregating emerging patterns. The constraint-based EP mining algorithm enables the system to learn from large-volume and high-dimensional data; the new approach for selecting representative EPs and the efficient algorithm for finding the EPs give the system high predictive accuracy and short classification time. Experiments on many benchmark datasets show that the resulting classifiers have good overall predictive accuracy, and are often also superior to other state-of-the-art classification systems such as C4.5, CBA and LB.

Knowledge Discovery and Data Mining | 2011

Improving k nearest neighbor with exemplar generalization for imbalanced classification

Yuxuan Li; Xiuzhen Zhang

A k nearest neighbor (kNN) classifier classifies a query instance to the most frequent class of its k nearest neighbors in the training instance space. For imbalanced class distribution, a query instance is often overwhelmed by majority class instances in its neighborhood and is likely to be classified to the majority class. We propose to identify exemplar minority class training instances and generalize them to Gaussian balls as concepts for the minority class. Our k Exemplar-based Nearest Neighbor (kENN) classifier is therefore more sensitive to the minority class. Extensive experiments show that kENN significantly improves the performance of kNN and also outperforms popular re-sampling and cost-sensitive learning strategies for imbalanced classification.
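The imbalance problem described here can be reproduced with a plain kNN majority vote. The sketch below is ordinary kNN, not the kENN method, and the points and labels are hypothetical; it only shows how a query sitting next to a minority instance is outvoted once k grows.

```python
import math
from collections import Counter


def knn_predict(query, training, k):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical data: one minority point surrounded by majority points.
training = [
    ((0.0, 0.0), "minority"),
    ((0.5, 0.0), "majority"),
    ((0.0, 0.5), "majority"),
    ((-0.5, 0.0), "majority"),
    ((0.0, -0.5), "majority"),
]

query = (0.1, 0.1)
with_k1 = knn_predict(query, training, k=1)
with_k3 = knn_predict(query, training, k=3)
```

With k = 1 the query takes the label of the adjacent minority point; with k = 3 the two closest majority points outvote it.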

BMC Bioinformatics | 2009

Predicting disordered regions in proteins using the profiles of amino acid indices

Pengfei Han; Xiuzhen Zhang; Zhi-Ping Feng

Background: Intrinsically unstructured or disordered proteins are common and functionally important. Prediction of disordered regions in proteins can provide useful information for understanding protein function and for high-throughput determination of protein structures. Results: In this paper, algorithms are presented to predict long and short disordered regions in proteins, namely the long disordered region prediction algorithm DRaai-L and the short disordered region prediction algorithm DRaai-S. These algorithms are developed based on the Random Forest machine learning model and the profiles of amino acid indices representing various physicochemical and biochemical properties of the 20 amino acids. Conclusion: Experiments on DisProt3.6 and CASP7 demonstrate that some sets of the amino acid indices have strong association with the ordered and disordered status of residues. Our algorithms, which use the profiles of these amino acid indices as input features to predict disordered regions in proteins, outperform those based on amino acid composition and reduced amino acid composition, and also outperform many existing algorithms. Our studies suggest that the profiles of amino acid indices combined with the Random Forest learning model are an important complementary method for pinpointing disordered regions in proteins.

IEEE Transactions on Knowledge and Data Engineering | 2007

Efficient Computation of Iceberg Cubes by Bounding Aggregate Functions

Xiuzhen Zhang; Pauline Lienhua Chou; Guozhu Dong

The iceberg cubing problem is to compute the multidimensional group-by partitions that satisfy given aggregation constraints. Pruning unproductive computation for iceberg cubing when nonantimonotone constraints are present is a great challenge because the aggregate functions do not increase or decrease monotonically along the subset relationship between partitions. In this paper, we propose a novel bound prune cubing (BP-Cubing) approach for iceberg cubing with nonantimonotone aggregation constraints. Given a cube over n dimensions, an aggregate for any group-by partition can be computed from aggregates for the most specific n-dimensional partitions (MSPs). The largest and smallest aggregate values computed this way become the bounds for all partitions in the cube. We provide efficient methods to compute tight bounds for base aggregate functions and, more interestingly, arithmetic expressions thereof, from bounds of aggregates over the MSPs. Our methods produce tighter bounds than those obtained by previous approaches. We present iceberg cubing algorithms that combine bounding with efficient aggregation strategies. Our experiments on real-world and artificial benchmark data sets demonstrate that BP-Cubing algorithms achieve more effective pruning and are several times faster than state-of-the-art iceberg cubing algorithms and that BP-Cubing achieves the best performance with the top-down cubing approach.
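As a rough illustration of bounding from MSPs, the sketch below handles only the average of a measure: the average over any coarser group-by partition is a weighted mean of its MSP averages, so it lies between the smallest and largest of them. This is a simple bound chosen for illustration, not the tighter bounds or the cubing algorithms of the paper; the function names and the sales rows are hypothetical.

```python
from collections import defaultdict


def msp_aggregates(rows, n_dims):
    """Group rows on all n dimensions; return {key: (sum, count)} of the measure."""
    agg = defaultdict(lambda: [0.0, 0])
    for row in rows:
        key, measure = tuple(row[:n_dims]), row[n_dims]
        agg[key][0] += measure
        agg[key][1] += 1
    return {key: (total, count) for key, (total, count) in agg.items()}


def avg_bounds(msps):
    """Lower and upper bounds on AVG for every partition built from these MSPs."""
    averages = [total / count for total, count in msps.values()]
    return min(averages), max(averages)


def can_prune(msps, min_avg):
    """True if no partition over these MSPs can satisfy AVG >= min_avg."""
    _, upper = avg_bounds(msps)
    return upper < min_avg


# Hypothetical rows: (city, product, sales).
rows = [
    ("syd", "tv", 100),
    ("syd", "tv", 300),
    ("mel", "tv", 50),
    ("mel", "radio", 70),
]

msps = msp_aggregates(rows, n_dims=2)
bounds = avg_bounds(msps)
```

For example, the coarser partition (product = tv) has average 150, which falls inside the computed bounds, and a constraint such as AVG >= 250 can be rejected for the whole cube without enumerating any partition.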

Journal of Computational Biology | 2006

Predicting Disordered Regions in Proteins Based on Decision Trees of Reduced Amino Acid Composition

Pengfei Han; Xiuzhen Zhang; Raymond S. Norton; Zhi-Ping Feng

Intrinsically unstructured proteins (IUPs) are proteins lacking a fixed three-dimensional structure or containing long disordered regions. IUPs play an important role in biology and disease. Identifying disordered regions in protein sequences can provide useful information on protein structure and function, and can assist high-throughput protein structure determination. In this paper we present a system for predicting disordered regions in proteins based on decision trees and reduced amino acid composition. Concise rules based on biochemical properties of amino acid side chains are generated for prediction. Coarser information extracted from the composition of amino acids can not only improve the prediction accuracy but also increase the learning efficiency. In cross-validation tests, with four groups of reduced amino acid composition, our system can achieve a recall of 80% at a 13% false positive rate for predicting disordered regions, and the overall accuracy can reach 83.4%. This prediction accuracy is comparable to most, and better than some, existing predictors. Advantages of our approach are high prediction accuracy for long disordered regions and efficiency for large-scale sequence analysis. Our software is freely available for academic use upon request.

Asia Pacific Web Conference | 2004

Efficient frequent pattern mining on web logs

Liping Sun; Xiuzhen Zhang

Mining frequent patterns from Web logs is an important data mining task. Candidate-generation-and-test and pattern-growth are two representative frequent pattern mining approaches. We have conducted extensive experiments on real-world Web log data to analyse the characteristics of Web logs and the behaviours of these two approaches on Web logs. To improve the performance of current algorithms on mining Web logs, we propose a new algorithm, Combined Frequent Pattern Mining (CFPM), to cater for Web log data specifically. We use heuristics to prune the search space and reduce costs in mining so that better efficiency is achieved. Experimental results show that CFPM significantly improves the performance of the pattern-growth approach by 1.2–7.8 times on mining frequent patterns from Web logs.

Australasian Joint Conference on Artificial Intelligence | 2003

Noise Tolerance of EP-Based Classifiers

Qun Sun; Xiuzhen Zhang; Kotagiri Ramamohanarao

Emerging Pattern (EP)-based classifiers are a new type of classifier based on itemsets whose occurrence in one dataset varies significantly from that in another. These classifiers are very promising and have been shown to perform comparably with some popular classifiers. In this paper, we conduct two experiments to study the noise tolerance of EP-based classifiers. A primary concern is to ascertain whether overfitting occurs in them. Our results highlight the fact that the aggregating approach in constructing EP-based classifiers prevents them from overfitting. We further conclude that perfect training accuracy does not necessarily lead to overfitting of a classifier as long as there exists a suitable mechanism, such as an aggregating approach, to counterbalance any propensity to overfit.
