C. Z. Cai
National University of Singapore
Publication
Featured research published by C. Z. Cai.
Nucleic Acids Research | 2003
C. Z. Cai; L. Y. Han; Zhi Liang Ji; Xi Chen; Yu Zong Chen
Prediction of protein function is of significance in studying biological processes. One approach for function prediction is to classify a protein into a functional family. Support vector machine (SVM) is a useful method for such classification, which may involve proteins with diverse sequence distribution. We have developed web-based software, SVMProt, for SVM classification of a protein into a functional family from its primary sequence. The SVMProt classification system is trained from representative proteins of a number of functional families and seed proteins of Pfam curated protein families. It currently covers 54 functional families, and additional families will be added in the near future. The computed accuracy for protein family classification is found to be in the range of 69.1-99.6%. SVMProt shows a certain degree of capability for the classification of distantly related proteins and homologous proteins of different function, and thus may be used as a protein function prediction tool that complements sequence alignment methods. SVMProt can be accessed at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.
Proteins | 2004
C. Z. Cai; L. Y. Han; Zhi Liang Ji; Yu Zong Chen
One approach for facilitating protein function prediction is to classify proteins into functional families. Recent studies on the classification of G-protein coupled receptors and other proteins suggest that a statistical learning method, support vector machines (SVM), may be useful for protein classification into functional families. In this work, SVM is applied and tested on the classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of IUBMB. The SVM classification system for each family is trained from representative enzymes of that family and seed proteins of Pfam curated protein families. The classification accuracy for enzymes from 46 families and for non-enzymes is in the range of 50.0% to 95.7% and 79.0% to 100%, respectively. The corresponding Matthews correlation coefficient is in the range of 54.1% to 96.1%. Moreover, 80.3% of the 8,291 correctly classified enzymes are uniquely classified into a specific enzyme family by using a scoring function, indicating that SVM may have a certain level of unique prediction capability. Testing results also suggest that SVM in some cases is capable of classifying distantly related enzymes and homologous enzymes of different functions. Effort is being made to use a more comprehensive set of enzymes as training sets and to incorporate multi-class SVM classification systems to further enhance the unique prediction accuracy. Our results suggest the potential of SVM for enzyme family classification and for facilitating protein function prediction. Our software is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Proteins 2004.
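The abstract above reports per-family accuracy for members (enzymes) and non-members alongside the Matthews correlation coefficient. As a minimal sketch of how these three quantities relate to confusion-matrix counts, the following uses the standard definitions; the function names and the example counts are hypothetical and are not taken from the paper. The abstract quotes the coefficient as a percentage, which corresponds to the value below multiplied by 100.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


def member_accuracy(tp, fn):
    """Fraction of true family members predicted as members."""
    return tp / (tp + fn)


def nonmember_accuracy(tn, fp):
    """Fraction of true non-members predicted as non-members."""
    return tn / (tn + fp)


# Hypothetical counts for one family (illustration only).
tp, fn, tn, fp = 90, 10, 950, 50
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"member accuracy:     {member_accuracy(tp, fn):.3f}")
print(f"non-member accuracy: {nonmember_accuracy(tn, fp):.3f}")
print(f"MCC:                 {mcc:.3f}")
```

Because the coefficient uses all four counts, it stays informative when non-members greatly outnumber members, a situation in which the two accuracies alone can look deceptively high.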
Mathematical Biosciences | 2003
C. Z. Cai; Wan-lu Wang; Li Zhi Sun; Yu Zong Chen
Support vector machine (SVM) is introduced as a method for the classification of proteins into functionally distinguished classes. Studies are conducted on a number of protein classes including RNA-binding proteins, protein homodimers, proteins responsible for drug absorption, proteins involved in drug distribution and excretion, and drug metabolizing enzymes. Testing accuracy for the classification of these protein classes is found to be in the range of 84-96%. This suggests the usefulness of SVM in the classification of protein functional classes and its potential application in protein function prediction.
Proteins | 2005
Honghuang Lin; L. Y. Han; C. Z. Cai; Zhi Liang Ji; Yu Zong Chen
Transporters play key roles in cellular transport and metabolic processes, and in facilitating drug delivery and excretion. These proteins are classified into families based on the transporter classification (TC) system. Determination of the TC family of transporters facilitates the study of their cellular and pharmacological functions. Methods for predicting TC family without sequence alignments or clustering are particularly useful for studying novel transporters whose function cannot be determined by sequence similarity. This work explores the use of a machine learning method, support vector machines (SVMs), for predicting the family of transporters from their sequence without the use of sequence similarity. A total of 10,636 transporters in 13 TC subclasses, 1914 transporters in eight TC families, and 168,341 nontransporter proteins are used to train and test the SVM prediction system. Testing results by using a separate set of 4351 transporters and 83,151 nontransporter proteins show that the overall accuracy for predicting members of these TC subclasses and families is 83.4% and 88.0%, respectively, and that of nonmembers is 99.3% and 96.6%, respectively. The accuracies for predicting members and nonmembers of individual TC subclasses are in the range of 70.7–96.1% and 97.6–99.9%, respectively, and those of individual TC families are in the range of 60.6–97.1% and 91.5–99.4%, respectively. A further test by using 26,139 transmembrane proteins outside each of the 13 TC subclasses shows that 90.4–99.6% of these are correctly predicted. Our study suggests that the SVM is potentially useful for facilitating functional study of transporters irrespective of sequence similarity. Proteins 2006.
International Journal of Modern Physics C | 2003
C. Z. Cai; Wan-lu Wang; Yu Zong Chen
The support vector machine (SVM) is used in the classification of sonar signals and DNA-binding proteins. Our study on the classification of sonar signals shows that SVM produces a better result than other classification methods, which is consistent with the findings of other studies. The testing accuracy of classification is 95.19%, compared with 90.4% from a multilayered neural network and 82.7% from a nearest neighbor classifier. From our results on the classification of DNA-binding proteins, one finds that SVM gives a testing accuracy of 82.32%, which is slightly better than that obtained from an earlier study of SVM classification of protein–protein interactions. Hence, our study indicates the usefulness of SVM in the identification of DNA-binding proteins. Further improvements in SVM algorithm and parameters are suggested.
Virology | 2005
L. Y. Han; C. Z. Cai; Zhi Liang Ji; Yu Zong Chen
The function of a substantial percentage of the putative protein-coding open reading frames (ORFs) in viral genomes is unknown. As their sequence is not similar to that of proteins of known function, the function of these ORFs cannot be assigned on the basis of sequence similarity. Methods that complement, or are used in combination with, sequence similarity-based approaches are being explored. The web-based software SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi) to some extent assigns protein functional family irrespective of sequence similarity and has been found to be useful for studying distantly related proteins [Cai, C.Z., Han, L.Y., Ji, Z.L., Chen, X., Chen, Y.Z., 2003. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 31(13): 3692–3697]. Here 25 novel viral proteins are selected to test the capability of SVMProt for functional family assignment of viral proteins whose function cannot be confidently predicted by sequence similarity methods at present. These proteins have no sequence homolog in the Swissprot database, have their precise function provided in the literature, and are not included in the training sets of SVMProt. The predicted functional classes of 72% of these proteins match the literature-described function, compared to the overall accuracy of 87% for SVMProt functional class assignment of 34582 proteins. This suggests that SVMProt to some extent is capable of functional class assignment irrespective of sequence similarity and that it is potentially useful for facilitating functional study of novel viral proteins.
The American Journal of Chinese Medicine | 2005
J. F. Wang; C. Z. Cai; C.Y. Kong; Z. W. Cao; Yu Zong Chen
Traditional Chinese medicine (TCM) has been widely practiced and is considered as an alternative to conventional medicine. TCM herbal prescriptions contain a mixture of herbs that collectively exert therapeutic actions and modulating effects. Traditionally defined herbal properties, related to the pharmacodynamic, pharmacokinetic and toxicological, as well as physicochemical properties of their principal ingredients, have been used as the basis for formulating TCM multi-herb prescriptions. These properties are used in this work to develop a computer program for predicting whether a multi-herb recipe is a valid TCM prescription. This program is based on a statistical learning method, support vector machine (SVM), and it is trained by using 575 well-known TCM prescriptions and 1961 non-TCM recipes generated by random combination of TCM herbs. Testing results by using 72 well-known TCM prescriptions and 5039 non-TCM recipes showed that 73.6% of the TCM prescriptions and 99.9% of non-TCM recipes are correctly classified by this system. A further test by using 48 TCM prescriptions published in recent years found that 68.7% of these are correctly classified. These accuracies are comparable to those of SVM classification of other biological systems. Our study indicates the potential of SVM for facilitating the analysis of TCM prescriptions.
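The abstract above describes negative examples as non-TCM recipes generated by random combination of TCM herbs. A minimal sketch of such a generator follows. The herb names, the recipe-size bounds, the fixed seed, and the step that rejects any random recipe matching a known prescription are all assumptions made for illustration; the abstract does not specify them.

```python
import random


def random_recipes(herbs, known_prescriptions, n_recipes,
                   min_size=2, max_size=10, seed=0):
    """Draw distinct random herb combinations that match no known prescription.

    Assumes n_recipes is small relative to the number of possible
    combinations; otherwise the loop may run for a very long time.
    """
    rng = random.Random(seed)
    excluded = {frozenset(p) for p in known_prescriptions}
    seen = set()
    recipes = []
    while len(recipes) < n_recipes:
        size = rng.randint(min_size, max_size)
        recipe = frozenset(rng.sample(herbs, size))
        if recipe in excluded or recipe in seen:
            continue  # skip known prescriptions and duplicates
        seen.add(recipe)
        recipes.append(sorted(recipe))
    return recipes


# Placeholder herb identifiers and prescriptions (hypothetical).
herbs = [f"herb_{i}" for i in range(20)]
known = [["herb_0", "herb_1"], ["herb_2", "herb_3", "herb_4"]]
negatives = random_recipes(herbs, known, n_recipes=50)
print(negatives[:3])
```

Each recipe, positive or negative, would then be encoded through the traditionally defined herbal properties mentioned in the abstract before being passed to the classifier; that encoding is not sketched here.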
Journal of Molecular Microbiology and Biotechnology | 2005
Juan Cui; L. Y. Han; C. Z. Cai; C. J. Zheng; Zhi Liang Ji; Yu Zong Chen
A substantial percentage of the putative protein-encoding open reading frames (ORFs) in bacterial genomes have no homolog of known function, and their function cannot be confidently assigned on the basis of sequence similarity. Methods not based on sequence similarity are needed and are being developed. One method, SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi), predicts protein functional family irrespective of sequence similarity (Nucleic Acids Res. 2003;31:3692–3697). While it has been tested on a large number of proteins, its capability for non-homologous proteins has so far been evaluated for a relatively small number of proteins, and additional tests are needed to assess SVMProt more fully. In this work, 90 novel bacterial proteins (non-homologous to known proteins) are used to evaluate the capability of SVMProt. These proteins are such that none of their homologs are in the Swiss-Prot database, their functions are not clearly described in the literature, and they themselves and their homologs are not included in the training sets of SVMProt. They represent proteins whose function cannot be confidently predicted by sequence similarity methods at present. The predicted functional class of 76.7% of these proteins shows various levels of consistency with the literature-described function, compared to the overall accuracy of 87% for the SVMProt functional class assignment of 34,582 proteins that have at least one homolog of known function. Our study suggests that SVMProt is capable of assigning functional class for novel bacterial proteins at a level not much lower than that of sequence alignment methods for homologous proteins.
International Journal of Modern Physics C | 2006
Hanguang Xiao; C. Z. Cai; Yu Zong Chen
Classifying the types of military vehicles using the acoustic and seismic signals they generate is a difficult and important task. To improve the classification accuracy and reduce the computing time and memory size, we investigated different pre-processing techniques and feature extraction and selection methods. Short Time Fourier Transform (STFT) was employed for feature extraction. Genetic Algorithms (GA) and Principal Component Analysis (PCA) were used for further feature selection and extraction. A new feature vector construction method was proposed by uniting PCA and another feature selection method. K-Nearest Neighbor Classifier (KNN) and Support Vector Machines (SVM) were used for classification. The experimental results showed that the accuracies of KNN and SVM were clearly affected by the window size used to frame the time series of the acoustic and seismic signals. The classification results indicated that the performance of SVM was superior to that of KNN. The comparison of the four feature selection and extraction methods showed that the proposed method is a simple, non-time-consuming, and reliable technique for feature selection and helps the SVM classifier achieve better results than using PCA, GA, or their combination alone.
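The abstract above notes that classification accuracy depends on the window size used to frame the time series before the STFT. The following sketch shows what that framing step might look like: the signal is cut into overlapping windows, each window is tapered, and the magnitude spectrum of each frame becomes one feature vector. The Hann taper, the hop length, the naive DFT, and the synthetic test tone are assumptions chosen to keep the example dependency-free; they are not described in the abstract.

```python
import cmath
import math


def stft_magnitudes(signal, window_size, hop):
    """Return one magnitude spectrum (bins 0..window_size//2) per frame."""
    # Symmetric Hann window to reduce spectral leakage (assumed choice).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (window_size - 1))
              for n in range(window_size)]
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(window_size)]
        spectrum = []
        for k in range(window_size // 2 + 1):
            # Naive O(N^2) DFT; an FFT would be used in practice.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / window_size)
                      for n in range(window_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Synthetic tone with exactly 4 cycles per 32-sample window (illustration only).
signal = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
frames = stft_magnitudes(signal, window_size=32, hop=16)
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in frames]
print(len(frames), "frames, peak bins:", peak_bins)
```

A larger `window_size` gives finer frequency resolution but fewer and longer frames, which is one plausible reason the window size would affect both accuracy and computing cost.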
Nucleic Acids Research | 2004
L. Y. Han; C. Z. Cai; Zhi Liang Ji; Z. W. Cao; Juan Cui; Yu Zong Chen