Tonghua Li
Tongji University
Publication
Featured research published by Tonghua Li.
Chemometrics and Intelligent Laboratory Systems | 2002
Kailin Tang; Tonghua Li
In this paper, a new algorithm, partial least squares (PLS) improved by genetic algorithm–genetic programming (GA-GP), is applied to handle the functions for the inner relationship in quantitative structure–activity relationship (QSAR) studies. PLS is used to build a linear or nonlinear model between the principal components and the activity, and GA-GP is applied to the regressions and equations. This extends the range of PLS modeling. Using a polynomial function as the inner relationship, a set of 79 inhibitors of HIV-1 reverse transcriptase, derivatives of a recently reported HIV-1-specific lead, 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT), was studied. The obtained QSAR model shows high predictive ability, with r_cv = 0.900, which demonstrates that the method is useful.
BMC Bioinformatics | 2010
Kailin Tang; Tonghua Li; Wenwei Xiong; Kai Chen
BACKGROUND Recent advances in proteomics technologies such as SELDI-TOF mass spectrometry have shown promise in the detection of early stage cancers. However, dimensionality reduction and classification are considerable challenges in statistical machine learning. We therefore propose a novel approach for dimensionality reduction and test it using published high-resolution SELDI-TOF data for ovarian cancer. RESULTS We propose a method based on statistical moments to reduce feature dimensions. After refining and t-testing, SELDI-TOF data are divided into several intervals. Four statistical moments (mean, variance, skewness and kurtosis) are calculated for each interval and are used as representative variables. The high dimensionality of the data can thus be rapidly reduced. To improve efficiency and classification performance, the data are further used in kernel PLS models. The method achieved an average sensitivity of 0.9950, specificity of 0.9916, accuracy of 0.9935 and a correlation coefficient of 0.9869 for 100 five-fold cross validations. Furthermore, only one control was misclassified in leave-one-out cross validation. CONCLUSION The proposed method is suitable for analyzing high-throughput proteomics data.
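The moment-based reduction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the equal-width split into intervals and the population (divide-by-n) form of the moments are all assumptions made for illustration, and the refining, t-testing and kernel PLS steps are omitted.

```python
def interval_moments(values):
    """Return (mean, variance, skewness, kurtosis) for one interval of intensities."""
    n = len(values)
    mean = sum(values) / n
    # Central moments in population form (dividing by n); an assumption of this sketch.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # Flat interval: skewness and kurtosis are undefined, reported here as 0.
        return mean, 0.0, 0.0, 0.0
    return mean, m2, m3 / m2 ** 1.5, m4 / m2 ** 2


def moment_features(spectrum, n_intervals):
    """Split a spectrum into n_intervals contiguous chunks and concatenate their moments."""
    n = len(spectrum)
    features = []
    for i in range(n_intervals):
        lo, hi = i * n // n_intervals, (i + 1) * n // n_intervals
        features.extend(interval_moments(spectrum[lo:hi]))
    return features


# Hypothetical toy spectrum: 8 intensity values reduced to 2 intervals x 4 moments.
print(moment_features([1, 2, 3, 4, 2, 2, 2, 2], 2))
```

Whatever the number of raw m/z points, each spectrum is represented by 4 × n_intervals variables, which is where the dimensionality reduction comes from.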
Talanta | 2005
Hongtao Gao; Tonghua Li; Kai Chen; Wei-Guang Li; Xian Bi
Non-negative matrix factorization (NMF), with the constraints of non-negativity, has been recently proposed for multi-variate data analysis. Because it allows only additive, not subtractive, combinations of the original data, NMF is capable of producing region or parts-based representation of objects. It has been used for image analysis and text processing. Unlike PCA, the resolutions of NMF are non-negative and can be easily interpreted and understood directly. Due to multiple solutions, the original algorithm of NMF [D.D. Lee, H.S. Seung, Nature 401 (1999) 788] is not suitable for resolving chemical mixed signals. In reality, NMF has never been applied to resolving chemical mixed signals. It must be modified according to the characteristics of the chemical signals, such as smoothness of spectra, unimodality of chromatograms, sparseness of mass spectra, etc. We have used the modified NMF algorithm to narrow the feasible solution region for resolving chemical signals, and found that it could produce reasonable and acceptable results for certain experimental errors, especially for overlapping chromatograms and sparse mass spectra. Simulated two-dimensional (2-D) data and real GUJINGGONG alcohol liquor GC-MS data have been resolved soundly by NMF technique. Butyl caproate and its isomeric compound (butyric acid, hexyl ester) have been identified from the overlapping spectra. The result of NMF is preferable to that of Heuristic evolving latent projections (HELP). It shows that NMF is a promising chemometric resolution method for complex samples.
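For orientation, the original Lee and Seung multiplicative-update algorithm cited in this abstract can be sketched as follows. This is the unmodified baseline, not the constrained variant the authors developed (no smoothness, unimodality or sparseness constraints are applied), and the helper names, the random initialization and the toy data matrix are assumptions of the sketch.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-12):
    """Factor a non-negative V (m x n) into W (m x rank) and H (rank x n)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(m)]
    return w, h


# Hypothetical two-component mixture: rows are elution times, columns are m/z channels.
V = [[1, 0, 2], [2, 3, 5], [1, 6, 4], [0, 3, 1]]
W, H = nmf(V, rank=2)
print(frobenius_error(V, W, H))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout. Different seeds can converge to different factor pairs, which is the multiple-solution problem the abstract refers to.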
Bioinformatics | 2012
Dapeng Li; Tonghua Li; Peisheng Cong; Wenwei Xiong; Jiangming Sun
MOTIVATION The precise prediction of protein secondary structure is of key importance for the prediction of 3D structure and biological function. Although the development of many excellent methods over the last few decades has allowed the achievement of prediction accuracies of up to 80%, progress seems to have reached a bottleneck, and further improvements in accuracy have proven difficult. RESULTS We propose for the first time a structural position-specific scoring matrix (SPSSM), and establish an unprecedented database of 9 million sequences and their SPSSMs. This database, when combined with a purpose-designed BLAST tool, provides a novel prediction tool: SPSSMPred. When SPSSMPred was validated on a large dataset (10,814 entries), the Q3 accuracy of the protein secondary structure prediction was 93.4%. Our approach was tested on the two latest EVA sets; accuracies of 82.7% and 82.0% were achieved, far higher than can be achieved using other predictors. For further evaluation, we tested our approach on newly determined sequences (141 entries), and obtained an accuracy of 89.6%. For a set of low-homology proteins (40 entries), SPSSMPred still achieved a Q3 value of 84.6%. AVAILABILITY The SPSSMPred server is available at http://cal.tongji.edu.cn/SPSSMPred/.
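The Q3 accuracy quoted in this abstract is the percentage of residues whose three-state secondary structure (helix, strand or coil) is predicted correctly. A minimal sketch follows; the function name and the H/E/C letter coding of the toy strings are assumptions for illustration.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose three-state label (H, E or C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 10-residue example: 8 of 10 states agree.
print(q3_accuracy("HHHECCCEEE", "HHHCCCCEEC"))  # 80.0
```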
BMC Bioinformatics | 2011
Zehui Tang; Tonghua Li; Rida Liu; Wenwei Xiong; Jiangming Sun; Yaojuan Zhu; Guanyan Chen
BACKGROUND The β-turn is a secondary protein structure type that plays an important role in protein configuration and function. Development of accurate prediction methods to identify β-turns in protein sequences is valuable. Several methods for β-turn prediction have been developed; however, the prediction quality is still a challenge and there is substantial room for improvement. Innovations of the proposed method focus on discovering effective features, and constructing a new architectural model. RESULTS We utilized predicted secondary structures, predicted shape strings and the position-specific scoring matrix (PSSM) as input features, and proposed a novel two-layer model to enhance the prediction. We achieved the highest values according to four evaluation measures, i.e. Q(total) = 87.2%, MCC = 0.66, Q(observed) = 75.9%, and Q(predicted) = 73.8% on the BT426 dataset. The results show that our proposed two-layer model discriminates better between β-turns and non-β-turns than the single model due to obtaining higher Q(predicted). Moreover, the predicted shape strings based on the structural alignment approach greatly improve the performance, and the same improvements were observed on BT547 and BT823 datasets as well. CONCLUSION In this article, we present a comprehensive method for the prediction of β-turns. Experiments show that the proposed method constitutes a great improvement over the competing prediction methods.
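The four evaluation measures named in this abstract can be computed from a residue-level confusion matrix. The sketch below uses the definitions commonly applied in turn prediction (Q(total) as overall accuracy, Q(predicted) as the share of predicted turns that are correct, Q(observed) as the share of observed turns that are found); the abstract itself does not spell these out, so they are an assumption here, as are the function name and the example counts.

```python
import math


def turn_metrics(tp, tn, fp, fn):
    """Return Q(total), Q(predicted), Q(observed) in percent, plus MCC."""
    total = tp + tn + fp + fn
    q_total = 100.0 * (tp + tn) / total
    # Guard against empty denominators (no predicted or no observed turns).
    q_predicted = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    q_observed = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"q_total": q_total, "q_predicted": q_predicted,
            "q_observed": q_observed, "mcc": mcc}


# Hypothetical counts, not taken from the BT426 results.
print(turn_metrics(tp=30, tn=50, fp=10, fn=10))
```

MCC is reported alongside the Q measures because turns are the minority class, so Q(total) alone can look high even when few turns are found.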
Nucleic Acids Research | 2009
Wenwei Xiong; Tonghua Li; Kai Chen; Kailin Tang
Sequence-based motif prediction is of great interest and remains a challenge. In this work, we develop a local combinational variable approach for sequence-based helix-turn-helix (HTH) motif prediction. First we choose a sequence data set for 88 proteins of 22 amino acids in length to launch an optimized traversal for extracting local combinational segments (LCS) from the data set. Then after LCS refinement, local combinational variables (LCV) are generated to construct prediction models for HTH motifs. The prediction ability of LCV sets at different thresholds is calculated to settle on a moderate threshold. The large data set we used comprises 13 HTH families, with 17 455 sequences in total. Our approach predicts HTH motifs more precisely using only primary protein sequence information, with 93.29% accuracy, 93.93% sensitivity and 92.66% specificity. Prediction results for newly reported HTH-containing proteins, compared with those of other prediction web services, indicate that the LCV approach yields a good prediction model. Comparisons with profile-HMM models from the Pfam protein families database show that the LCV approach maintains a good balance while dealing with HTH-containing proteins and non-HTH proteins at the same time. The LCV approach is to some extent complementary to the profile-HMM models because of its better identification of false-positive data. Furthermore, genome-wide predictions detect new HTH proteins in both Homo sapiens and Escherichia coli, which broadens the applications of the LCV approach. Software for mining LCVs from a sequence data set can be obtained freely from the anonymous FTP site ftp://cheminfo.tongji.edu.cn/LCV/.
Nucleic Acids Research | 2012
Jiangming Sun; Shengnan Tang; Wenwei Xiong; Peisheng Cong; Tonghua Li
Many studies have demonstrated that the shape string is an extremely important structure representation, since it is more complete than the classical secondary structure. The shape string also provides detailed information in regions denoted as random coil. However, few services are provided for systematic analysis of protein shape strings. To fill this gap, we have developed an accurate shape string predictor based on two innovative technologies: a knowledge-driven sequence alignment and a sequence shape string profile method. The performance on blind test data demonstrates that the proposed method can be used for accurate prediction of protein shape strings. The DSP server provides both the predicted shape string and the sequence shape string profile for each query sequence. Using this information, users can compare protein structures or display protein evolution in shape string space. The DSP server is available at both http://cheminfo.tongji.edu.cn/dsp/ and its main mirror http://chemcenter.tongji.edu.cn/dsp/.
Chemometrics and Intelligent Laboratory Systems | 1999
Tonghua Li; He Mei; Peisheng Cong
In this paper, a new algorithm, the nonlinear PLS improved by the numeric genetic algorithm, called NPLSNGA, is applied to deal with nonlinear functions for the inner relationship in QSAR. The NGA is used twice in NPLSNGA: once for nonlinear regression and once for nonlinear equations. Using a quadratic polynomial function as the inner relationship, the fungicidal activity of a series of O-ethyl-N-isopropylphosphoro(thioureido)thioates was studied. The results are superior to those of the reference. In the QSAR of carboquinone derivatives and an anticarcinogenic drug for clinical media, a sigmoid function was used as the inner relationship. The results are equivalent to those of an artificial neural network (ANN).
Nucleic Acids Research | 2013
Shengnan Tang; Tonghua Li; Peisheng Cong; Wenwei Xiong; Zhiheng Wang; Jiangming Sun
Knowledge of subcellular localizations (SCLs) of plant proteins relates to their functions and aids in understanding the regulation of biological processes at the cellular level. We present PlantLoc, a highly accurate and fast webserver for predicting the multi-label SCLs of plant proteins. The PlantLoc server has two innovative features: it builds localization motif libraries by a recursive method without alignment or Gene Ontology information, and it uses a simple architecture that identifies plant protein SCLs rapidly and accurately without a machine learning algorithm. PlantLoc reports the predicted SCLs, confidence estimates, the substantiality motif and its location on the sequence. Benchmarked on a new test dataset against other plant SCL prediction webservers, PlantLoc achieved the highest accuracy in identifying plant protein SCLs (overall accuracy of 80.8%). The ability of PlantLoc to predict multiple sites was also significantly higher than that of any other webserver. The predicted substantiality motifs of queries also have great potential for analysis of relationships with protein functional regions. The PlantLoc server is available at http://cal.tongji.edu.cn/PlantLoc/.
Amino Acids | 2012
Yaojuan Zhu; Tonghua Li; Dapeng Li; Yun Zhang; Wenwei Xiong; Jiangming Sun; Zehui Tang; Guanyan Chen
Numerous methods for predicting γ-turns in proteins have been developed. However, their results are generally poor, with a Matthews correlation coefficient (MCC) ≤ 0.18. Here, an attempt has been made to develop a method to improve the accuracy of γ-turn prediction. First, we employ the geometric mean metric as the optimization criterion to evaluate the performance of a support vector machine on the highly imbalanced γ-turn dataset. This metric tries to maximize both the sensitivity and the specificity while keeping them balanced. Second, a predictor that generates the protein shape string by structure alignment against the protein structure database has been designed, and the predicted shape string is introduced as a new variable for γ-turn prediction. On this basis, we have developed a new method for γ-turn prediction. After training and testing on the benchmark dataset of 320 non-homologous protein chains using a fivefold cross-validation technique, the present method achieves excellent performance. The overall prediction accuracy Qtotal reaches 92.2% and the MCC is 0.38, outperforming the existing γ-turn prediction methods. Our results indicate that the protein shape string is useful for predicting protein tight turns and that it is reasonable to use the dihedral angle information as a variable for machine learning to predict protein folding. The dataset used in this work and the software to generate the predicted shape string from the structure database can be obtained freely from the anonymous FTP site ftp://cheminfo.tongji.edu.cn/GammaTurnPrediction/.
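The geometric mean criterion mentioned in this abstract is usually defined as the square root of sensitivity times specificity. A minimal sketch under that assumption shows why it suits a highly imbalanced dataset better than plain accuracy; the function names and the example counts are hypothetical.

```python
import math


def g_mean(tp, tn, fp, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical imbalanced set: 5 turn residues, 95 non-turn residues.
# Classifier A predicts "non-turn" everywhere; classifier B finds 4 of the 5 turns.
a = dict(tp=0, tn=95, fp=0, fn=5)
b = dict(tp=4, tn=76, fp=19, fn=1)
print(accuracy(**a), g_mean(**a))
print(accuracy(**b), g_mean(**b))
```

Classifier A has the higher accuracy but a geometric mean of zero, because it never detects a turn. Classifier B scores lower on accuracy but keeps sensitivity and specificity balanced, which is what the criterion rewards.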