Xiao Sun
Southeast University
Publication
Featured research published by Xiao Sun.
Nucleic Acids Research | 2007
Peng Jiang; Haonan Wu; Wenkai Wang; Wei Ma; Xiao Sun; Zuhong Lu
To distinguish real pre-miRNAs from other hairpin sequences with similar stem-loops (pseudo pre-miRNAs), a hybrid feature is used that consists of local contiguous structure-sequence composition, the minimum free energy (MFE) of the secondary structure, and the P-value of a randomization test. In addition, a novel machine-learning algorithm, random forest (RF), is introduced. The results suggest that our method achieves 98.21% specificity and 95.09% sensitivity. Compared with the previous Triplet-SVM-classifier, our RF method was nearly 10% higher in total accuracy. Further analysis indicated that the improvement was due to both the combined features and the RF algorithm. The MiPred web server is available at http://www.bioinf.seu.edu.cn/miRNA/. Given a sequence, MiPred decides whether it is a pre-miRNA-like hairpin sequence or not. If the sequence is a pre-miRNA-like hairpin, the RF classifier will predict whether it is a real pre-miRNA or a pseudo one.
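As a rough illustration of the "local contiguous structure-sequence composition" part of the hybrid feature, the sketch below counts 32 triplet elements (a middle nucleotide plus the paired or unpaired state of three adjacent positions) from a sequence and its dot-bracket secondary structure. It assumes the structure has already been predicted by some external folding tool. The function name `triplet_composition`, the sliding over every position, and the normalization by window count are illustrative assumptions, not details confirmed by the abstract.

```python
from itertools import product

def triplet_composition(seq, structure):
    """Return normalized counts of 32 structure-sequence triplet elements.

    seq: RNA sequence over A, C, G, U (T is converted to U).
    structure: dot-bracket string of the same length as seq.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    seq = seq.upper().replace("T", "U")
    # Treat both bracket directions as the single state "paired".
    state = structure.replace(")", "(")
    # 4 nucleotides x 8 paired/unpaired patterns = 32 feature keys.
    keys = [n + "".join(s) for n in "ACGU" for s in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    windows = 0
    for i in range(1, len(seq) - 1):
        key = seq[i] + state[i - 1:i + 2]
        if key in counts:  # skip windows containing ambiguous symbols
            counts[key] += 1
            windows += 1
    if windows == 0:
        return {k: 0.0 for k in keys}
    return {k: c / windows for k, c in counts.items()}

# Toy hairpin: a 3-bp stem closing a 3-nt loop (hypothetical example input).
features = triplet_composition("GGGAAACCC", "(((...)))")
```

In a full pipeline these 32 values would be concatenated with the MFE and the randomization-test P-value before being passed to a classifier; that step is not shown here.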
Virus Research | 2004
Wanjun Gu; Tong Zhou; Jianmin Ma; Xiao Sun; Zuhong Lu
In this study, we calculated the codon usage bias in severe acute respiratory syndrome Coronavirus (SARSCoV) and performed a comparative analysis of synonymous codon usage patterns in SARSCoV and 10 other evolutionarily related viruses in the Nidovirales. Although there is significant variation in codon usage bias among different SARSCoV genes, codon usage bias in SARSCoV is slight and is mainly determined by the base composition at the third codon position. By comparing synonymous codon usage patterns in different viruses, we observed that the synonymous codon usage pattern in these virus genes was virus specific and phylogenetically conserved, but not host specific. Phylogenetic analysis based on codon usage pattern suggested that SARSCoV diverged far from all three known groups of Coronavirus. Compositional constraints could explain most of the variation of synonymous codon usage among these virus genes, while gene function is also correlated with synonymous codon usage to a certain extent. However, translational selection and gene length have no effect on the variation of synonymous codon usage in these virus genes.
Bioinformatics | 2009
Jiansheng Wu; Hongde Liu; Xueye Duan; Yan Ding; Hongtao Wu; Yunfei Bai; Xiao Sun
Motivation: In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and has robust performance for different parameter values. A novel hybrid feature is presented which incorporates evolutionary information of the amino acid sequence, secondary structure (SS) information and orthogonal binary vector (OBV) information which reflects the characteristics of 20 kinds of amino acids for two physical–chemical properties (dipoles and volumes of the side chains). The numbers of binding and non-binding residues in proteins are highly unbalanced, so a novel scheme is proposed to deal with the problem of imbalanced datasets by downsizing the majority class. Results: The results show that the RF model achieves 91.41% overall accuracy with Matthews correlation coefficient of 0.70 and an area under the receiver operating characteristic curve (AUC) of 0.913. To our knowledge, the RF method using the hybrid feature is currently the computationally optimal approach for predicting DNA-binding sites in proteins from amino acid sequences without using three-dimensional (3D) structural information. We have demonstrated that the prediction results are useful for understanding protein–DNA interactions. Availability: DBindR web server implementation is freely available at http://www.cbi.seu.edu.cn/DBindR/DBindR.htm. Supplementary information: Supplementary data are available at Bioinformatics online.
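The abstract does not spell out how the majority class is downsized, so the following is only a minimal sketch of the simplest variant, plain random undersampling, in which non-binding residues are randomly discarded until both classes have equal size. The function name `downsample_majority`, the 1/0 label encoding, and the toy data are assumptions made for illustration.

```python
import random

def downsample_majority(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes are equal in size.

    labels: 1 for binding residues, 0 for non-binding residues (assumed encoding).
    Returns the retained samples and labels in their original order.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority sample plus an equally sized random majority subset.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Toy data: 3 binding residues among 20 (hypothetical identifiers).
labels = [1] * 3 + [0] * 17
samples = [f"residue_{i}" for i in range(20)]
balanced_x, balanced_y = downsample_majority(samples, labels)
```

Undersampling discards information from the majority class, which is one reason it is often repeated with different random subsets; whether DBindR does so is not stated in the abstract.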
PLOS ONE | 2011
Zhidong Yuan; Xiao Sun; Hongde Liu; Jianming Xie
MicroRNAs (miRNAs) are a class of small noncoding RNAs that regulate gene expression by targeting mRNAs for translation repression or mRNA degradation. Many miRNAs are being discovered and studied, but in most cases their origin, evolution and function remain unclear. Here, we characterized miRNAs derived from repetitive elements and miRNA families expanded by segmental duplication events in the human, rhesus and mouse genomes. We applied a comparative genomics approach combined with identifying miRNA paralogs in segmental duplication pair data in a genome-wide study to identify new homologs of human miRNAs in the rhesus and mouse genomes. Interestingly, using segmental duplication pair data, we provided credible computational evidence that two miRNA genes are located in the pseudoautosomal region of the human Y chromosome. We classified all the miRNAs according to whether they were derived from repetitive elements and identified significant differences between the repeat-related miRNAs (RrmiRs) and non-repeat-derived miRNAs in (1) their location in protein-coding and intergenic regions in genomes, (2) the minimum free energy of their hairpin structures, and (3) their conservation in vertebrate genomes. We found some lineage-specific RrmiR families and three lineage-specific expansion families, and provided evidence indicating that some RrmiR families formed and expanded during evolutionary segmental duplication events. We also provided computational and experimental evidence for the functions of the conserved RrmiR families in the three species. Together, our results indicate that repetitive elements contribute to the origin of miRNAs, and large segmental duplication events could prompt the expansion of some miRNA families, including RrmiR families. Our study is a valuable contribution to the knowledge of the evolution and function of non-coding regions in the genome.
BioSystems | 2002
Jianmin Ma; Tong Zhou; Wanjun Gu; Xiao Sun; Zuhong Lu
The relative synonymous codon usage frequency of 135 MHC genes from four mammal species (Homo sapiens, Pan troglodytes, Macaca mulatta and Rattus norvegicus) is analyzed using a hierarchical cluster method. The result suggests that gene function is the dominant factor that determines codon usage bias, while species is a minor factor that determines further differences in codon usage bias for genes with similar functions. The conclusion may be useful in gene classification and gene function prediction.
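To make the quantity being clustered concrete, here is a small sketch of the standard relative synonymous codon usage (RSCU) calculation: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. It assumes the standard genetic code and an in-frame coding sequence; the function name `rscu` and the example sequence are illustrative, and the abstract does not confirm that exactly this formula was used.

```python
from collections import Counter

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))

def rscu(cds):
    """Return RSCU values for codons of amino acids observed in cds.

    Met, Trp and stop codons are skipped because they have no
    synonymous alternatives of interest.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    values = {}
    for aa in set(AMINO) - {"*", "M", "W"}:
        family = [c for c in CODONS if CODON_TABLE[c] == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in family:
            values[c] = counts[c] * len(family) / total
    return values

# Four alanine codons: GCT twice, GCC and GCA once, GCG never.
alanine_rscu = rscu("GCTGCTGCCGCA")
```

Each gene then becomes a vector of RSCU values, and any hierarchical clustering routine can be applied to those vectors; the clustering step itself is not shown.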
Biochemical and Biophysical Research Communications | 2008
Zhihua Liu; Jihong Meng; Xiao Sun
Traditional phylogenetic analysis is based on multiple sequence alignment. With the development of worldwide genome sequencing projects, more and more completely sequenced genomes are becoming available. However, traditional sequence alignment tools cannot handle large-scale genome sequences. The development of new algorithms that infer phylogenetic relationships from whole-genome information without alignment therefore represents a new direction for phylogenetic study in the post-genome era. In the present study, a novel algorithm based on BBC (base-base correlation) is proposed to analyze the phylogenetic relationships of HEV (Hepatitis E virus). When 48 HEV genome sequences are analyzed, the phylogenetic tree constructed with the BBC algorithm is consistent with that of a previous study. Compared with sequence alignment methods, the BBC algorithm calculates evolutionary distances between whole genome sequences more rapidly and does not require human intervention such as gene identification or parameter selection. The BBC algorithm can serve as an alternative for rapidly constructing phylogenetic trees and inferring evolutionary relationships.
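The abstract does not give the BBC formula, so the sketch below is only one plausible reading of an alignment-free base-base correlation: for each ordered base pair (i, j) it sums, over gaps 1..K, the term P_ij(gap) * log2(P_ij(gap) / (P_i * P_j)), giving a 16-dimensional vector per genome, and then compares genomes by Euclidean distance. The function names, the `max_gap` parameter and its default, and the choice of distance are all assumptions.

```python
import math
from itertools import product

BASES = "ACGT"

def bbc_vector(seq, max_gap=2):
    """Return a 16-dimensional base-base correlation vector for seq."""
    seq = seq.upper()
    n = len(seq)
    if n <= max_gap:
        raise ValueError("sequence must be longer than max_gap")
    freq = {b: seq.count(b) / n for b in BASES}  # single-base frequencies
    vector = []
    for i, j in product(BASES, repeat=2):
        total = 0.0
        for gap in range(1, max_gap + 1):
            pairs = n - gap
            hits = sum(1 for k in range(pairs)
                       if seq[k] == i and seq[k + gap] == j)
            p_ij = hits / pairs  # joint frequency of i and j at this gap
            if p_ij > 0:
                total += p_ij * math.log2(p_ij / (freq[i] * freq[j]))
        vector.append(total)
    return vector

def bbc_distance(seq_a, seq_b, max_gap=2):
    """Euclidean distance between two BBC vectors (requires Python 3.8+)."""
    return math.dist(bbc_vector(seq_a, max_gap), bbc_vector(seq_b, max_gap))

# Toy sequences standing in for whole genomes (hypothetical inputs).
d = bbc_distance("ACGTACGTAC", "AACCGGTTAA")
```

A pairwise distance matrix built this way could be handed to any standard tree-building method such as neighbor joining. The pure-Python inner loop is written for clarity, not speed.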
Proteins | 2011
Xin Ma; Jing Guo; Jiansheng Wu; Hongde Liu; Jia-Feng Yu; Jianming Xie; Xiao Sun
The identification of RNA‐binding residues in proteins is important in several areas such as protein function, posttranscriptional regulation and drug design. We have developed PRBR (Prediction of RNA Binding Residues), a novel method for identifying RNA‐binding residues from amino acid sequences. Our method combines a hybrid feature with the enriched random forest (ERF) algorithm. The hybrid feature is composed of predicted secondary structure information and three novel features: evolutionary information combined with conservation information of the physicochemical properties of amino acids, and information about the dependency of amino acids with regard to polarity‐charge and hydrophobicity in the protein sequences. Our results demonstrate that the PRBR model achieves a Matthews correlation coefficient (MCC) of 0.5637 and 88.63% overall accuracy (ACC) with 53.70% sensitivity (SE) and 96.97% specificity (SP). By comparing the performance of each feature, we found that all three novel features contribute to the improved predictions. The area under the curve (AUC) from receiver operating characteristic curve analysis was compared between the PRBR model and other models. The results show that PRBR achieves the highest AUC value (0.8675), indicating that PRBR performs very well at predicting RNA‐binding residues in proteins. The PRBR web‐server implementation is freely available at http://www.cbi.seu.edu.cn/PRBR/.
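For readers less familiar with the four reported metrics, the following sketch computes SE, SP, ACC and MCC from a binary confusion matrix using their standard definitions. The counts in the example are made up to show how an imbalanced test set can pair a moderate sensitivity with a high specificity; they are not taken from the paper.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy and MCC for a binary classifier."""
    se = tp / (tp + fn)                    # fraction of binding residues found
    sp = tn / (tn + fp)                    # fraction of non-binding residues kept out
    acc = (tp + tn) / (tp + fp + tn + fn)  # overall fraction correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}

# Hypothetical counts: 100 binding and 100 non-binding residues.
m = binary_metrics(tp=50, fp=10, tn=90, fn=50)
```

Because MCC uses all four cells of the confusion matrix, it is less flattered by class imbalance than accuracy, which is why it is commonly reported alongside ACC for residue-level binding prediction.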
BMC Evolutionary Biology | 2010
Zhidong Yuan; Xiao Sun; Dongke Jiang; Yan Ding; Z.H. Lu; Lejun Gong; Hongde Liu; Jianming Xie
Background: MicroRNAs (miRNAs) are a class of short regulatory RNAs encoded in the genome of DNA viruses, some single cell organisms, plants and animals. With the rapid development of technology, more and more miRNAs are being discovered. However, the origin and evolution of most miRNAs remain obscure. Here we report the origin and evolution dynamics of a human miRNA family. Results: We have shown that all members of the miR-1302 family are derived from MER53 elements. Although the conservation scores of the MER53-derived pre-miRNA sequences are low, we have identified 36 potential paralogs of MER53-derived miR-1302 genes in the human genome and 58 potential orthologs of the human miR-1302 family in placental mammals. We suggest that in placental species, this miRNA family has evolved following the birth-and-death model of evolution. Three possible mechanisms that can mediate miRNA duplication in evolutionary history have been proposed: the transposition of the MER53 element, segmental duplications and Alu-mediated recombination. Finally, we have found that the target genes of miR-1302 are over-represented in transportation, localization, and system development processes and in the positive regulation of cellular processes. Many of them are predicted to function in binding and transcription regulation. Conclusions: The members of miR-1302 family that are derived from MER53 elements are placental-specific miRNAs. They emerged at the early stage of the recent 180 million years since eutherian mammals diverged from marsupials. Under the birth-and-death model, the miR-1302 genes have experienced a complex expansion with some members evolving by segmental duplications and some by Alu-mediated recombination events.
Journal of Genetics and Genomics | 2007
Peng Jiang; Xiao Sun; Zuhong Lu
In this study, a comparative analysis of the codon usage bias was performed in Aeropyrum pernix K1 and two other phylogenetically related Crenarchaeota microorganisms (i.e., Pyrobaculum aerophilum str. IM2 and Sulfolobus acidocaldarius DSM 639). The results indicated that the synonymous codon usage in A. pernix K1 was less biased, which was highly correlated with the GC3S value. The codon usage patterns were phylogenetically conserved among these Crenarchaeota microorganisms. Comparatively, it is the species function rather than the gene function that determines their gene codon usage patterns. A. pernix K1, P. aerophilum str. IM2, and S. acidocaldarius DSM 639 live in different extreme conditions. It is presumed that the living environment played an important role in determining the codon usage pattern of these microorganisms. In addition, there was no strain-specific codon usage among these microorganisms. The extent of codon bias in A. pernix K1 and S. acidocaldarius DSM 639 was highly correlated with the gene expression level, but no such association was detected in the P. aerophilum str. IM2 genome.
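The GC3S value mentioned above is usually defined as the fraction of G or C at the third position of synonymous codons, that is, excluding Met, Trp and stop codons. The sketch below follows that common definition for an in-frame coding sequence under the standard genetic code; the function name `gc3s` and the example sequence are illustrative, and the abstract does not state which exact variant was computed.

```python
def gc3s(cds):
    """Return the G+C fraction at third positions of synonymous codons."""
    cds = cds.upper().replace("U", "T")
    # Codons without synonymous alternatives, plus the three stop codons.
    excluded = {"ATG", "TGG", "TAA", "TAG", "TGA"}
    gc = total = 0
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        if codon in excluded or set(codon) - set("ACGT"):
            continue  # skip non-synonymous or ambiguous codons
        total += 1
        gc += codon[2] in "GC"
    return gc / total if total else 0.0

# GCT, GCC and AAA are counted; ATG and TGG are excluded.
value = gc3s("GCTGCCATGTGGAAA")
```

Computing this per gene and correlating it with a codon bias index is a common way to test whether base composition drives the bias.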
IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2012
Xin Ma; Jing Guo; Hongde Liu; Jianming Xie; Xiao Sun
The recognition of DNA-binding residues in proteins is critical to our understanding of the mechanisms of DNA-protein interactions and gene expression, and for guiding drug design. Therefore, a prediction method, DNABR (DNA Binding Residues), is proposed for predicting DNA-binding residues in protein sequences using the random forest (RF) classifier with sequence-based features. Two types of novel sequence features are proposed in this study, which reflect the conservation of physicochemical properties of the amino acids and the correlation of amino acids between different sequence positions in terms of physicochemical properties. The first type of feature uses the evolutionary information combined with the conservation of physicochemical properties of the amino acids, while the second reflects the dependency effect of amino acids with regard to polarity-charge and hydrophobic properties in the protein sequences. These two features and an orthogonal binary vector which reflects the characteristics of 20 types of amino acids are used to build DNABR, a model to predict DNA-binding residues in proteins. The DNABR model achieves a value of 0.6586 for Matthews correlation coefficient (MCC) and 93.04 percent overall accuracy (ACC) with 68.47 percent sensitivity (SE) and 98.16 percent specificity (SP). Comparisons with each feature demonstrate that these two novel features contribute most to the improvement in predictive ability. Furthermore, performance comparisons with other approaches clearly show that DNABR has excellent prediction performance for detecting binding residues in putative DNA-binding proteins. The DNABR web-server system is freely available at http://www.cbi.seu.edu.cn/DNABR/.