Shanfeng Zhu
Fudan University
Publication
Featured research published by Shanfeng Zhu.
Briefings in Bioinformatics | 2014
Hao Ding; Ichigaku Takigawa; Hiroshi Mamitsuka; Shanfeng Zhu
Computationally predicting drug-target interactions is useful for selecting possible drug (or target) candidates for further biochemical verification. We focus on machine learning-based approaches, particularly similarity-based methods that use drug and target similarities, which show relationships among drugs and those among targets, respectively. These two similarities represent two emerging concepts, the chemical space and the genomic space. Typically, the methods combine these two types of similarities to generate models for predicting new drug-target interactions. This process is also closely related to much work in pharmacogenomics and chemical biology that attempts to understand the relationships between the chemical and genomic spaces. This background makes the similarity-based approaches attractive and promising. This article reviews the similarity-based machine learning methods for predicting drug-target interactions, which are state-of-the-art and have aroused great interest in bioinformatics. We describe each of these methods briefly, and empirically compare them under a uniform experimental setting to explore their advantages and limitations.
Briefings in Bioinformatics | 2012
Lianming Zhang; Keiko Udaka; Hiroshi Mamitsuka; Shanfeng Zhu
Binding of short antigenic peptides to major histocompatibility complex (MHC) molecules is a core step in the adaptive immune response. Precise identification of MHC-restricted peptides is of great significance for understanding the mechanism of the immune response and promoting the discovery of immunogenic epitopes. However, due to the extremely high MHC polymorphism and the huge cost of biochemical experiments, there are no experimentally measured binding data for most MHC molecules. To address the problem of predicting peptides binding to these MHC molecules, computational approaches called pan-specific methods have recently received keen interest. Pan-specific methods make use of experimentally obtained binding data of multiple alleles, by which binding peptides (binders) of not only these alleles but also alleles with no known binders can be predicted. To investigate the possibility of further improvement in the performance and usability of pan-specific methods, this article extensively reviews existing pan-specific methods and their web servers. We first present a general framework of pan-specific methods. Then, the strategies and performance of the methods, as well as the utilities of their web servers, are compared. Finally, we discuss future directions for improving pan-specific methods for MHC-peptide binding prediction.
knowledge discovery and data mining | 2013
Xiaodong Zheng; Hao Ding; Hiroshi Mamitsuka; Shanfeng Zhu
We address the problem of predicting new drug-target interactions from three inputs: known interactions, similarities over drugs and similarities over targets. This setting has been considered by many methods, which however share a common limitation: they allow only one similarity matrix over drugs and one over targets. The key idea of our approach is to use more than one similarity matrix over drugs as well as over targets, where weights over the multiple similarity matrices are estimated from data to automatically select the similarities that are effective for improving the performance of predicting drug-target interactions. We propose a factor model, named Multiple Similarities Collaborative Matrix Factorization (MSCMF), which projects drugs and targets into a common low-rank feature space that is further kept consistent with the weighted similarity matrices over drugs and over targets. The two low-rank matrices and the weights over similarity matrices are estimated by an alternating least squares algorithm. Our approach makes it possible to predict drug-target interactions from the two low-rank matrices collaboratively and to detect the similarities that are important for predicting drug-target interactions. This approach is general and applicable to any binary relation with similarities over elements, as found in many applications, such as recommender systems. In fact, MSCMF is an extension of weighted low-rank approximation for one-class collaborative filtering. We extensively evaluated the performance of MSCMF using both synthetic and real datasets. Experimental results showed that MSCMF selects similarities that are useful for improving predictive performance and that it outperforms six state-of-the-art methods for predicting drug-target interactions.
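The two ingredients described in this abstract, a weighted combination of several similarity matrices and interaction scores read off two low-rank factor matrices, can be illustrated with a minimal sketch. This is not the authors' MSCMF implementation: the function names, the toy matrices, the fixed weights and the fixed latent factors are all hypothetical, and the alternating least squares estimation step is omitted.

```python
def combine_similarities(matrices, weights):
    """Weighted sum of equally sized square similarity matrices.

    Weights are normalised to sum to 1. In MSCMF the weights are
    estimated from data; here they are fixed, hypothetical values.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(norm, matrices)) for j in range(n)]
        for i in range(n)
    ]


def predict_scores(drug_factors, target_factors):
    """Score every (drug, target) pair as the inner product of latent vectors."""
    return [
        [sum(a * b for a, b in zip(d, t)) for t in target_factors]
        for d in drug_factors
    ]


# Toy data (hypothetical): two similarity matrices over two drugs.
chem_sim = [[1.0, 0.2], [0.2, 1.0]]
atc_sim = [[1.0, 0.8], [0.8, 1.0]]
drug_sim = combine_similarities([chem_sim, atc_sim], [3.0, 1.0])

# Toy rank-2 latent factors for 2 drugs and 3 targets (hypothetical).
A = [[1.0, 0.0], [0.5, 0.5]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = predict_scores(A, B)
```

In a full method the factors `A` and `B` and the weights would be learned jointly rather than supplied by hand.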
Bioinformatics | 2009
Shanfeng Zhu; Jia Zeng; Hiroshi Mamitsuka
MOTIVATION Clustering MEDLINE documents is usually conducted with the vector space model, which computes the content similarity between two documents essentially as the inner product of their word vectors. Recently, the semantic information of the MeSH (Medical Subject Headings) thesaurus has been applied to clustering MEDLINE documents by mapping documents into MeSH concept vectors to be clustered. However, current approaches to using the MeSH thesaurus have two serious limitations: first, important semantic information may be lost when generating MeSH concept vectors, and second, the content information of the original text is discarded. METHODS Our new strategy includes three key points. First, we develop a sound method for measuring the semantic similarity between two documents over the MeSH thesaurus. Second, we combine both the semantic and content similarities to generate the integrated similarity matrix between documents. Third, we apply a spectral approach to clustering documents over the integrated similarity matrix. RESULTS Using 100 different datasets of MEDLINE records, we conduct extensive experiments with alternative measures and parameters. Experimental results show that integrating the semantic and content similarities outperforms using only one of the two similarities, and the difference is statistically significant. We further find the best parameter setting, which is consistent over all experimental conditions. We finally show a typical example of the resulting clusters, confirming the effectiveness of our strategy in improving MEDLINE document clustering. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
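As a rough illustration of combining content and semantic similarity, the sketch below computes a content similarity from word vectors and blends it with a given semantic similarity. Several things here are assumptions rather than details from the abstract: cosine similarity is used as a normalised form of the inner product, the blend is a simple linear combination with a hypothetical parameter `alpha`, and the semantic similarity value is supplied by hand instead of being derived from MeSH.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length word vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def integrated_similarity(content_sim, semantic_sim, alpha=0.5):
    """Linear blend of the two similarities; alpha weights the semantic part (assumed form)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * semantic_sim + (1.0 - alpha) * content_sim


# Toy word-count vectors for two documents (hypothetical).
doc1 = [1, 1, 0]
doc2 = [1, 0, 0]
content = cosine(doc1, doc2)

# Hypothetical semantic similarity between the same two documents.
semantic = 0.9
combined = integrated_similarity(content, semantic, alpha=0.5)
```

Applying `integrated_similarity` to every document pair would give the integrated similarity matrix that a spectral clustering step could then consume.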
PLOS ONE | 2012
Lianming Zhang; Yiqing Chen; Hau-San Wong; Shuigeng Zhou; Hiroshi Mamitsuka; Shanfeng Zhu
Motivation Accurate identification of peptides binding to specific Major Histocompatibility Complex Class II (MHC-II) molecules is of great importance for elucidating the underlying mechanism of immune recognition, as well as for developing effective epitope-based vaccines and promising immunotherapies for many severe diseases. Due to the extreme polymorphism of MHC-II alleles and the high cost of biochemical experiments, the development of computational methods for accurate prediction of binding peptides of MHC-II molecules, particularly for those with few or no experimental data, has become a topic of increasing interest. TEPITOPE is a widely used computational approach because of its good interpretability and relatively high performance. However, TEPITOPE can be applied to only 51 out of over 700 known HLA-DR molecules. Method We have developed a new method, called TEPITOPEpan, by extrapolating from the binding specificities of HLA-DR molecules characterized by TEPITOPE to those that are uncharacterized. First, each HLA-DR binding pocket is represented by amino acid residues that have close contact with the corresponding peptide binding core residues. Then the pocket similarity between two HLA-DR molecules is calculated as the sequence similarity of the residues. Finally, for an uncharacterized HLA-DR molecule, the binding specificity of each pocket is computed as a weighted average of the pocket binding specificities over HLA-DR molecules characterized by TEPITOPE. Result The performance of TEPITOPEpan has been extensively evaluated using various data sets from different viewpoints: predicting MHC binding peptides, identifying HLA ligands and T-cell epitopes, and recognizing binding cores. Among the four state-of-the-art competing pan-specific methods, for predicting binding specificities of unknown HLA-DR molecules, TEPITOPEpan was roughly the second best method next to NETMHCIIpan-2.0. Additionally, TEPITOPEpan achieved the best performance in recognizing binding cores.
We further analyzed the motifs detected by TEPITOPEpan, examining the corresponding literature of immunology. Its online server and PSSMs therein are available at http://www.biokdd.fudan.edu.cn/Service/TEPITOPEpan/.
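The extrapolation step described in the Method part, a similarity-weighted average of pocket binding specificities, might be sketched as follows. This is only an illustration under stated assumptions: pocket similarity is taken to be the fraction of identical residues (the abstract says only "sequence similarity"), and the residue strings and score profiles are invented toy values, not TEPITOPE data.

```python
def pocket_similarity(residues_a, residues_b):
    """Fraction of identical residues at aligned pocket positions.

    A simple stand-in for sequence similarity (an assumption of this sketch).
    """
    if len(residues_a) != len(residues_b):
        raise ValueError("pocket residue strings must have equal length")
    matches = sum(a == b for a, b in zip(residues_a, residues_b))
    return matches / len(residues_a)


def extrapolate_profile(query_pocket, characterized):
    """Similarity-weighted average of the profiles of characterized pockets.

    characterized: list of (pocket_residues, profile) pairs, where profile
    maps a peptide amino acid to a binding score for that pocket.
    """
    weights = [pocket_similarity(query_pocket, res) for res, _ in characterized]
    total = sum(weights)
    if total == 0:
        raise ValueError("query pocket shares no residues with any characterized pocket")
    amino_acids = characterized[0][1].keys()
    return {
        aa: sum(w * prof[aa] for w, (_, prof) in zip(weights, characterized)) / total
        for aa in amino_acids
    }


# Toy characterized pockets with hypothetical residues and scores.
known = [
    ("ACDF", {"L": 1.0, "K": -1.0}),   # shares 3 of 4 residues with the query
    ("AGHI", {"L": -1.0, "K": 3.0}),   # shares 1 of 4 residues with the query
]
profile = extrapolate_profile("ACDE", known)
```

Pockets that resemble the query more closely contribute more to the extrapolated profile, which is the intuition behind the weighted average.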
Journal of Immunological Methods | 2011
Guang Lan Zhang; Hifzur Rahman Ansari; Phil Bradley; Gavin C. Cawley; Tomer Hertz; Xihao Hu; Nebojsa Jojic; Yohan Kim; Oliver Kohlbacher; Ole Lund; Claus Lundegaard; Craig A. Magaret; Morten Nielsen; Harris Papadopoulos; Gajendra P. S. Raghava; Vider-Shalit Tal; Li C. Xue; Chen Yanover; Shanfeng Zhu; Michael T. Rock; James E. Crowe; Christos G. Panayiotou; Marios M. Polycarpou; Włodzisław Duch; Vladimir Brusic
Experimental studies of the immune system and related applications such as characterization of immune responses against pathogens, vaccine design, or optimization of therapies are combinatorially complex, time-consuming and expensive. The main methods for large-scale identification of T-cell epitopes from pathogens or cancer proteomes involve either reverse immunology or high-throughput mass spectrometry (HTMS). Reverse immunology approaches involve pre-screening of proteomes by computational algorithms, followed by experimental validation of selected targets (Mora et al., 2006; De Groot et al., 2008; Larsen et al., 2010). HTMS involves HLA typing, immunoaffinity chromatography of HLA molecules, HLA extraction, and chromatography combined with tandem mass spectrometry, followed by the application of computational algorithms for peptide characterization (Bassani-Sternberg et al., 2010). Hundreds of naturally processed HLA class I associated peptides have been identified in individual studies using HTMS in normal (Escobar et al., 2008), cancer (Antwi et al., 2009; Bassani-Sternberg et al., 2010), autoimmunity-related (Ben Dror et al., 2010), and infected samples (Wahl et al., 2010). Computational algorithms are essential steps in high-throughput identification of T-cell epitope candidates using both reverse immunology and HTMS approaches. Peptide binding to MHC molecules is the single most selective step in defining a T-cell epitope, and the accuracy of computational algorithms for prediction of peptide binding therefore determines the accuracy of the overall method. Computational predictions of peptide binding to HLA, both class I and class II, use a variety of algorithms ranging from binding motifs to advanced machine learning techniques (Brusic et al., 2004; Lafuente and Reche, 2009) and standards for their
Bioinformatics | 2006
Shanfeng Zhu; Keiko Udaka; John Sidney; Alessandro Sette; Kiyoko F. Aoki-Kinoshita; Hiroshi Mamitsuka
MOTIVATION Various computational methods have been proposed to tackle the problem of predicting the peptide binding ability for a specific MHC molecule. These methods are based on known binding peptide sequences. However, currently available peptide databases contain relatively few examples and are highly redundant. Existing studies show that MHC molecules can be classified into supertypes in terms of peptide-binding specificities. Therefore, we first give a method for reducing the redundancy in a given dataset based on information entropy, then present a novel approach for prediction by learning a predictive model from a dataset of binders for not only the molecule of interest but also other MHC molecules. RESULTS We experimented on the HLA-A family with the binding nonamers of the A1 supertype (HLA-A*0101, A*2601, A*2902, A*3002), A2 supertype (A*0201, A*0202, A*0203, A*0206, A*6802), A3 supertype (A*0301, A*1101, A*3101, A*3301, A*6801) and A24 supertype (A*2301 and A*2402), whose data were collected from six publicly available peptide databases and two private sources. The results show that our approach significantly improves the prediction accuracy of peptides that bind a specific HLA molecule when we combine binding data of HLA molecules in the same supertype. Our approach can thus be used to help find new binders for MHC molecules.
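The abstract does not spell out how information entropy is used to reduce redundancy, so the sketch below only shows the underlying quantity: the Shannon entropy of the residues observed at each position of a set of aligned nonamers. Low entropy at many positions suggests a set dominated by near-duplicate peptides. The function names and the toy peptides are hypothetical, and this is not the authors' reduction procedure.

```python
import math
from collections import Counter


def position_entropy(peptides, pos):
    """Shannon entropy (in bits) of the residues observed at one position."""
    counts = Counter(p[pos] for p in peptides)
    n = len(peptides)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def total_entropy(peptides):
    """Sum of per-position entropies over equal-length peptides."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    return sum(position_entropy(peptides, i) for i in range(length))


# Toy set of nonamers (hypothetical sequences) with obvious redundancy.
peptides = ["ALAKAAAAV", "ALAKAAAAV", "ALAKAAAAL", "AMAKAAAAL"]
per_position = [position_entropy(peptides, i) for i in range(9)]
overall = total_entropy(peptides)
```

One possible use, again an assumption, is to compare the entropy of a dataset before and after dropping candidate duplicates and keep the subset that preserves most of the diversity.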
Briefings in Bioinformatics | 2009
Jia Zeng; Shanfeng Zhu; Hong Yan
This review describes important advances that have been made during the past decade for genome-wide human promoter recognition. Interest in promoter recognition algorithms on a genome-wide scale is worldwide and touches on a number of practical systems that are important in analysis of gene regulation and in genome annotation without experimental support of ESTs, cDNAs or mRNAs. The main focus of this review is on feature extraction and model selection for accurate human promoter recognition, with descriptions of what they are, what has been accomplished, and what remains to be done.
Information Sciences | 2011
Xiaodi Huang; Xiaodong Zheng; Wei Yuan; Fei Wang; Shanfeng Zhu
Searching and mining biomedical literature databases are common ways for biomedical researchers to generate scientific hypotheses. Clustering can help researchers form hypotheses by surfacing valuable information from grouped documents. Although a large number of clustering algorithms are available, this paper attempts to answer the question of which algorithm is best suited to accurately clustering biomedical documents. Non-negative matrix factorization (NMF) has been widely applied to clustering general text documents. However, the clustering results are sensitive to the initial values of the parameters of NMF. To overcome this drawback, we present the ensemble NMF for clustering biomedical documents. The performance of ensemble NMF was evaluated on numerous datasets generated from the TREC Genomics track dataset. On most datasets, the experimental results demonstrate that the ensemble NMF significantly outperforms the classical clustering algorithms bisecting K-means and hierarchical clustering. We compared four different methods for constructing an ensemble NMF. For clustering biomedical documents, this research is the first to compare ensemble NMF with typical classical clustering algorithms, and it validates ensemble NMF constructed from different graph-based ensemble algorithms. This is also the first work on ensemble NMF with Hybrid Bipartite Graph Formulation for clustering biomedical documents.
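To make the ensemble idea concrete, the sketch below builds a co-association (consensus) matrix from several clusterings of the same documents, such as those produced by NMF runs with different random initializations. This is a generic ensemble building block chosen for illustration, not one of the four graph-based constructions compared in the paper, and the cluster labels are made-up toy values.

```python
def co_association(labelings):
    """Fraction of runs in which each pair of items lands in the same cluster.

    labelings: list of label lists, one per clustering run, all over the
    same items in the same order. Label values need not agree across runs.
    """
    n = len(labelings[0])
    if any(len(labels) != n for labels in labelings):
        raise ValueError("every run must label the same number of items")
    runs = len(labelings)
    return [
        [sum(labels[i] == labels[j] for labels in labelings) / runs for j in range(n)]
        for i in range(n)
    ]


# Toy labels from three hypothetical NMF runs over four documents.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],   # same partition as the first run, labels swapped
    [0, 0, 0, 1],
]
consensus = co_association(runs)
```

Because the matrix depends only on whether two documents share a cluster, it is unaffected by label permutations between runs, and a final clustering can be obtained by clustering the consensus matrix itself.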
Bioinformatics | 2016
Qingjun Yuan; Junning Gao; Dongliang Wu; Shihua Zhang; Hiroshi Mamitsuka; Shanfeng Zhu
Motivation: Identifying drug–target interactions is an important task in drug discovery. To reduce the heavy time and financial costs of experiments, many computational approaches have been proposed. Although these approaches have used many different principles, their performance is far from satisfactory, especially in predicting drug–target interactions of new candidate drugs or targets. Methods: Approaches based on machine learning for this problem can be divided into two types: feature-based and similarity-based methods. Learning to rank (LTR) is the most powerful technique among the feature-based methods. Similarity-based methods are well accepted, due to their idea of connecting the chemical and genomic spaces, represented by drug and target similarities, respectively. We propose a new method, DrugE-Rank, to improve the prediction performance by combining the advantages of the two types of methods. That is, DrugE-Rank uses LTR, for which multiple well-known similarity-based methods can be used as components of ensemble learning. Results: The performance of DrugE-Rank is thoroughly examined by three main experiments using data from DrugBank: (i) cross-validation on FDA (US Food and Drug Administration) approved drugs before March 2014; (ii) independent test on FDA approved drugs after March 2014; and (iii) independent test on FDA experimental drugs. Experimental results show that DrugE-Rank outperforms competing methods significantly, especially achieving more than 30% improvement in the area under the precision-recall curve for FDA approved new drugs and FDA experimental drugs. Availability: http://datamining-iip.fudan.edu.cn/service/DrugE-Rank Supplementary information: Supplementary data are available at Bioinformatics online.