Robert E. Langlois
University of Illinois at Chicago
Publication
Featured research published by Robert E. Langlois.
Nucleic Acids Research | 2005
Nitin Bhardwaj; Robert E. Langlois; Guijun Zhao; Hui Lu
DNA-binding proteins (DNA-BPs) play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Attempts have been made to identify DNA-BPs based on their sequence and structural information with moderate accuracy. Here we develop a machine learning protocol for the prediction of DNA-BPs that uses Support Vector Machines (SVMs) as the classifier. Information used for classification is derived from characteristics that include surface and overall composition, overall charge and positive potential patches on the protein surface. In total, 121 DNA-BPs and 238 non-binding proteins are used to build and evaluate the protocol. In self-consistency testing, an accuracy of 100% is achieved. For cross-validation (CV) optimization over the entire dataset, we report an accuracy of 90%. Using leave 1-pair holdout evaluation, an accuracy of 86.3% is achieved. When we restrict the dataset to less than 20% sequence identity amongst the proteins, the holdout accuracy is 85.8%. Furthermore, seven DNA-BPs with unbounded structures are all correctly predicted. The current performance is better than previously published results. The higher accuracy achieved here originates from two factors: the ability of the SVM to handle features that demonstrate a wide range of discriminatory power, and a different definition of the positive patch. Since our protocol does not lean on sequence or structural homology, it can be used to identify or predict proteins with DNA-binding function(s) regardless of their homology to the known ones.
Bioinformatics | 2007
Zhi-Qiang Ye; Shu Qi Zhao; Ge Gao; Xiao Qiao Liu; Robert E. Langlois; Hui Lu; Liping Wei
MOTIVATION: The rapid accumulation of single amino acid polymorphisms (SAPs), also known as non-synonymous single nucleotide polymorphisms (nsSNPs), brings both the opportunity and the need to understand and predict their disease association. Currently published attributes are limited and the detailed mechanisms governing the disease association of a SAP remain unclear; thus, further investigation of new attributes and improvement of the prediction are desired. RESULTS: A SAP dataset was compiled from the Swiss-Prot variant pages. We extracted and demonstrated the effectiveness of several new biologically informative attributes, including structural neighbor profiles that describe the SAP's microenvironment, nearby functional sites that measure the structure-based and sequence-based distances between the SAP site and its nearby functional sites, aggregation properties that measure the likelihood of protein aggregation, and disordered regions that consider whether the SAP is located in structurally disordered regions. The new attributes provided insights into the mechanisms of the disease association of SAPs. We built a support vector machines (SVMs) classifier employing a carefully selected set of new and previously published attributes. Through a strict protein-level 5-fold cross-validation, we attained an overall accuracy of 82.61% and an MCC of 0.60. Moreover, a web server was developed to provide a user-friendly interface for biologists. AVAILABILITY: The web server is available at http://sapred.cbi.pku.edu.cn/
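The abstract above reports a protein-level 5-fold cross-validation and a Matthews correlation coefficient (MCC). A minimal Python sketch of both ideas follows. It is not the authors' code: the function names `protein_level_folds` and `mcc` and the greedy fold-balancing strategy are assumptions made for illustration. The point of a protein-level split is that all SAPs from one protein stay in the same fold, so no protein contributes to both training and test data.

```python
import math
from collections import defaultdict


def protein_level_folds(protein_ids, k=5):
    """Split sample indices into k folds so that all samples sharing a
    protein ID land in the same fold (hypothetical helper; greedy balancing
    is an illustrative choice, not the published procedure)."""
    groups = defaultdict(list)
    for index, pid in enumerate(protein_ids):
        groups[pid].append(index)
    folds = [[] for _ in range(k)]
    # Place the largest proteins first, always into the currently smallest fold.
    for pid in sorted(groups, key=lambda p: (-len(groups[p]), p)):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(groups[pid])
    return folds


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when any marginal count is zero
    return (tp * tn - fp * fn) / denom
```

Each fold serves once as the test set while the remaining folds are used for training; the MCC is then computed from the pooled or per-fold confusion counts.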
Nucleic Acids Research | 2010
Matthew B. Carson; Robert E. Langlois; Hui Lu
Nucleic acid-binding proteins are involved in a great number of cellular processes. Understanding the mechanisms underlying these proteins first requires the identification of specific residues involved in nucleic acid binding. Prediction of NA-binding residues can provide practical assistance in the functional annotation of NA-binding proteins. Predictions can also be used to expedite mutagenesis experiments, guiding researchers to the correct binding residues in these proteins. Here, we present a method for the identification of amino acid residues involved in DNA- and RNA-binding using sequence-based attributes. The method used in this work combines the C4.5 algorithm with bootstrap aggregation and cost-sensitive learning. Our DNA-binding model achieved 79.1% accuracy, while the RNA-binding model reached an accuracy of 73.2%. The NAPS web server is freely available at http://proteomics.bioengr.uic.edu/NAPS.
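The abstract above combines a decision-tree learner with bootstrap aggregation (bagging) and cost-sensitive learning. The sketch below illustrates that combination in plain Python under loud assumptions: a one-feature decision stump stands in for C4.5, the misclassification costs `fn_cost` and `fp_cost` are hypothetical parameters, and none of the names come from the NAPS server or the paper.

```python
import random


def fit_stump(xs, ys, fn_cost=1.0, fp_cost=1.0):
    """Choose the threshold t minimising total misclassification cost for
    the rule 'predict 1 when x >= t' (a stump, not C4.5)."""
    best_t, best_cost = None, float("inf")
    # An infinite threshold allows the stump to predict 0 for every input.
    for t in sorted(set(xs)) + [float("inf")]:
        cost = 0.0
        for x, y in zip(xs, ys):
            pred = 1 if x >= t else 0
            if pred == 0 and y == 1:
                cost += fn_cost  # missed binding residue
            elif pred == 1 and y == 0:
                cost += fp_cost  # false alarm
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


def fit_bagged_stumps(xs, ys, n_models=25, fn_cost=1.0, fp_cost=1.0, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        sample_x = [xs[i] for i in idx]
        sample_y = [ys[i] for i in idx]
        thresholds.append(fit_stump(sample_x, sample_y, fn_cost, fp_cost))
    return thresholds


def predict(thresholds, x):
    """Majority vote over the bagged stumps."""
    votes = sum(1 for t in thresholds if x >= t)
    return 1 if 2 * votes > len(thresholds) else 0
```

Raising `fn_cost` relative to `fp_cost` pushes each stump toward calling more residues positive, which is one simple way to counter the class imbalance typical of binding-residue data.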
Nucleic Acids Research | 2010
Robert E. Langlois; Hui Lu
DNA-binding proteins perform vital functions related to transcription, repair and replication. We have developed a new sequence-based machine learning protocol to identify DNA-binding proteins. We compare our method with an extensive benchmark of previously published structure-based machine learning methods as well as a standard sequence alignment technique, BLAST. Furthermore, we elucidate important feature interactions found in a learned model and analyze how specific rules capture general mechanisms that extend across DNA-binding motifs. This analysis is carried out using the malibu machine learning workbench available at http://proteomics.bioengr.uic.edu/malibu and the corresponding data sets and features are available at http://proteomics.bioengr.uic.edu/dna.
International Conference of the IEEE Engineering in Medicine and Biology Society | 2005
Nitin Bhardwaj; Robert E. Langlois; Guijun Zhao; Hui Lu
Annotation of the functional sites on the surface of a protein has been the subject of many studies. In this regard, the search for attributes and features characterizing these sites is of prime consequence. Here, we present an implementation of a kernel-based machine learning protocol for identifying the residues on a DNA-binding protein that form the interface with the DNA. Sequence and structural features including solvent accessibility, local composition, net charge and electrostatic potentials are examined. These features are then fed into support vector machines (SVM) to predict the DNA-binding residues on the surface of the protein. In order to compare with published work, we predict binding residues by training on other binding and non-binding residues in the same protein, for which we achieved an accuracy of 79%. The sensitivity and specificity are 59% and 89%. We also consider a more realistic approach, predicting the binding residues of proteins entirely withheld from the training set, achieving accuracy, sensitivity and specificity of 66%, 43% and 81%, respectively. Performances reported here are better than other published results. Moreover, since our protocol does not lean on sequence or structural homology, it can be used to annotate unclassified proteins and more generally to identify novel binding sites with no similarity to the known cases.
Annals of Biomedical Engineering | 2007
Robert E. Langlois; Matthew B. Carson; Nitin Bhardwaj; Hui Lu
A protein’s function depends in large part on interactions with other molecules. With an increasing number of protein structures becoming available every year, a corresponding structural annotation approach identifying such interactions grows more expedient. At the same time, machine learning has gained popularity in bioinformatics, providing robust annotation of genes and proteins without sequence homology. Here we have developed a general machine learning protocol to identify proteins that bind DNA and membrane. In general, there is no theory or even rule of thumb to pick the best machine learning algorithm. Thus, we carry out a systematic comparison of several classification algorithms known to perform well. The boosted tree classifier is found to give the best performance, achieving 93% and 88% accuracy in discriminating non-homologous proteins that bind membrane and DNA, respectively, significantly outperforming all previously published works. We also attempted to address the importance of the attributes in function prediction and the relationships between relevant attributes. A graphical model based on boosted trees is applied to study the important features in discriminating DNA-binding proteins. In summary, the current protocol identified physical features important in DNA and membrane binding, rather than annotating function through sequence similarity.
Cell Biochemistry and Biophysics | 2009
Georgi Z. Genchev; Morten Källberg; Gamze Gürsoy; Anuradha Mittal; Lalit Dubey; Ognjen Perišić; Gang Feng; Robert E. Langlois; Hui Lu
Efficient communication between the cell and its external environment is of the utmost importance to the function of multicellular organisms. While signaling events can be generally characterized as information exchange by means of controlled energy conversion, research efforts have hitherto mainly been concerned with mechanisms involving chemical and electrical energy transfer. Here, we review recent computational efforts addressing the function of mechanical force in signal transduction. Specifically, we focus on the role of steered molecular dynamics (SMD) simulations in providing details at the atomic level on a group of protein domains, which play a fundamental role in signal exchange by responding properly to mechanical strain. We start by giving a brief introduction to the SMD technique and general properties of mechanically stable protein folds, followed by specific examples illustrating three general regimes of signal transfer utilizing mechanical energy: purely mechanical, mechanical to chemical, and chemical to mechanical. Whenever possible the physiological importance of the example at hand is stressed to highlight the diversity of the processes in which mechanical signaling plays a key role. We also provide an overview of future challenges and perspectives for this rapidly developing field.
International Journal of Bioinformatics Research and Applications | 2005
Robert E. Langlois; Alice Diec; Ognjen Perišić; Yang Dai; Hui Lu
Because of the relatively large gap between the number of known protein sequences and the number of known protein structures, the ability to construct a computational model predicting structure from sequence information has become an important area of research. The knowledge of a protein's structure is crucial in understanding its biological role. In this work, we present a support vector machine based method for recognising a protein's fold from sequence information alone, where this sequence has low similarity to sequences of known structures. We have focused on improving multi-class classification, parameter tuning, descriptor design, and feature selection. The current implementation demonstrates better prediction accuracy than previous similar approaches, and has similar performance when compared with straightforward threading.
International Conference of the IEEE Engineering in Medicine and Biology Society | 2008
Robert E. Langlois; Hui Lu
malibu is an open-source machine learning workbench developed in C/C++ for high-performance real-world applications, namely bioinformatics and medical informatics. It leverages third-party machine learning implementations for more robust, bug-free software. This workbench handles several well-studied supervised machine learning problems including classification, regression, importance-weighted classification and multiple-instance learning. The malibu interface was designed to create reproducible experiments ideally run in a remote and/or command line environment. The software can be found at: http://proteomics.bioengr.uic.edu/malibu/index.html
International Conference of the IEEE Engineering in Medicine and Biology Society | 2008
Matthew B. Carson; Robert E. Langlois; Hui Lu
CpG island (CpGI) methylation is an epigenetic modification that occurs in eukaryotes and is based on the addition of a methyl group to the number 5 carbon of the pyrimidine ring of cytosine. When methylation of a CpGI occurs, the associated gene (if any) is not expressed [1]. Aberrant methylation is thought to be a causative agent in disease [2] and drug sensitivity [3], [4]. In this work, we have predicted the methylation status of CpGIs in human chromosome 21 using sequence patterns. These patterns showed a significantly different distribution between methylated and unmethylated islands in a previous work [5]. Using C4.5 with bagging and cost-sensitive learning, we achieved 85.6% accuracy, 82.8% sensitivity, and 86.4% specificity. We then constructed 1000 alternating decision trees using a bootstrapping method and analyzed the nodes that were conserved between the trees. This allowed us to find specific combinations of sequence patterns that distinguished between methylated and unmethylated CpGIs. Analysis of these characteristics offers certain insight into the conditions that permit or prevent methylation.