Yanjun Qi | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Yanjun Qi is active.

Explore More

Publication

Featured researches published by Yanjun Qi.

Proteins | 2006

Evaluation of different biological data and computational classification methods for use in protein interaction prediction

Yanjun Qi; Ziv Bar-Joseph; Judith Klein-Seetharaman

Protein–protein interactions play a key role in many biological systems. High‐throughput methods can directly detect the set of interacting proteins in yeast, but the results are often incomplete and exhibit high false‐positive and false‐negative rates. Recently, many different research groups independently suggested using supervised learning methods to integrate direct and indirect biological data sources for the protein interaction prediction task. However, the data sources, approaches, and implementations varied. Furthermore, the protein interaction prediction task itself can be subdivided into prediction of (1) physical interaction, (2) co‐complex relationship, and (3) pathway co‐membership. To investigate systematically the utility of different data sources and the way the data is encoded as features for predicting each of these types of protein interactions, we assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks. Six different classifiers were used to assess the accuracy in predicting interactions, Random Forest (RF), RF similarity‐based k‐Nearest‐Neighbor, Naïve Bayes, Decision Tree, Logistic Regression, and Support Vector Machine. For all classifiers, the three prediction tasks had different success rates, and co‐complex prediction appears to be an easier task than the other two. Independently of prediction task, however, the RF classifier consistently ranked as one of the top two classifiers for all combinations of feature sets. Therefore, we used this classifier to study the importance of different biological datasets. First, we used the splitting function of the RF tree structure, the Gini index, to estimate feature importance. Second, we determined classification accuracy when only the top‐ranking features were used as an input in the classifier. We find that the importance of different features depends on the specific prediction task and the way they are encoded. Strikingly, gene expression is consistently the most important feature for all three prediction tasks, while the protein interactions identified using the yeast‐2‐hybrid system were not among the top‐ranking features under any condition. Proteins 2006.

Genome Biology | 2008

A critical assessment of Mus musculus gene function prediction using integrated genomic evidence

Lourdes Peña-Castillo; Murat Tasan; Chad L. Myers; Hyunju Lee; Trupti Joshi; Chao Zhang; Yuanfang Guan; Michele Leone; Andrea Pagnani; Wan-Kyu Kim; Chase Krumpelman; Weidong Tian; Guillaume Obozinski; Yanjun Qi; Guan Ning Lin; Gabriel F. Berriz; Francis D. Gibbons; Gert R. G. Lanckriet; Jian-Ge Qiu; Charles E. Grant; Zafer Barutcuoglu; David P. Hill; David Warde-Farley; Chris Grouios; Debajyoti Ray; Judith A. Blake; Minghua Deng; Michael I. Jordan; William Stafford Noble; Quaid Morris

Background:Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated.Results:In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%.Conclusion:We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.

pacific symposium on biocomputing | 2004

Random forest similarity for protein-protein interaction prediction from multiple sources.

Yanjun Qi; Judith Klein-Seetharaman; Ziv Bar-Joseph

One of the most important, but often ignored, parts of any clustering and classification algorithm is the computation of the similarity matrix. This is especially important when integrating high throughput biological data sources because of the high noise rates and the many missing values. In this paper we present a new method to compute such similarities for the task of classifying pairs of proteins as interacting or not. Our method uses direct and indirect information about interaction pairs to constructs a random forest (a collection of decision tress) from a training set. The resulting forest is used to determine the similarity between protein pairs and this similarity is used by a classification algorithm (a modified kNN) to classify protein pairs. Testing the algorithm on yeast data indicates that it is able to improve coverage to 20% of interacting pairs with a false positive rate of 50%. These results compare favorably with all previously suggested methods for this task indicating the importance of robust similarity estimates.

Information Retrieval | 2010

Learning to rank with (a lot of) word features

Bing Bai; Jason Weston; David Grangier; Ronan Collobert; Kunihiko Sadamasa; Yanjun Qi; Olivier Chapelle; Kilian Q. Weinberger

In this article we present Supervised Semantic Indexing which defines a class of nonlinear (quadratic) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained from a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as cross-language retrieval or online advertising placement. Dealing with models on all pairs of words features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but diagonal preserving) representations, correlated feature hashing and sparsification. We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.

conference on information and knowledge management | 2009

Supervised semantic indexing

Bing Bai; Jason Weston; David Grangier; Ronan Collobert; Kunihiko Sadamasa; Yanjun Qi; Olivier Chapelle; Kilian Q. Weinberger

In this article we propose Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as online advertising placement. Dealing with models on all pairs of words features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but diagonal preserving) representations, and correlated feature hashing (CFH). We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.

conference on information and knowledge management | 2011

Sentiment classification based on supervised latent n-gram analysis

Dmitriy Bespalov; Bing Bai; Yanjun Qi; Ali Shokoufandeh

In this paper, we propose an efficient embedding for modeling higher-order (n-gram) phrases that projects the n-grams to low-dimensional latent semantic space, where a classification function can be defined. We utilize a deep neural network to build a unified discriminative framework that allows for estimating the parameters of the latent space as well as the classification function with a bias for the target classification task at hand. We apply the framework to large-scale sentimental classification task. We present comparative evaluation of the proposed method on two (large) benchmark data sets for online product reviews. The proposed method achieves superior performance in comparison to the state of the art.

Bioinformatics | 2010

Semi-supervised multi-task learning for predicting interactions between HIV-1 and human proteins

Yanjun Qi; Oznur Tastan; Jaime G. Carbonell; Judith Klein-Seetharaman; Jason Weston

Motivation: Protein–protein interactions (PPIs) are critical for virtually every biological function. Recently, researchers suggested to use supervised learning for the task of classifying pairs of proteins as interacting or not. However, its performance is largely restricted by the availability of truly interacting proteins (labeled). Meanwhile, there exists a considerable amount of protein pairs where an association appears between two partners, but not enough experimental evidence to support it as a direct interaction (partially labeled). Results: We propose a semi-supervised multi-task framework for predicting PPIs from not only labeled, but also partially labeled reference sets. The basic idea is to perform multi-task learning on a supervised classification task and a semi-supervised auxiliary task. The supervised classifier trains a multi-layer perceptron network for PPI predictions from labeled examples. The semi-supervised auxiliary task shares network layers of the supervised classifier and trains with partially labeled examples. Semi-supervision could be utilized in multiple ways. We tried three approaches in this article, (i) classification (to distinguish partial positives with negatives); (ii) ranking (to rate partial positive more likely than negatives); (iii) embedding (to make data clusters get similar labels). We applied this framework to improve the identification of interacting pairs between HIV-1 and human proteins. Our method improved upon the state-of-the-art method for this task indicating the benefits of semi-supervised multi-task learning using auxiliary information. Availability: http://www.cs.cmu.edu/∼qyj/HIVsemi Contact: [email protected]

IEEE Pervasive Computing | 2004

Automated analysis of nursing home observations

Alexander G. Hauptmann; Jiang Gao; Rong Yan; Yanjun Qi; Jie Yang; Howard D. Wactlar

Pervasive activity monitoring in a skilled-nursing facility helps capture a continuous audio and video record. The CareMedia project analyzes this video information by automatically tracking people, helping to efficiently label individuals, and characterizing selected activities and actions.

Nucleic Acids Research | 2015

Cas9-chromatin binding information enables more accurate CRISPR off-target prediction

Ritambhara Singh; Cem Kuscu; Aaron R. Quinlan; Yanjun Qi; Mazhar Adli

The CRISPR system has become a powerful biological tool with a wide range of applications. However, improving targeting specificity and accurately predicting potential off-targets remains a significant goal. Here, we introduce a web-based CRISPR/Cas9 Off-target Prediction and Identification Tool (CROP-IT) that performs improved off-target binding and cleavage site predictions. Unlike existing prediction programs that solely use DNA sequence information; CROP-IT integrates whole genome level biological information from existing Cas9 binding and cleavage data sets. Utilizing whole-genome chromatin state information from 125 human cell types further enhances its computational prediction power. Comparative analyses on experimentally validated datasets show that CROP-IT outperforms existing computational algorithms in predicting both Cas9 binding as well as cleavage sites. With a user-friendly web-interface, CROP-IT outputs scored and ranked list of potential off-targets that enables improved guide RNA design and more accurate prediction of Cas9 binding or cleavage sites.

Archive | 2012

Random Forest for Bioinformatics

Yanjun Qi

Modern biology has experienced an increased use of machine learning techniques for large scale and complex biological data analysis. In the area of Bioinformatics, the Random Forest (RF) [6] technique, which includes an ensemble of decision trees and incorporates feature selection and interactions naturally in the learning process, is a popular choice. It is nonparametric, interpretable, efficient, and has high prediction accuracy for many types of data. Recent work in computational biology has seen an increased use of RF, owing to its unique advantages in dealing with small sample size, high-dimensional feature space, and complex data structures.

Explore More