Xiaofei Nan | Researchain

Publication

Featured researches published by Xiaofei Nan.

acm southeast regional conference | 2010

A comparison of a graph database and a relational database: a data provenance perspective

Chad Vicknair; Michael Macias; Zhendong Zhao; Xiaofei Nan; Yixin Chen; Dawn Wilkins

Relational databases have been around for many decades and are the database technology of choice for most traditional data-intensive storage and retrieval applications. Retrievals are usually accomplished using SQL, a declarative query language. Relational database systems are generally efficient unless the data contains many relationships requiring joins of large tables. Recently there has been much interest in data stores that do not use SQL exclusively, the so-called NoSQL movement. Examples are Google's BigTable and Facebook's Cassandra. This paper reports on a comparison of one such NoSQL graph database called Neo4j with a common relational database system, MySQL, for use as the underlying technology in the development of a software system to record and query data provenance information.

BMC Bioinformatics | 2012

Implementation of multiple-instance learning in drug activity prediction

Gang Fu; Xiaofei Nan; Haining Liu; Ronak Y. Patel; Pankaj R. Daga; Yixin Chen; Dawn Wilkins; Robert J. Doerksen

Background: In the context of drug discovery and development, much effort has been exerted to determine which conformers of a given molecule are responsible for the observed biological activity. In this work we aimed to predict bioactive conformers using a variant of supervised learning, named multiple-instance learning. A single molecule, treated as a bag of conformers, is biologically active if and only if at least one of its conformers, treated as an instance, is responsible for the observed bioactivity; and a molecule is inactive if none of its conformers is responsible for the observed bioactivity. The implementation requires instance-based embedding, and joint feature selection and classification. The goal of the present project is to implement multiple-instance learning in drug activity prediction, and subsequently to identify the bioactive conformers for each molecule. Methods: We encoded the 3-dimensional structures using pharmacophore fingerprints which are binary strings, and accomplished instance-based embedding using calculated dissimilarity distances. Four dissimilarity measures were employed and their performances were compared. 1-norm SVM was used for joint feature selection and classification. The approach was applied to four data sets, and the best proposed model for each data set was determined by using the dissimilarity measure yielding the smallest number of selected features. Results: The predictive abilities of the proposed approach were compared with three classical predictive models without instance-based embedding. The proposed approach produced the best predictive models for one data set and second best predictive models for the rest of the data sets, based on the external validations. To validate the ability of the proposed approach to find bioactive conformers, 12 small molecules with co-crystallized structures were seeded in one data set. 10 out of 12 co-crystallized structures were indeed identified as significant conformers using the proposed approach. Conclusions: The proposed approach was proven not to suffer from overfitting and to be highly competitive with classical predictive models, so it is very powerful for drug activity prediction. The approach was also validated as a useful method for pursuit of bioactive conformers.
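
The instance-based embedding described in this abstract can be illustrated with a minimal sketch. It rests on assumptions that are not in the abstract: fingerprints are written as strings of '0' and '1', the dissimilarity is Tanimoto (the abstract does not name its four measures), and a bag is embedded by taking the minimum dissimilarity over its conformers. The function names are hypothetical.

```python
def tanimoto_dissimilarity(a, b):
    """Return 1 - |a AND b| / |a OR b| for two equal-length bit strings."""
    both = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    # Two all-zero fingerprints are treated as identical (assumption).
    return 0.0 if either == 0 else 1.0 - both / either


def embed_bag(bag, prototypes, dissimilarity=tanimoto_dissimilarity):
    """Map a bag of conformer fingerprints to one fixed-length vector.

    Feature k is the smallest dissimilarity between any conformer in the
    bag and prototype k, so a single close conformer is enough to make
    the whole molecule look close to that prototype.
    """
    return [min(dissimilarity(conf, p) for conf in bag) for p in prototypes]


def closest_conformer(bag, prototype, dissimilarity=tanimoto_dissimilarity):
    """Index of the conformer responsible for the embedded value."""
    return min(range(len(bag)), key=lambda i: dissimilarity(bag[i], prototype))
```

Under these assumptions every molecule becomes a vector with one feature per prototype instance, which a sparse classifier such as the 1-norm SVM named in the abstract could then prune; `closest_conformer` shows one way to trace a selected feature back to a candidate conformer.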

Molecular Informatics | 2014

Quantitative Structure‐Activity Relationship Analysis and a Combined Ligand‐Based/Structure‐Based Virtual Screening Study for Glycogen Synthase Kinase‐3

Gang Fu; Sheng Liu; Xiaofei Nan; Olivia R. Dale; Zhendong Zhao; Yixin Chen; Dawn Wilkins; Susan P. Manly; Stephen J. Cutler; Robert J. Doerksen

Glycogen synthase kinase‐3 (GSK‐3) is a multifunctional serine/threonine protein kinase which regulates a wide range of cellular processes, involving various signalling pathways. GSK‐3β has emerged as an important therapeutic target for diabetes and Alzheimer’s disease. To identify structurally novel GSK‐3β inhibitors, we performed virtual screening by implementing a combined ligand‐based/structure‐based approach, which included quantitative structure‐activity relationship (QSAR) analysis and docking prediction. To integrate and analyze complex data sets from multiple experimental sources, we drafted and validated a hierarchical QSAR method, which adopts a two‐level structure to take data heterogeneity into account. A collection of 728 GSK‐3 inhibitors with diverse structural scaffolds was obtained from published papers that used different experimental assay protocols. Support vector machines and random forests were implemented with wrapper‐based feature selection algorithms to construct predictive learning models. The best models for each single group of compounds were then used to build the final hierarchical QSAR model, with an overall R² of 0.752 for the 141 compounds in the test set. The compounds obtained from the virtual screening experiment were tested for GSK‐3β inhibition. The bioassay results confirmed that 2 hit compounds are indeed GSK‐3β inhibitors exhibiting sub‐micromolar inhibitory activity, and therefore validated our combined ligand‐based/structure‐based approach as effective for virtual screening experiments.

Neurocomputing | 2012

Biomarker discovery using 1-norm regularization for multiclass earthworm microarray gene expression data

Xiaofei Nan; Nan Wang; Ping Gong; Chaoyang Zhang; Yixin Chen; Dawn Wilkins

Novel biomarkers can be discovered through mining high dimensional microarray datasets using machine learning techniques. Here we propose a novel recursive gene selection method which can handle the multiclass setting effectively and efficiently. The selection is performed iteratively. In each iteration, a linear multiclass classifier is trained using 1-norm regularization, which leads to sparse weight vectors, i.e., many feature weights are exactly zero. Those zero-weight features are eliminated in the next iteration. The empirical results demonstrate that the selected features (genes) have very competitive discriminative power. In addition, the selection process has a fast rate of convergence.
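
The iterative scheme in this abstract (train a sparse linear classifier, drop zero-weight features, repeat) can be sketched as follows. This is not the authors' implementation: the classifier here is an assumed softmax model fitted by proximal gradient descent with soft thresholding, chosen only because it produces exact zeros in plain Python, and all function names and hyperparameters are hypothetical.

```python
import math


def soft_threshold(w, t):
    """Proximal step for t * |w|; returns exactly 0.0 when |w| <= t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def train_l1_softmax(X, y, n_classes, lam=0.05, lr=0.5, epochs=200):
    """Fit a linear softmax classifier with a 1-norm penalty (no bias term).

    Returns W, where W[c][j] is the weight of feature j for class c.
    """
    n, d = len(X), len(X[0])
    W = [[0.0] * d for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * d for _ in range(n_classes)]
        for xi, yi in zip(X, y):
            scores = [sum(w * v for w, v in zip(W[c], xi)) for c in range(n_classes)]
            top = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            for c in range(n_classes):
                err = exps[c] / total - (1.0 if c == yi else 0.0)
                for j in range(d):
                    grad[c][j] += err * xi[j] / n
        for c in range(n_classes):
            for j in range(d):
                W[c][j] = soft_threshold(W[c][j] - lr * grad[c][j], lr * lam)
    return W


def recursive_selection(X, y, n_classes, **train_kwargs):
    """Retrain on surviving features until no zero-weight feature remains."""
    active = list(range(len(X[0])))
    while True:
        X_active = [[row[j] for j in active] for row in X]
        W = train_l1_softmax(X_active, y, n_classes, **train_kwargs)
        kept = [active[k] for k in range(len(active))
                if any(W[c][k] != 0.0 for c in range(n_classes))]
        if len(kept) == len(active) or not kept:
            return kept  # stable set reached (or every feature was dropped)
        active = kept
```

A feature survives a round if any class assigns it a nonzero weight, which is one plausible reading of "zero-weight features" in the multiclass setting; the loop stops as soon as a round eliminates nothing.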

BMC Bioinformatics | 2011

Leveraging domain information to restructure biological prediction

Xiaofei Nan; Gang Fu; Zhengdong Zhao; Sheng Liu; Ronak Y. Patel; Haining Liu; Pankaj R. Daga; Robert J. Doerksen; Xin Dang; Yixin Chen; Dawn Wilkins

Background: It is commonly believed that including domain knowledge in a prediction model is desirable. However, representing and incorporating domain information in the learning process is, in general, a challenging problem. In this research, we consider domain information encoded by discrete or categorical attributes. A discrete or categorical attribute provides a natural partition of the problem domain, and hence divides the original problem into several non-overlapping sub-problems. In this sense, the domain information is useful if the partition simplifies the learning task. The goal of this research is to develop an algorithm to identify discrete or categorical attributes that maximally simplify the learning task. Results: We consider restructuring a supervised learning problem via a partition of the problem space using a discrete or categorical attribute. A naive approach exhaustively searches all the possible restructured problems. It is computationally prohibitive when the number of discrete or categorical attributes is large. We propose a metric to rank attributes according to their potential to reduce the uncertainty of a classification task. It is quantified as a conditional entropy achieved using a set of optimal classifiers, each of which is built for a sub-problem defined by the attribute under consideration. To avoid high computational cost, we approximate the solution by the expected minimum conditional entropy with respect to random projections. This approach is tested on three artificial data sets, three cheminformatics data sets, and two leukemia gene expression data sets. Empirical results demonstrate that our method is capable of selecting a proper discrete or categorical attribute to simplify the problem, i.e., the performance of the classifier built for the restructured problem always beats that of the original problem. Conclusions: The proposed conditional entropy based metric is effective in identifying good partitions of a classification problem, hence enhancing the prediction performance.
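
The idea of ranking categorical attributes by conditional entropy can be illustrated with a deliberately simplified sketch. It computes the entropy of the raw class labels inside each partition cell; the metric in the abstract instead uses classifiers built per sub-problem and random projections, both of which are omitted here. The function names and the dictionary layout are assumptions.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(attribute_values, labels):
    """H(Y | A): label entropy averaged over the cells of the partition by A."""
    cells = defaultdict(list)
    for a, y in zip(attribute_values, labels):
        cells[a].append(y)
    n = len(labels)
    return sum(len(cell) / n * entropy(cell) for cell in cells.values())


def rank_attributes(columns, labels):
    """Sort attribute names so the most uncertainty-reducing one comes first.

    columns maps an attribute name to its list of values, one per sample.
    """
    return sorted(columns, key=lambda name: conditional_entropy(columns[name], labels))
```

An attribute whose partition yields pure cells scores zero and ranks first, while an attribute unrelated to the labels leaves the entropy unchanged.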

acm southeast regional conference | 2010

Learning to rank using 1-norm regularization and convex hull reduction

Xiaofei Nan; Yixin Chen; Xin Dang; Dawn Wilkins

The ranking problem appears in many areas of study such as customer rating, social science, economics, and information retrieval. Ranking can be formulated as a classification problem when pair-wise data is considered. However, this approach increases the problem complexity from linear to quadratic in terms of sample size. We present in this paper a convex hull reduction method to reduce this impact. We also propose a 1-norm regularization approach to simultaneously find a linear ranking function and to perform feature subset selection. The proposed method is formulated as a linear program. We present experimental results on artificial data and two real data sets: the concrete compressive strength data set and the Abalone data set.
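
The pair-wise reformulation mentioned in this abstract, and the quadratic growth it causes, can be shown with a short sketch. It assumes a common construction in which each pair of samples with different ranks becomes one difference vector labeled by which sample ranks higher; the abstract does not spell out its construction, and the convex hull reduction and the linear program are not shown. The function name is hypothetical.

```python
from itertools import combinations


def pairwise_differences(X, ranks):
    """Turn n ranked samples into up to n*(n-1)/2 classification examples.

    Each example is the difference X[i] - X[j]; its label is +1 when
    sample i ranks above sample j and -1 otherwise. Ties are skipped.
    """
    pairs, labels = [], []
    for i, j in combinations(range(len(X)), 2):
        if ranks[i] == ranks[j]:
            continue
        pairs.append([a - b for a, b in zip(X[i], X[j])])
        labels.append(1 if ranks[i] > ranks[j] else -1)
    return pairs, labels
```

With n samples and no ties the output holds n(n-1)/2 examples, which is the linear-to-quadratic blow-up the abstract sets out to reduce.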

bioinformatics and biomedicine | 2010

Gene selection using 1-norm regularization for multi-class microarray data

Xiaofei Nan; Nan Wang; Ping Gong; Chaoyang Zhang; Yixin Chen; Dawn Wilkins

Explosive compounds such as TNT and RDX have various toxicological effects on the natural environment. The goal of the earthworm microarray experiment is to unearth the biomarker for toxicity evaluation. We propose a novel recursive gene selection method which can handle the multi-class setting effectively and efficiently. The selection is performed iteratively. In each iteration, a linear multi-class classifier is trained using 1-norm regularization, which leads to sparse weight vectors, i.e., many feature weights are exactly zero. Those zero-weight features are eliminated in the next iteration. The empirical results demonstrate that the selected features (genes) have very competitive discriminative power. In addition, the selection process has a fast rate of convergence.

BMC Genomics | 2016

Predicting chemical bioavailability using microarray gene expression data and regression modeling: A tale of three explosive compounds

Ping Gong; Xiaofei Nan; Natalie D. Barker; Robert E. Boyd; Yixin Chen; Dawn Wilkins; David R. Johnson; Burton C. Suedel; Edward J. Perkins

Background: Chemical bioavailability is an important dose metric in environmental risk assessment. Although many approaches have been used to evaluate bioavailability, not a single approach is free from limitations. Previously, we developed a new genomics-based approach that integrated microarray technology and regression modeling for predicting bioavailability (tissue residue) of explosive compounds in exposed earthworms. In the present study, we further compared 18 different regression models and performed variable selection simultaneously with parameter estimation. Results: This refined approach was applied to both previously collected and newly acquired earthworm microarray gene expression datasets for three explosive compounds. Our results demonstrate that a prediction accuracy of R² = 0.71–0.82 was achievable at a relatively low model complexity with as few as 3–10 predictor genes per model. These results are much more encouraging than our previous ones. Conclusion: This study has demonstrated that our approach is promising for bioavailability measurement, which warrants further studies of mixed contamination scenarios in field settings.

Archive | 2010