Xue Wen Chen | Researchain

Publication

Featured research published by Xue Wen Chen.

Bioinformatics | 2005

Prediction of protein-protein interactions using random decision forest framework

Xue Wen Chen; Mei Liu

MOTIVATION: Protein interactions are of biological interest because they orchestrate a number of cellular processes such as metabolic pathways and immunological recognition. Domains are the building blocks of proteins; therefore, proteins are assumed to interact as a result of their interacting domains. Many domain-based models for protein interaction prediction have been developed, and preliminary results have demonstrated their feasibility. Most of the existing domain-based methods, however, consider only single-domain pairs (one domain from each protein) and assume independence between domain-domain interactions. RESULTS: In this paper, we introduce a domain-based random forest of decision trees to infer protein interactions. Our proposed method is capable of exploring all possible domain interactions and making predictions based on all the protein domains. Experimental results on a Saccharomyces cerevisiae dataset demonstrate that our approach can predict protein-protein interactions with higher sensitivity (79.78%) and specificity (64.38%) than the maximum likelihood approach. Furthermore, our model can be used to infer interactions not only for single-domain pairs but also for multiple domain pairs.
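The sensitivity and specificity figures quoted in this abstract are standard confusion-matrix ratios. A minimal sketch of how they can be computed is shown below; the function name and the toy labels are illustrative assumptions, not the authors' code or data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = interacting pair."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Sensitivity: fraction of true interactions that were predicted as interactions.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-interactions that were predicted as non-interactions.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy labels (hypothetical): three interacting pairs, two non-interacting pairs.
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```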

IEEE Transactions on Knowledge and Data Engineering | 2010

Combating the Small Sample Class Imbalance Problem Using Feature Selection

Michael Wasikowski; Xue Wen Chen

The class imbalance problem is encountered in real-world applications of machine learning and results in a classifier's suboptimal performance. Researchers have rigorously studied resampling, algorithmic, and feature selection approaches to this problem. No systematic studies have been conducted to understand how well these methods combat the class imbalance problem and which of these methods best manage the different challenges posed by imbalanced data sets. In particular, feature selection has rarely been studied outside of text classification problems. Additionally, no studies have looked at the additional problem of learning from small samples. This paper presents a first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (PRC). We compared each metric on the average performance across all problems and on the likelihood of a metric yielding the best performance on a specific problem. We examined the performance of these metrics inside each problem domain. Finally, we evaluated the efficacy of these metrics to see which perform best across algorithms. Our results showed that the signal-to-noise correlation coefficient (S2N) and Feature Assessment by Sliding Thresholds (FAST) are strong candidates for feature selection in most applications, especially when selecting very small numbers of features.
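As an illustration of one of the metrics named above, the signal-to-noise score of a feature is commonly defined as the difference of the class means divided by the sum of the class standard deviations. The sketch below assumes that common definition, uses the sample standard deviation, and ranks features by the absolute score; the function names and toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def s2n_score(values, labels):
    """Absolute signal-to-noise score of one feature for binary labels (1 vs 0).

    Assumes each class has at least two samples, since stdev needs two points.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    denom = stdev(pos) + stdev(neg)
    if denom == 0:
        return 0.0  # constant feature within both classes: treat as uninformative
    return abs(mean(pos) - mean(neg)) / denom


def rank_features(columns, labels):
    """Return feature indices sorted from highest to lowest S2N score."""
    scores = [s2n_score(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda i: -scores[i])


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
labels = [0, 0, 0, 1, 1, 1]
columns = [[1, 2, 3, 7, 8, 9], [5, 1, 4, 2, 3, 6]]
ranking = rank_features(columns, labels)
```

Selecting the top k entries of the ranking gives a small feature subset, which is the regime the abstract highlights.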

Knowledge Discovery and Data Mining | 2008

FAST: a ROC-based feature selection metric for small samples and imbalanced data classification problems

Xue Wen Chen; Michael Wasikowski

The class imbalance problem is encountered in a large number of practical applications of machine learning and data mining, for example, information retrieval and filtering, and the detection of credit card fraud. It has been widely realized that this imbalance raises issues that are either nonexistent or less severe in balanced class cases and often results in a classifier's suboptimal performance. This is even more true when the imbalanced data are also high dimensional. In such cases, feature selection methods are critical to achieve optimal performance. In this paper, we propose a new feature selection method, Feature Assessment by Sliding Thresholds (FAST), which is based on the area under a ROC curve generated by moving the decision boundary of a single feature classifier with thresholds placed using an even-bin distribution. FAST is compared to two commonly used feature selection methods, correlation coefficient and RELevance In Estimating Features (RELIEF), for imbalanced data classification. The experimental results obtained on text mining, mass spectrometry, and microarray data sets showed that the proposed method outperformed both RELIEF and correlation methods on skewed data sets and was comparable on balanced data sets; when a small number of features is preferred, the classification performance of the proposed method was significantly improved compared to correlation and RELIEF-based methods.
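The idea described in this abstract can be sketched roughly as follows: sort one feature's values, split them into bins holding equal numbers of samples, place one threshold per bin, and measure the area under the ROC curve traced by the single-feature classifier "value > threshold". This is a minimal sketch and not the authors' implementation; placing each threshold at the bin mean, using the trapezoid rule, and taking max(AUC, 1 - AUC) so that feature direction does not matter are all assumptions made here.

```python
def fast_score(values, labels, n_bins=10):
    """ROC-area score of a single feature using even-bin thresholds (sketch)."""
    n = len(values)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Even-bin thresholds: each bin holds roughly the same number of samples.
    ordered = sorted(values)
    n_bins = max(1, min(n_bins, n))
    thresholds = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        thresholds.append(sum(chunk) / len(chunk))  # assumption: threshold = bin mean

    # One ROC point per threshold, plus the two trivial end points.
    points = {(0.0, 0.0), (1.0, 1.0)}
    for thr in thresholds:
        tp = sum(1 for v, y in zip(values, labels) if y == 1 and v > thr)
        fp = sum(1 for v, y in zip(values, labels) if y != 1 and v > thr)
        points.add((fp / n_neg, tp / n_pos))

    # Trapezoid rule over the points sorted by false positive rate.
    curve = sorted(points)
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(curve, curve[1:]))
    return max(auc, 1.0 - auc)


# Toy data (hypothetical): one separating feature and one uninformative feature.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
informative = fast_score([1, 2, 3, 4, 5, 6, 7, 8], labels, n_bins=4)
noise = fast_score([1, 8, 2, 7, 3, 6, 4, 5], labels, n_bins=4)
```

Because thresholds follow the sample distribution instead of being evenly spaced over the value range, sparse regions of a skewed feature do not waste thresholds.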

Signal Processing | 2013

Sparse representation and learning in visual recognition: Theory and applications

Hong Cheng; Zicheng Liu; Lu Yang; Xue Wen Chen

Sparse representation and learning have been widely used in computational intelligence, machine learning, computer vision and pattern recognition, etc. Mathematically, solving sparse representation and learning involves seeking the sparsest linear combination of basis functions from an overcomplete dictionary. A rationale behind this is the sparse connectivity between nodes in the human brain. This paper presents a survey of some recent work on sparse representation, learning and modeling with emphasis on visual recognition. It covers both the theory and application aspects. We first review the sparse representation and learning theory including general sparse representation, structured sparse representation, high-dimensional nonlinear learning, Bayesian compressed sensing, sparse subspace learning, non-negative sparse representation, robust sparse representation, and efficient sparse representation. We then introduce the applications of sparse theory to various visual recognition tasks, including feature representation and selection, dictionary learning, Sparsity Induced Similarity (SIS) measures, sparse coding based classification frameworks, and sparsity-related topics.
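The search for the sparsest linear combination mentioned in this abstract is usually written as the optimization problem below. The notation is assumed here for illustration and is not taken from the survey: y is the signal, D is the overcomplete dictionary (more columns than rows), x is the coefficient vector, epsilon is a noise tolerance, and lambda is a regularization weight. The second line is the widely used convex relaxation that replaces the count of nonzero coefficients with the l1 norm.

```latex
\min_{x} \; \|x\|_0 \quad \text{subject to} \quad \|y - Dx\|_2 \le \epsilon,
\qquad D \in \mathbb{R}^{m \times n},\; m < n

\min_{x} \; \tfrac{1}{2}\,\|y - Dx\|_2^2 + \lambda\,\|x\|_1
```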

Bioinformatics | 2009

Sequence-based prediction of protein interaction sites with an integrative method

Xue Wen Chen; Jong Cheol Jeong

MOTIVATION: Identification of protein interaction sites has a significant impact on understanding protein function, elucidating signal transduction networks and drug design studies. With the exponentially growing protein sequence data, predictive methods using sequence information only for protein interaction site prediction have drawn increasing interest. In this article, we propose a predictive model for identifying protein interaction sites. Without using any structure data, the proposed method extracts a wide range of features from protein sequences. A random forest-based integrative model is developed to effectively utilize these features and to deal with the imbalanced data classification problem commonly encountered in binding site predictions. RESULTS: We evaluate the predictive method using 2829 interface residues and 24,616 non-interface residues extracted from 99 polypeptide chains in the Protein Data Bank. The experimental results show that the proposed method performs significantly better than two other sequence-based predictive methods and can reliably predict residues involved in protein interaction sites. Furthermore, we apply the method to predict interaction sites and to construct three protein complexes: the DnaK molecular chaperone system, 1YUW and 1DKG, which provide new insight into the sequence-function relationship. We show that the predicted interaction sites can be valuable as a first approach for guiding experimental methods investigating protein-protein interactions and localizing the specific interface residues. AVAILABILITY: Datasets and software are available at http://ittc.ku.edu/~xwchen/bindingsite/prediction.

IEEE Transactions on Knowledge and Data Engineering | 2008

Improving Bayesian Network Structure Learning with Mutual Information-Based Node Ordering in the K2 Algorithm

Xue Wen Chen; Gopalakrishna Anantha; Xiaotong Lin

Structure learning of Bayesian networks is a well-researched but computationally hard task. We present an algorithm that integrates an information-theory-based approach and a scoring-function-based approach for learning structures of Bayesian networks. Our algorithm also makes use of basic Bayesian network concepts like d-separation and conditional independence. We show that the proposed algorithm is capable of handling networks with a large number of variables. We demonstrate the applicability of the proposed algorithm on four standard network data sets and also compare its performance and computational efficiency with other standard structure-learning methods. The experimental results show that our method can efficiently and accurately identify complex network structures from data.
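The information-theoretic ingredient named in the title is mutual information between discrete variables, which can be estimated from co-occurrence counts. The sketch below shows that estimate together with one simple way to turn pairwise scores into a node order. The ordering helper (ranking variables by their summed mutual information with all others) is a hypothetical heuristic for illustration only; the abstract does not specify the paper's actual ordering procedure.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def order_by_total_mi(data):
    """Hypothetical heuristic: sort variable names by summed MI with all other variables."""
    names = list(data)
    totals = {
        a: sum(mutual_information(data[a], data[b]) for b in names if b != a)
        for a in names
    }
    return sorted(names, key=lambda name: -totals[name])


# Toy data (hypothetical): "a" and "b" are identical, "c" is independent of both.
data = {"a": [0, 0, 1, 1], "b": [0, 0, 1, 1], "c": [0, 1, 0, 1]}
order = order_by_total_mi(data)
```

A search procedure such as K2 takes a node order as input, so any order produced this way would be passed to it as a preprocessing step.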

BMC Bioinformatics | 2010

A Markov blanket-based method for detecting causal SNPs in GWAS

Bing Han; Meeyoung Park; Xue Wen Chen

Background: Detecting epistatic interactions associated with complex and common diseases can help to improve prevention, diagnosis and treatment of these diseases. With the development of genome-wide association studies (GWAS), designing powerful and robust computational methods for identifying epistatic interactions associated with common diseases has become a great challenge to the bioinformatics community, because the study of epistatic interactions often deals with the large size of the genotyped data and the huge number of combinations of all the possible genetic factors. Most existing computational detection methods are based on the classification capacity of SNP sets, which may fail to identify SNP sets that are strongly associated with the diseases and introduce a lot of false positives. In addition, most methods are not suitable for genome-wide scale studies due to their computational complexity. Results: We propose a new Markov Blanket-based method, DASSO-MB (Detection of ASSOciations using Markov Blanket), to detect epistatic interactions in case-control GWAS. The Markov blanket of a target variable T can completely shield T from all other variables. Thus, we can guarantee that the SNP set detected by DASSO-MB has a strong association with diseases and contains the fewest false positives. Furthermore, DASSO-MB uses a heuristic search strategy by calculating the association between variables to avoid the time-consuming training process used in other machine-learning methods. We apply our algorithm to simulated datasets and a real case-control dataset. We compare DASSO-MB to other commonly used methods and show that our method significantly outperforms other methods and is capable of finding SNPs strongly associated with diseases. Conclusions: Our study shows that DASSO-MB can identify a minimal set of causal SNPs associated with diseases, which contains fewer false positives compared to other existing methods. Given the huge size of the genomic datasets produced by GWAS, this is critical in saving the potential costs of biological experiments and in providing an efficient guideline for pathogenesis research.

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2011

On Position-Specific Scoring Matrix for Protein Function Prediction

Jong Cheol Jeong; Xiaotong Lin; Xue Wen Chen

While genome sequencing projects have generated tremendous amounts of protein sequence data for a vast number of genomes, substantial portions of most genomes are still unannotated. Despite the success of experimental methods for identifying protein functions, they are often lab intensive and time consuming. Thus, it is only practical to use in silico methods for genome-wide functional annotation. In this paper, we propose new features extracted from protein sequence only and machine learning-based methods for computational function prediction. These features are derived from a position-specific scoring matrix, which has shown great potential in other bioinformatics problems. We evaluate these features using four different classifiers and yeast protein data. Our experimental results show that features derived from the position-specific scoring matrix are appropriate for automatic function annotation.

Intelligent Systems in Molecular Biology | 2005

Bayesian neural network approaches to ovarian cancer identification from high-resolution mass spectrometry data

Jiangsheng Yu; Xue Wen Chen

MOTIVATION: The classification of high-dimensional data is always a challenge to statistical machine learning. We propose a novel method named shallow feature selection that assigns each feature a probability of being selected based on the structure of the training data itself. Independent of particular classifiers, the high dimension of biodata can be quickly reduced to a manageable size for subsequent processing. Moreover, to improve both efficiency and performance of classification, these prior probabilities are further used to specify the distributions of top-level hyperparameters in hierarchical models of Bayesian neural network (BNN), as well as the parameters in Gaussian process models. RESULTS: Three BNN approaches were derived and then applied to identify ovarian cancer from NCI's high-resolution mass spectrometry data, which yielded an excellent performance in 1000 independent k-fold cross validations (k = 2,...,10). For instance, indices of average sensitivity and specificity of 98.56 and 98.42%, respectively, were achieved in the 2-fold cross validations. Furthermore, only one control and one cancer were misclassified in the leave-one-out cross validation. Some other popular classifiers were also tested for comparison. AVAILABILITY: The programs implemented in MatLab, R and Neal's fbm.2004-11-10.

BMC Medical Informatics and Decision Making | 2014

Mining adverse drug reactions from online healthcare forums using hidden Markov model

Hariprasad Sampathkumar; Xue Wen Chen; Bo Luo

Background: Adverse Drug Reactions are one of the leading causes of injury or death among patients undergoing medical treatments. Not all Adverse Drug Reactions are identified before a drug is made available in the market. Current post-marketing drug surveillance methods, which are based purely on voluntary spontaneous reports, are unable to provide the early indications necessary to prevent the occurrence of such injuries or fatalities. The objective of this research is to extract reports of adverse drug side-effects from messages in online healthcare forums and use them as early indicators to assist in post-marketing drug surveillance. Methods: We treat the task of extracting adverse side-effects of drugs from healthcare forum messages as a sequence labeling problem and present a Hidden Markov Model (HMM) based Text Mining system that can be used to classify a message as containing drug side-effect information and then extract the adverse side-effect mentions from it. A manually annotated dataset from http://www.medications.com is used in the training and validation of the HMM based Text Mining system. Results: A 10-fold cross-validation on the manually annotated dataset yielded on average an F-Score of 0.76 from the HMM Classifier, in comparison to 0.575 from the Baseline classifier. Without the Plain Text Filter component as a part of the Text Processing module, the F-Score of the HMM Classifier was reduced to 0.378 on average, while absence of the HTML Filter component was found to have no impact. Reducing the Drug names dictionary size by half reduced the F-Score of the HMM Classifier to 0.359 on average, while a similar reduction to the side-effects dictionary yielded an F-Score of 0.651 on average. Adverse side-effects mined from http://www.medications.com and http://www.steadyhealth.com were found to match the Adverse Drug Reactions on the Drug Package Labels of several drugs. In addition, some novel adverse side-effects, which can be potential Adverse Drug Reactions, were also identified. Conclusions: The results from the HMM based Text Miner encourage further enhancements to this approach. The mined novel side-effects can act as early indicators for health authorities to help focus their efforts in post-marketing drug surveillance.
