Lifeng Peng
Victoria University of Wellington
Publication
Featured research published by Lifeng Peng.
congress on evolutionary computation | 2013
Soha Ahmed; Mengjie Zhang; Lifeng Peng
Biomarker detection in LC-MS data depends mainly on feature selection algorithms, as the number of features is extremely high while the number of samples is very small. This makes classification of these data sets extremely challenging. In this paper we propose the use of genetic programming (GP) for feature subset selection in LC-MS data, in which GP maximizes the signal-to-noise ratio of the selected features. The proposed method was applied to eight LC-MS data sets with different sample sizes and different concentration levels of the spiked biomarkers. We evaluated the selected features both against the list of known biomarkers and by their classification accuracy with support vector machine (SVM) and Naive Bayes (NB) classifiers. Features selected by the proposed GP method achieved perfect classification accuracy for most of the data sets. The results show that the proposed method strikes a reasonable compromise between the detection rate of the biomarkers and the classification accuracy for all data sets. The method was also compared to linear Support Vector Machine-Recursive Feature Elimination (SVM-RFE) and the t-test for feature selection, and the results show that the biomarker detection rate of the proposed approach is higher.
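The signal-to-noise idea in this abstract can be illustrated with a small sketch. The abstract does not give the exact fitness function, so the code below assumes the common two-class form |mean_a - mean_b| / (std_a + std_b) and simply ranks features by it. The function names and the handling of zero spread are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def signal_to_noise(class_a, class_b):
    """Two-class signal-to-noise ratio for one feature (assumed form)."""
    difference = abs(mean(class_a) - mean(class_b))
    spread = pstdev(class_a) + pstdev(class_b)
    if spread == 0:
        # Assumed convention: no spread means perfect or no separation.
        return float("inf") if difference > 0 else 0.0
    return difference / spread


def rank_features(samples_a, samples_b):
    """Return feature indices ordered from highest to lowest ratio.

    samples_a and samples_b are lists of rows, one row per sample.
    """
    n_features = len(samples_a[0])
    scores = []
    for j in range(n_features):
        column_a = [row[j] for row in samples_a]
        column_b = [row[j] for row in samples_b]
        scores.append(signal_to_noise(column_a, column_b))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Toy data: feature 0 separates the classes, feature 1 barely does.
healthy = [[1.0, 5.0], [3.0, 5.1]]
spiked = [[10.0, 5.0], [12.0, 5.2]]
ranking = rank_features(healthy, spiked)
```

In a GP setting, a score of this kind would be computed on the features that an evolved tree selects rather than on each raw feature in turn.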
genetic and evolutionary computation conference | 2014
Soha Ahmed; Mengjie Zhang; Lifeng Peng; Bing Xue
Biomarker identification, i.e., detecting the features that indicate differences between two or more classes, is an important task in omics sciences. Mass spectrometry (MS) provides a high-throughput analysis of proteomic and metabolomic data. The number of features in MS data sets far exceeds the number of samples, making biomarker identification extremely difficult. Feature construction can provide a means of solving this problem by transforming the original features into a smaller number of high-level features. This paper investigates the construction of multiple features using genetic programming (GP) for biomarker identification and classification of mass spectrometry data. Multiple features are constructed using GP by adopting an embedded approach in which the Fisher criterion and p-values are used to measure the discriminating information between the classes. This produces nonlinear high-level features from the low-level features for both binary and multi-class mass spectrometry data sets. Seven different classifiers are used to test the effectiveness of the constructed features. The proposed GP method is tested on eight different mass spectrometry data sets. The results show that, in most cases, the high-level features constructed by the GP method improve classification performance over both the original set of features and the low-level selected features. In addition, the new method shows superior performance in terms of biomarker detection rate.
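As a rough illustration of scoring a constructed feature with the Fisher criterion, the sketch below uses the standard two-class form, squared difference of class means divided by the sum of class variances. The abstract does not state the exact formulation or how p-values are combined with it, so this form, the function names, and the example expression `(x0 - x1) * x2` are all assumptions made for illustration.

```python
from math import inf
from statistics import mean, pvariance


def fisher_score(values_a, values_b):
    """Two-class Fisher criterion for a single feature (assumed form)."""
    between = (mean(values_a) - mean(values_b)) ** 2
    within = pvariance(values_a) + pvariance(values_b)
    if within == 0:
        # Assumed convention when both classes have zero variance.
        return inf if between > 0 else 0.0
    return between / within


def constructed_feature(row):
    """Hypothetical high-level feature, standing in for an evolved GP tree."""
    return (row[0] - row[1]) * row[2]


# Toy low-level data: three original features per sample.
class_a = [[4.0, 1.0, 2.0], [5.0, 1.0, 2.0]]
class_b = [[1.0, 1.0, 2.0], [1.5, 1.0, 2.0]]

# Score the constructed feature by how well it separates the two classes.
score = fisher_score(
    [constructed_feature(row) for row in class_a],
    [constructed_feature(row) for row in class_b],
)
```

A higher score means the constructed feature separates the class means more strongly relative to the within-class variance.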
evolutionary computation machine learning and data mining in bioinformatics | 2013
Soha Ahmed; Mengjie Zhang; Lifeng Peng
Biomarker discovery using mass spectrometry (MS) data is very useful in disease detection and drug discovery. The process of biomarker discovery in MS data must start with feature selection as the number of features in MS data is extremely large (e.g. thousands) while the number of samples is comparatively small. In this study, we propose the use of genetic programming (GP) for automatic feature selection and classification of MS data. This GP based approach works by using the features selected by two feature selection metrics, namely information gain (IG) and relief-f (REFS-F) in the terminal set. The feature selection performance of the proposed approach is examined and compared with IG and REFS-F alone on five MS data sets with different numbers of features and instances. Naive Bayes (NB), support vector machines (SVMs) and J48 decision trees (J48) are used in the experiments to evaluate the classification accuracy of the selected features. Meanwhile, GP is also used as a classification method in the experiments and its performance is compared with that of NB, SVMs and J48. The results show that GP as a feature selection method can select a smaller number of features with better classification performance than IG and REFS-F using NB, SVMs and J48. In addition, GP as a classification method also outperforms NB and J48 and achieves comparable or slightly better performance than SVMs on these data sets.
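Information gain, one of the two filter metrics named in this abstract, can be sketched as follows. The code assumes a continuous feature is split at a single threshold before the entropy reduction is computed. That discretisation choice and the function names are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting one feature at the given threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    remainder = sum(
        len(part) / total * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - remainder


labels = ["healthy", "healthy", "disease", "disease"]
informative = information_gain([1.0, 2.0, 8.0, 9.0], labels, threshold=5.0)
uninformative = information_gain([1.0, 9.0, 2.0, 8.0], labels, threshold=5.0)
```

Features with the highest gain would then be the candidates placed in the GP terminal set.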
Connection Science | 2014
Soha Ahmed; Mengjie Zhang; Lifeng Peng
Feature selection on mass spectrometry (MS) data is essential for improving classification performance and biomarker discovery. The number of MS samples is typically very small compared with the high dimensionality of the samples, which makes the problem of biomarker discovery very hard. In this paper, we propose the use of genetic programming (GP) for biomarker detection and classification of MS data. The proposed approach is composed of two phases: in the first phase, feature selection and ranking are performed; in the second phase, classification is performed. The results show that the proposed method can achieve better classification performance and biomarker detection rate than the information gain (IG) based and the RELIEF feature selection methods. Four classifiers, Naive Bayes, J48 decision tree, random forest and support vector machines, are also used to further test the performance of the top-ranked features. The results show that the four classifiers achieve better performance using the top-ranked features from the proposed method than using those from the IG and the RELIEF methods. Furthermore, GP also outperforms a genetic algorithm approach on most of the data sets used.
genetic and evolutionary computation conference | 2014
Soha Ahmed; Mengjie Zhang; Lifeng Peng
The use of mass spectrometry to verify and quantify biomarkers requires the identification of the peptides that are detectable. In this paper, we propose the use of genetic programming (GP) to measure the detection probability of the peptides. The new GP method is tested and verified on two different yeast data sets with increasing complexity and shows improved performance over other state-of-the-art classification and feature selection algorithms.
congress on evolutionary computation | 2014
Soha Ahmed; Mengjie Zhang; Lifeng Peng
Mass spectrometry (MS) is a technology used for identification and quantification of proteins and metabolites. It helps in the discovery of proteomic or metabolomic biomarkers, which aid in disease detection and drug discovery. The detection of biomarkers is performed by classifying patient samples against healthy samples. The mass spectrometer produces high-dimensional data where most of the features are irrelevant for classification. Therefore, feature reduction is needed before the classification of MS data can be done effectively. Feature construction can provide a means of dimensionality reduction and aims at improving the classification performance. In this paper, genetic programming (GP) is used for construction of multiple features. Two methods are proposed for this objective. The proposed methods wrap a Random Forest (RF) classifier in GP to ensure the quality of the constructed features. Five other classifiers in addition to RF are used to test the impact of the constructed features on classifier performance. The results show that the proposed GP methods improved classification performance over the original set of features on five MS data sets.
european conference on applications of evolutionary computation | 2016
Soha Ahmed; Mengjie Zhang; Lifeng Peng; Bing Xue
Mass spectrometry is currently the most commonly used technology in biochemical research for proteomic analysis. The main goal of proteomic profiling using mass spectrometry is the classification of samples from different clinical states. This requires the identification of proteins or peptides (biomarkers) that are expressed differentially between different clinical states. However, due to the high dimensionality of the data and the small number of samples, classification of mass spectrometry data is a challenging task. Therefore, an effective feature manipulation algorithm, through either feature selection or feature construction, is needed to enhance the classification performance and at the same time minimise the number of features. Most of the feature manipulation methods for mass spectrometry data treat this problem as a single-objective task which focuses on improving the classification performance. This paper presents two new methods for biomarker detection through multi-objective feature selection and feature construction. The results show that the proposed multi-objective feature selection method can obtain better subsets of features than the single-objective algorithm and two traditional multi-objective approaches for feature selection. Moreover, the multi-objective feature construction algorithm further improves the performance over the multi-objective feature selection algorithm. This paper presents the first multi-objective genetic programming approach for biomarker detection in mass spectrometry data.
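The trade-off between classification performance and number of features is usually handled with Pareto dominance. The sketch below assumes two objectives that are both minimised, classification error and feature count, and extracts the non-dominated candidates. The abstract does not say which multi-objective algorithm is used, so this shows only the dominance test, with made-up candidate values.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in one.

    Both tuples hold objectives to be minimised, assumed here to be
    (classification error, number of selected features).
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up candidate feature subsets: (error rate, feature count).
candidates = [(0.10, 5), (0.08, 9), (0.12, 5), (0.08, 12), (0.20, 2)]
front = pareto_front(candidates)
```

Here `(0.12, 5)` drops out because `(0.10, 5)` has lower error with the same number of features, and `(0.08, 12)` drops out because `(0.08, 9)` matches its error with fewer features.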
simulated evolution and learning | 2014
Soha Ahmed; Mengjie Zhang; Lifeng Peng; Bing Xue
The biomarker discovery process usually produces a long list of candidates, which need to be verified. The verification of protein biomarkers from mass spectrometry data can be done by measuring the detection probability of peptides in the mass spectrometer. However, the limited size of the experimental data and lack of a universal quantitative method make the identification of these peptides challenging. In this paper, genetic programming (GP) is proposed to measure the detection of the peptides in the mass spectrometer. This is done by measuring the physicochemical properties of the peptides and selecting the high-responding peptides. The proposed method performs both feature selection and classification, where feature selection is adopted to determine the important physicochemical properties required for the prediction. The proposed GP method is tested on two different yeast data sets with increasing complexity. It outperforms five other state-of-the-art classification algorithms. The results also show that GP outperforms two conventional feature selection methods, namely Chi Square and Information Gain Ratio.
european conference on applications of evolutionary computation | 2014
Soha Ahmed; Mengjie Zhang; Lifeng Peng
Alignment of samples from liquid chromatography-mass spectrometry (LC-MS) measurements has a significant role in the detection of biomarkers and in metabolomic studies. Machine drift causes differences between LC-MS measurements, and an accurate alignment of the shifts introduced to the same peptide or metabolite is needed. In this paper, we propose the use of genetic programming (GP) for multiple alignment of LC-MS data. The proposed approach consists of two main phases. The first phase is peak matching, where the peaks from different LC-MS maps (peak lists) are matched to allow the calculation of the retention time deviation. The second phase uses GP for multiple alignment of the peak lists with respect to a reference. GP is designed to perform multiple-output regression by using a special node in the tree which divides the output of the tree into multiple outputs. Finally, the peaks that show the maximum correlation after dewarping the retention times are selected to form a consensus aligned map. The proposed approach is tested on one proteomics and two metabolomics LC-MS datasets with different numbers of samples. The method is compared to several benchmark methods and the results show that the proposed approach outperforms these methods in three fractions of the proteomics dataset and the metabolomics dataset with a larger number of maps. Moreover, the results on the rest of the datasets are highly competitive with the other methods.
simulated evolution and learning | 2012
Rohitash Chandra; Mengjie Zhang; Lifeng Peng
Metabolic fluxes are the key determinants of cellular metabolism. They are estimated from stable isotope labeling experiments and provide informative parameters for evaluating cell physiology and causes of disease. Metabolic flux analysis involves solving a system of non-linear isotopomer balance equations by simulating the isotopic labeling distributions of metabolites measured by tandem mass spectrometry, which is essentially an optimization problem. In this work, we introduce the cooperative coevolution optimization method, which decomposes a large problem into a set of subcomponents, for solving the set of non-linear equations. We demonstrate that cooperative coevolution can be used for solving the given metabolic flux model. While the proposed approach makes good progress on the use of evolutionary computation techniques for this problem, a number of disadvantages remain to be addressed in the future to meet the expectations of biologists.
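The decomposition idea behind cooperative coevolution can be sketched on a toy problem. The code below is not a metabolic flux model. It assumes a small hypothetical system of four non-linear equations, splits the four variables into two subcomponents, and evolves one sub-population per subcomponent, evaluating each individual by inserting it into a shared context vector. The toy system, the Gaussian mutation, the keep-the-best selection scheme, and all parameter values are illustrative assumptions.

```python
import random


def residual(x):
    """Sum of squared residuals of a toy non-linear system (hypothetical)."""
    f1 = x[0] ** 2 + x[1] - 3.0
    f2 = x[0] + x[1] ** 2 - 5.0
    f3 = x[2] * x[3] - 2.0
    f4 = x[2] + x[3] - 3.0
    return f1 ** 2 + f2 ** 2 + f3 ** 2 + f4 ** 2


def cooperative_coevolution(groups, n_vars, cycles=100, pop_size=10,
                            step=0.3, seed=0):
    """Evolve one sub-population per variable group in round-robin order."""
    rng = random.Random(seed)
    # Context vector: the current best-known value of every variable.
    context = [rng.uniform(0.0, 3.0) for _ in range(n_vars)]
    initial = residual(context)
    # Each individual stores values only for its own group's variables.
    pops = [[[rng.uniform(0.0, 3.0) for _ in group] for _ in range(pop_size)]
            for group in groups]

    def evaluate(group, individual):
        # Combine the individual with the context to get a full solution.
        trial = list(context)
        for index, value in zip(group, individual):
            trial[index] = value
        return residual(trial)

    for _ in range(cycles):
        for group, pop in zip(groups, pops):
            # Each parent produces one mutated child; keep the best half.
            children = [[v + rng.gauss(0.0, step) for v in ind] for ind in pop]
            ranked = sorted(pop + children, key=lambda ind: evaluate(group, ind))
            pop[:] = ranked[:pop_size]
            # Update the context only when the best individual improves it.
            if evaluate(group, pop[0]) < residual(context):
                for index, value in zip(group, pop[0]):
                    context[index] = value
    return context, residual(context), initial


solution, final_error, initial_error = cooperative_coevolution(
    groups=[[0, 1], [2, 3]], n_vars=4
)
```

Because the context vector changes only on improvement, the final error never exceeds the initial one. How far it falls depends on how well the chosen grouping matches the coupling between variables, which in this toy system is exact by construction.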