Patrick C. H. Ma
Hong Kong Polytechnic University
Publication
Featured research published by Patrick C. H. Ma.
IEEE Transactions on Evolutionary Computation | 2006
Patrick C. H. Ma; Keith C. C. Chan; Xin Yao; David K. Y. Chiu
Clustering is concerned with the discovery of interesting groupings of records in a database. Many algorithms have been developed to tackle clustering problems in a variety of application domains. In particular, some of them have been used in bioinformatics research to uncover inherent clusters in gene expression microarray data. In this paper, we show how some popular clustering algorithms have been used for this purpose. Based on experiments using simulated and real data, we also show that the performance of these algorithms can be further improved. For more effective clustering of gene expression microarray data, which is typically characterized by a lot of noise, we propose a novel evolutionary algorithm called evolutionary clustering (EvoCluster). EvoCluster encodes an entire cluster grouping in a chromosome so that each gene in the chromosome encodes one cluster. Based on this encoding scheme, it makes use of a set of reproduction operators to facilitate the exchange of grouping information between chromosomes. The fitness function that EvoCluster adopts is able to differentiate how relevant each feature value is in determining a particular cluster grouping. As such, instead of just local pairwise distances, it also takes into consideration how clusters are arranged globally. Unlike many popular clustering algorithms, EvoCluster does not require the number of clusters to be decided in advance. Also, patterns hidden in each cluster can be explicitly revealed and presented for easy interpretation even by casual users. For performance evaluation, we have tested EvoCluster using both simulated and real data. Experimental results show that it can be very effective and robust even in the presence of noise and missing values. Also, when correlating the gene expression microarray data with DNA sequences, we were able to uncover significant biological binding sites (both previously known and unknown) in each cluster discovered by EvoCluster.
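The encoding described in this abstract, one chromosome per cluster grouping with each gene holding one cluster, can be illustrated with a minimal sketch. The names `Chromosome`, `is_valid`, `to_labels`, and `move_record` are hypothetical and chosen for illustration only; EvoCluster's actual reproduction operators and fitness function are not given in the abstract and are not reproduced here.

```python
from typing import List, Set

# Assumption: a chromosome is a list of genes, and each gene is one cluster,
# stored as the set of record indices assigned to it.
Chromosome = List[Set[int]]


def is_valid(chromosome: Chromosome, n_records: int) -> bool:
    """True if every record index 0..n_records-1 appears in exactly one gene."""
    seen = [i for gene in chromosome for i in gene]
    return sorted(seen) == list(range(n_records))


def to_labels(chromosome: Chromosome, n_records: int) -> List[int]:
    """Decode a chromosome into one cluster label per record."""
    labels = [-1] * n_records
    for cluster_id, gene in enumerate(chromosome):
        for i in gene:
            labels[i] = cluster_id
    return labels


def move_record(chromosome: Chromosome, record: int, target: int) -> Chromosome:
    """Hypothetical mutation: move one record into gene `target`.

    Genes left empty are dropped, so the number of clusters can change.
    """
    child = [set(gene) - {record} for gene in chromosome]
    child[target].add(record)
    return [gene for gene in child if gene]


grouping = [{0, 1}, {2}, {3, 4}]
mutated = move_record(grouping, record=2, target=0)
```

Because the cluster count is simply the number of genes in the chromosome, an operator such as the one above can merge clusters without a preset k, which is consistent with the abstract's statement that the number of clusters need not be decided in advance.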
IEEE Transactions on Biomedical Engineering | 2011
Patrick C. H. Ma; Keith C. C. Chan
Due to the complexity of the underlying biological processes, gene expression data obtained from DNA microarray technologies are typically noisy and have very high dimensionality, which makes the mining of such data for gene function prediction very difficult. To tackle these difficulties, we propose a technique called incremental fuzzy mining (IFM). By transforming quantitative expression values into linguistic terms, such as highly or lowly expressed, IFM can effectively capture heterogeneity in expression data for pattern discovery. It does so using a fuzzy measure to determine if interesting association patterns exist between the linguistic gene expression levels. Based on these patterns, IFM can make accurate gene function predictions, and these predictions allow each gene to belong to more than one functional class with different degrees of membership. The gene function prediction problem can be formulated as either a classification or a clustering problem, and IFM can be used either as a classification technique or together with existing clustering algorithms to improve the discovered cluster groupings for greater prediction accuracy. IFM is also an incremental data mining technique, so the discovered patterns can be continually refined based only on newly collected data without the need for retraining using the whole dataset. For performance evaluation, IFM has been tested with real expression datasets for both classification and clustering tasks. Experimental results show that it can effectively uncover hidden patterns for accurate gene function predictions.
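The step of transforming quantitative expression values into linguistic terms can be sketched as follows. This is an illustration under stated assumptions, not IFM itself: the three terms ("low", "medium", "high"), the triangular membership shapes, and the function name `fuzzify` with its `low` and `high` breakpoints are all hypothetical, since the abstract does not specify the membership functions used.

```python
from typing import Dict


def fuzzify(value: float, low: float, high: float) -> Dict[str, float]:
    """Map an expression value to membership degrees in three linguistic terms.

    Assumes low < high. Values at or below `low` are fully "low", values at or
    above `high` are fully "high", and memberships change linearly in between
    with "medium" peaking at the midpoint.
    """
    mid = (low + high) / 2.0
    if value <= low:
        return {"low": 1.0, "medium": 0.0, "high": 0.0}
    if value >= high:
        return {"low": 0.0, "medium": 0.0, "high": 1.0}
    if value <= mid:
        m = (value - low) / (mid - low)
        return {"low": 1.0 - m, "medium": m, "high": 0.0}
    m = (value - mid) / (high - mid)
    return {"low": 0.0, "medium": 1.0 - m, "high": m}


# A value between two terms belongs partly to both.
degrees = fuzzify(3.0, low=0.0, high=4.0)
```

A value such as 3.0 on a 0 to 4 scale is neither purely "medium" nor purely "high", and the graded memberships preserve that ambiguity instead of forcing a hard threshold.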
IEEE Transactions on Fuzzy Systems | 2008
Patrick C. H. Ma; Keith C. C. Chan
To infer the structure of a gene regulatory network (GRN), it is important to identify, for each gene in the GRN, which other genes can affect its expression and how they can affect it. For this purpose, many algorithms have been developed to generate hypotheses about the presence or absence of interactions between genes. These algorithms, however, cannot be used to determine if a gene activates or inhibits another. To obtain such information to better infer GRN structures, we propose a fuzzy data mining technique here. By transforming quantitative expression values into linguistic terms, it defines a measure of fuzzy dependency among genes. Using such a measure, the technique is able to discover interesting fuzzy dependency relationships in noisy, high-dimensional time series expression data so that it can determine not only if a gene is dependent on another but also if a gene is supposed to be activated or inhibited. In addition, the technique can also predict how a gene in an unseen sample (i.e., expression data that are not in the original database) would be affected by other genes in it, and this makes statistical verification of the reliability of the discovered gene interactions easier. For evaluation, the proposed technique has been tested using real expression data, and experimental results show that the use of a fuzzy logic-based technique in gene expression data analysis can be quite effective.
IEEE Transactions on Biomedical Engineering | 2009
Patrick C. H. Ma; Keith C. C. Chan
Many existing clustering algorithms have been used to identify coexpressed genes in gene expression data. These algorithms are used mainly to partition data in the sense that each gene is allowed to belong to only one cluster. Since proteins typically interact with different groups of proteins in order to serve different biological roles, the genes that produce these proteins are expected to coexpress with more than one group of genes. In other words, some genes are expected to belong to more than one cluster. This poses a challenge to gene expression data clustering as there is a need for overlapping clusters to be discovered in a noisy environment. For this task, we propose in this paper an effective information-theoretic approach, which consists of an initial clustering phase and a second reclustering phase. The proposed approach has been tested with both simulated and real expression data. Experimental results show that it can improve the performance of existing clustering algorithms and is able to effectively uncover interesting patterns in noisy gene expression data so that, based on these patterns, overlapping clusters can be discovered.
Journal of Computational Biology | 2008
Patrick C. H. Ma; Keith C. C. Chan
To classify proteins into functional families based on their primary sequences, popular algorithms such as the k-NN-, HMM-, and SVM-based algorithms are often used. For many of these algorithms to perform their tasks, protein sequences need to be properly aligned first. Since the alignment process can be error-prone, protein classification may not be performed very accurately. To improve classification accuracy, we propose an algorithm, called the Unaligned Protein SEquence Classifier (UPSEC), which can perform its tasks without sequence alignment. UPSEC makes use of a probabilistic measure to identify residues that are useful for classification in both positive and negative training samples, and can handle multi-class classification with a single classifier and a single pass through the training data. UPSEC has been tested with real protein data sets. Experimental results show that UPSEC can effectively classify unaligned protein sequences into their corresponding functional families, and the patterns it discovers during the training process can be biologically meaningful.
Journal of Bioinformatics and Computational Biology | 2007
Patrick C. H. Ma; Keith C. C. Chan
Recent developments in DNA microarray technologies have made the reconstruction of gene regulatory networks (GRNs) feasible. To infer the overall structure of a GRN, there is a need to find out how the expression of each gene can be affected by the others. Many existing approaches to reconstructing GRNs are developed to generate hypotheses about the presence or absence of interactions between genes so that laboratory experiments can be performed afterwards for verification. Since they are not intended to be used to predict if a gene in an unseen sample has any interactions with other genes, statistical verification of the reliability of the discovered interactions can be difficult. Furthermore, since the temporal ordering of the data is not taken into consideration, the directionality of regulation cannot be established using these existing techniques. To tackle these problems, we propose a data mining technique here. This technique makes use of a probabilistic inference approach to uncover interesting dependency relationships in noisy, high-dimensional time series expression data. It is able to determine not only if a gene is dependent on another but also whether it is activated or inhibited. In addition, it can predict how a gene would be affected by other genes even in unseen samples. For performance evaluation, the proposed technique has been tested with real expression data. Experimental results show that it can be very effective. The discovered dependency relationships can reveal gene regulatory relationships that could be used to infer the structures of GRNs.
IEEE Transactions on Nanobioscience | 2009
Patrick C. H. Ma; Keith C. C. Chan
Clustering is concerned with the discovery of groupings of records in a database. Many clustering problems are defined as partitioning problems in the sense that similar records are grouped into nonoverlapping partitions. However, the clustering of gene expression data to discover coexpressed genes may not always be meaningful if this problem is reduced to a partitioning problem. Due to the complexity of the underlying biological processes, a protein can interact with one or more other proteins belonging to different functional classes in order to perform a particular biological role. For this reason, when responding to different external stimulants, a gene that produces a particular protein can coexpress with more than one group of other genes. The gene can therefore belong to more than one group of coexpressed genes. This poses a challenge to many clustering algorithms as they were not originally developed to discover overlapping clusters in noisy gene expression data. In this paper, we propose an iterative data mining approach that consists of two phases. In phase 1, a clustering algorithm is used to discover the initial, nonoverlapping partitioning of gene expression profiles in gene expression data. Then, in phase 2, the partition memberships of genes are redetermined iteratively by a pattern discovery technique to decide whether a gene should remain in the same partition, be moved to another partition, or also be grouped together with other genes in other partitions. The proposed approach has been tested with both artificial and real datasets. Experimental results show that it can improve the performance of existing clustering algorithms and is able to effectively discover overlapping clusters in noisy gene expression data.
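The two-phase structure in this abstract (an initial nonoverlapping partition, then iterative redetermination of memberships that may keep, move, or duplicate a gene) can be sketched as below. Note the substitution: the abstract's phase 2 uses a pattern discovery technique that is not specified here, so this sketch swaps in a plain distance-to-centroid rule as a stand-in. The function name `recluster` and the `slack` and `max_iter` parameters are hypothetical.

```python
from math import dist  # available in Python 3.8+
from typing import List, Sequence, Set


def recluster(
    profiles: Sequence[Sequence[float]],
    initial_labels: Sequence[int],
    k: int,
    slack: float = 2.0,
    max_iter: int = 10,
) -> List[Set[int]]:
    """Turn a hard partition (phase 1 output) into overlapping memberships.

    Stand-in rule (an assumption, not the paper's method): a profile joins
    every cluster whose centroid lies within `slack` times the distance to
    its nearest centroid. Repeats until memberships stop changing.
    """
    memberships = [{label} for label in initial_labels]
    for _ in range(max_iter):
        # Recompute one centroid per cluster from its current members.
        cents = []
        for c in range(k):
            members = [p for p, ms in zip(profiles, memberships) if c in ms]
            if members:
                cents.append([sum(col) / len(members) for col in zip(*members)])
            else:
                cents.append(None)  # empty cluster: unreachable this round
        updated = []
        for p in profiles:
            d = [dist(p, cen) if cen is not None else float("inf") for cen in cents]
            limit = slack * min(d) + 1e-12
            updated.append({c for c in range(k) if d[c] <= limit})
        if updated == memberships:
            break
        memberships = updated
    return memberships


profiles = [[0, 0], [0, 1], [10, 10], [10, 11], [5, 5]]
overlapping = recluster(profiles, initial_labels=[0, 0, 1, 1, 0], k=2, slack=2.0)
```

In this toy input the last profile sits between the two groups, so it ends up in both clusters while the other four keep a single membership.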
International Conference on Tools with Artificial Intelligence | 2003
Patrick C. H. Ma; Keith C. C. Chan
The combined interpretation of gene expression data and gene sequences offers a valuable approach to investigating the intricate relationships involved in gene transcriptional regulation. The highly interactive expression data produced by microarray hybridization experiments allow us to find clusters of coexpressed genes. By analyzing the upstream regions of the identified coexpressed genes, we can discover the regulatory patterns characterized by transcription factor binding sites, which govern the process of transcriptional regulation. This paper presents a generic clustering algorithm that uses a hybrid GA approach to discover clusters in gene expression data. The advantage of this method is that a large search space can be effectively explored by utilizing evolutionary algorithm techniques. Moreover, it is able to discover underlying patterns in noisy gene expression data for meaningful data groupings, and statistically significant patterns hidden in each cluster can be extracted at the same time. Since the proposed method can handle both continuous- and discrete-valued data, it can be used with different microarray and biomedical data. To test its effectiveness, we have used it on real expression data. The experimental results reveal meaningful groupings and uncover many known transcription factor binding sites.
Computational Intelligence in Bioinformatics and Computational Biology | 2006
Patrick C. H. Ma; Keith C. C. Chan
To infer the overall structures of gene regulatory networks (GRNs), it is important to identify, for each gene in a GRN, which other genes can affect its expression and how they can affect it. Many existing approaches to reconstructing GRNs are developed to generate hypotheses about the presence or absence of interactions between genes so that laboratory tests can be carried out afterwards for verification. Since they are not intended to be used to predict if a gene has any interactions with other genes from an unseen sample (i.e., expression data that is not in the original database), statistical verification of the reliability of the discovered gene interactions is difficult. To better infer the structures of GRNs, we propose an effective fuzzy data mining technique in this paper. By transforming quantitative expression values into linguistic terms, the proposed technique is able to mine noisy, high-dimensional time series expression data for interesting fuzzy sequential associations between genes. It is able to determine not only if a gene is dependent on another but also if a gene is supposed to be activated or inhibited. In addition, it can predict how a gene in an unseen sample would be affected by other genes in it. For evaluation, the proposed technique has been tested using real expression data, and experimental results show that the use of a fuzzy logic-based technique in gene expression data analysis can be very effective.
Journal of Computational Biology | 2010
Patrick C. H. Ma; Keith C. C. Chan
In this article, we propose an effective data mining technique for multi-class protein sequence classification. The technique, which can discover discriminative motif-sets for classification, performs its tasks in two phases. In Phase 1, it makes use of a popular motif discovery algorithm called MEME (Multiple Expectation Maximization for Motif Elicitation) to discover a set of highly conserved motifs in each protein family of training sequences. The highly conserved motif-sets discovered in each family may overlap with each other and may therefore not be unique enough to allow them to be used for classification. Phase 2, therefore, makes use of a pattern discovery approach to discover the interesting motif-sets in each protein family that are useful for classification with a single classifier. Based on these motif-sets, the functional family of each independent testing sequence can then be determined. For experimentation, the proposed technique has been tested with different sets of protein sequences. Experimental results show that it outperforms other existing protein sequence classifiers and can effectively classify proteins into their corresponding functional families. In addition, the motif-sets discovered during the training process have been found to be biologically meaningful.