Patrick C. H. Ma | Researchain

Publication

Featured research published by Patrick C. H. Ma.

IEEE Transactions on Evolutionary Computation | 2006

An evolutionary clustering algorithm for gene expression microarray data analysis

Patrick C. H. Ma; Keith C. C. Chan; Xin Yao; David K. Y. Chiu

Clustering is concerned with the discovery of interesting groupings of records in a database. Many algorithms have been developed to tackle clustering problems in a variety of application domains. In particular, some of them have been used in bioinformatics research to uncover inherent clusters in gene expression microarray data. In this paper, we show how some popular clustering algorithms have been used for this purpose. Based on experiments using simulated and real data, we also show that the performance of these algorithms can be further improved. For more effective clustering of gene expression microarray data, which is typically characterized by a lot of noise, we propose a novel evolutionary algorithm called evolutionary clustering (EvoCluster). EvoCluster encodes an entire cluster grouping in a chromosome so that each gene in the chromosome encodes one cluster. Based on this encoding scheme, it uses a set of reproduction operators to facilitate the exchange of grouping information between chromosomes. The fitness function that EvoCluster adopts can differentiate how relevant each feature value is in determining a particular cluster grouping. As such, instead of just local pairwise distances, it also takes into consideration how clusters are arranged globally. Unlike many popular clustering algorithms, EvoCluster does not require the number of clusters to be decided in advance. Also, patterns hidden in each cluster can be explicitly revealed and presented for easy interpretation even by casual users. For performance evaluation, we have tested EvoCluster using both simulated and real data. Experimental results show that it can be very effective and robust even in the presence of noise and missing values. Also, when correlating the gene expression microarray data with DNA sequences, we were able to uncover significant biological binding sites (both previously known and unknown) in each cluster discovered by EvoCluster.
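
The encoding described in this abstract, in which one chromosome holds a whole cluster grouping and each gene of the chromosome holds one cluster, can be illustrated with a minimal Python sketch. The names `Chromosome`, `is_valid_grouping`, and `inject_cluster` are hypothetical, and the operator shown is only an assumed example of exchanging grouping information; it is not one of the reproduction operators defined in the paper.

```python
from typing import List, Set

# One chromosome = one complete grouping; each element ("gene") is one cluster
# given as a set of record indices. This layout is an assumption for illustration.
Chromosome = List[Set[int]]


def is_valid_grouping(chromosome: Chromosome, n_records: int) -> bool:
    """Check that clusters are non-empty, disjoint, and cover all records."""
    seen: Set[int] = set()
    for cluster in chromosome:
        if not cluster or seen & cluster:
            return False
        seen |= cluster
    return seen == set(range(n_records))


def inject_cluster(receiver: Chromosome, donor: Chromosome, donor_index: int) -> Chromosome:
    """Copy one cluster from donor into receiver (hypothetical operator).

    Members of the copied cluster are removed from the receiver's own
    clusters and emptied clusters are dropped, so the number of clusters
    is free to grow or shrink.
    """
    moved = donor[donor_index]
    child = [cluster - moved for cluster in receiver]
    child = [cluster for cluster in child if cluster]
    child.append(set(moved))
    return child


parent_a = [{0, 1, 2}, {3, 4, 5}]
parent_b = [{0, 3}, {1, 2, 4, 5}]
child = inject_cluster(parent_a, parent_b, 0)
```

Because clusters are stored as variable-length lists, the sketch never fixes the number of clusters in advance, which matches the property the abstract highlights.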

IEEE Transactions on Biomedical Engineering | 2011

Incremental Fuzzy Mining of Gene Expression Data for Gene Function Prediction

Patrick C. H. Ma; Keith C. C. Chan

Due to the complexity of the underlying biological processes, gene expression data obtained from DNA microarray technologies are typically noisy and have very high dimensionality, which makes mining such data for gene function prediction very difficult. To tackle these difficulties, we propose a technique called incremental fuzzy mining (IFM). By transforming quantitative expression values into linguistic terms, such as highly or lowly expressed, IFM can effectively capture heterogeneity in expression data for pattern discovery. It does so using a fuzzy measure to determine if interesting association patterns exist between the linguistic gene expression levels. Based on these patterns, IFM can make accurate gene function predictions in such a way that each gene is allowed to belong to more than one functional class with different degrees of membership. The gene function prediction problem can be formulated both as a classification problem and as a clustering problem, and IFM can be used either as a classification technique or together with existing clustering algorithms to improve the cluster groupings discovered for greater prediction accuracy. IFM is also an incremental data mining technique, so the discovered patterns can be continually refined based only on newly collected data without the need for retraining on the whole dataset. For performance evaluation, IFM has been tested with real expression datasets for both classification and clustering tasks. Experimental results show that it can effectively uncover hidden patterns for accurate gene function predictions.
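
The step of turning quantitative expression values into linguistic terms can be sketched as follows. This is a minimal illustration only: the three terms (`low`, `medium`, `high`), the piecewise-linear membership shapes, and the min-max normalization are all assumptions, since the abstract does not specify how IFM defines its fuzzy sets.

```python
from typing import Dict, List


def normalize(values: List[float]) -> List[float]:
    """Min-max scale an expression profile to [0, 1] (assumed preprocessing)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def fuzzify(x: float) -> Dict[str, float]:
    """Map a normalized value to degrees of membership in three linguistic terms.

    Assumed shapes: 'low' falls from 1 at 0 to 0 at 0.5, 'high' rises from
    0 at 0.5 to 1 at 1, and 'medium' peaks at 0.5. Memberships sum to 1.
    """
    x = min(max(x, 0.0), 1.0)
    low = max(0.0, 1.0 - 2.0 * x)
    high = max(0.0, 2.0 * x - 1.0)
    medium = 1.0 - low - high
    return {"low": low, "medium": medium, "high": high}


profile = normalize([2.0, 4.0, 6.0])
terms = [fuzzify(v) for v in profile]
```

A value between two terms receives partial membership in both, which is what lets a fuzzy approach avoid the hard cut-offs of crisp discretization.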

IEEE Transactions on Fuzzy Systems | 2008

Inferring Gene Regulatory Networks From Expression Data by Discovering Fuzzy Dependency Relationships

Patrick C. H. Ma; Keith C. C. Chan

To infer the structure of a gene regulatory network (GRN), it is important to identify, for each gene in the GRN, which other genes can affect its expression and how they can affect it. For this purpose, many algorithms have been developed to generate hypotheses about the presence or absence of interactions between genes. These algorithms, however, cannot be used to determine if a gene activates or inhibits another. To obtain such information to better infer GRN structures, we propose a fuzzy data mining technique here. By transforming quantitative expression values into linguistic terms, it defines a measure of fuzzy dependency among genes. Using such a measure, the technique is able to discover interesting fuzzy dependency relationships in noisy, high-dimensional time series expression data so that it can determine not only if a gene is dependent on another but also if a gene is supposed to be activated or inhibited. In addition, the technique can predict how a gene in an unseen sample (i.e., expression data that are not in the original database) would be affected by other genes in it, which makes statistical verification of the reliability of the discovered gene interactions easier. For evaluation, the proposed technique has been tested using real expression data, and experimental results show that the use of a fuzzy-logic-based technique in gene expression data analysis can be quite effective.

IEEE Transactions on Biomedical Engineering | 2009

A Novel Approach for Discovering Overlapping Clusters in Gene Expression Data

Patrick C. H. Ma; Keith C. C. Chan

Many existing clustering algorithms have been used to identify coexpressed genes in gene expression data. These algorithms are used mainly to partition data in the sense that each gene is allowed to belong only to one cluster. Since proteins typically interact with different groups of proteins in order to serve different biological roles, the genes that produce these proteins are therefore expected to coexpress with more than one group of genes. In other words, some genes are expected to belong to more than one cluster. This poses a challenge to gene expression data clustering as there is a need for overlapping clusters to be discovered in a noisy environment. For this task, we propose in this paper an effective information-theoretic approach consisting of an initial clustering phase and a second reclustering phase. The proposed approach has been tested with both simulated and real expression data. Experimental results show that it can improve the performance of existing clustering algorithms and is able to effectively uncover interesting patterns in noisy gene expression data so that, based on these patterns, overlapping clusters can be discovered.

Journal of Computational Biology | 2008

UPSEC: an algorithm for classifying unaligned protein sequences into functional families

Patrick C. H. Ma; Keith C. C. Chan

To classify proteins into functional families based on their primary sequences, popular algorithms such as the k-NN-, HMM-, and SVM-based algorithms are often used. For many of these algorithms to perform their tasks, protein sequences need to be properly aligned first. Since the alignment process can be error-prone, protein classification may not be performed very accurately. To improve classification accuracy, we propose an algorithm, called the Unaligned Protein SEquence Classifier (UPSEC), which can perform its tasks without sequence alignment. UPSEC makes use of a probabilistic measure to identify residues that are useful for classification in both positive and negative training samples, and can handle multi-class classification with a single classifier and a single pass through the training data. UPSEC has been tested with real protein data sets. Experimental results show that UPSEC can effectively classify unaligned protein sequences into their corresponding functional families, and the patterns it discovers during the training process can be biologically meaningful.

Journal of Bioinformatics and Computational Biology | 2007

An effective data mining technique for reconstructing gene regulatory networks from time series expression data

Patrick C. H. Ma; Keith C. C. Chan

Recent developments in DNA microarray technologies have made the reconstruction of gene regulatory networks (GRNs) feasible. To infer the overall structure of a GRN, there is a need to find out how the expression of each gene can be affected by the others. Many existing approaches to reconstructing GRNs are developed to generate hypotheses about the presence or absence of interactions between genes so that laboratory experiments can be performed afterwards for verification. Since they are not intended to be used to predict if a gene in an unseen sample has any interactions with other genes, statistical verification of the reliability of the discovered interactions can be difficult. Furthermore, since the temporal ordering of the data is not taken into consideration, the directionality of regulation cannot be established using these existing techniques. To tackle these problems, we propose a data mining technique here. This technique makes use of a probabilistic inference approach to uncover interesting dependency relationships in noisy, high-dimensional time series expression data. It is able to determine not only if a gene is dependent on another but also whether it is activated or inhibited. In addition, it can predict how a gene would be affected by other genes even in unseen samples. For performance evaluation, the proposed technique has been tested with real expression data. Experimental results show that it can be very effective. The discovered dependency relationships can reveal gene regulatory relationships that could be used to infer the structures of GRNs.
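
The role of temporal ordering in telling activation from inhibition can be illustrated with a toy Python sketch. It is not the probabilistic inference approach of the paper: the functions `changes` and `lagged_agreement`, the up/down discretization, and the one-step lag are all assumptions chosen to keep the example small.

```python
from typing import List


def changes(series: List[float]) -> List[int]:
    """Discretize a time series into +1 (up), -1 (down), or 0 (flat) steps."""
    return [(b > a) - (b < a) for a, b in zip(series, series[1:])]


def lagged_agreement(regulator: List[float], target: List[float], lag: int = 1) -> float:
    """Mean product of regulator changes and target changes `lag` steps later.

    Under this toy measure, values near +1 suggest activation, values near -1
    suggest inhibition, and values near 0 suggest no lagged dependency.
    """
    r = changes(regulator)
    t = changes(target)
    pairs = list(zip(r, t[lag:]))
    if not pairs:
        raise ValueError("time series too short for the requested lag")
    return sum(a * b for a, b in pairs) / len(pairs)


regulator = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0]
activated = [5.0, 4.0, 5.0, 6.0, 5.0, 4.0]   # follows the regulator one step later
inhibited = [5.0, 6.0, 5.0, 4.0, 5.0, 6.0]   # moves opposite, one step later

score_activated = lagged_agreement(regulator, activated)
score_inhibited = lagged_agreement(regulator, inhibited)
```

Comparing the regulator at one time point with the target at a later one is what gives the relationship a direction; a measure that ignores the ordering of samples cannot distinguish which gene acts on which.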

IEEE Transactions on Nanobioscience | 2009

An Iterative Data Mining Approach for Mining Overlapping Coexpression Patterns in Noisy Gene Expression Data

Patrick C. H. Ma; Keith C. C. Chan

Clustering is concerned with the discovery of groupings of records in a database. Many clustering problems are defined as partitioning problems in the sense that similar records are grouped into nonoverlapping partitions. However, the clustering of gene expression data to discover coexpressed genes may not always be meaningful if this problem is reduced to a partitioning problem. Due to the complexity of the underlying biological processes, a protein can interact with one or more other proteins belonging to different functional classes in order to perform a particular biological role. For this reason, when responding to different external stimulants, a gene that produces a particular protein can coexpress with more than one group of other genes. The gene can therefore belong to more than one group of coexpressed genes. This poses a challenge to many clustering algorithms as they were not originally developed to discover overlapping clusters in noisy gene expression data. In this paper, we propose an iterative data mining approach that consists of two phases. In phase 1, a clustering algorithm is used to discover the initial, nonoverlapping partitioning of gene expression profiles in gene expression data. In phase 2, the partition memberships of genes are redetermined iteratively by a pattern discovery technique to decide whether a gene should remain in the same partition, be moved to another partition, or also be grouped together with other genes in other partitions. The proposed approach has been tested with both artificial and real datasets. Experimental results show that it can improve the performance of existing clustering algorithms and is able to effectively discover overlapping clusters in noisy gene expression data.
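
The two-phase structure described in this abstract can be sketched in Python as follows. Phase 1 is taken as given (any nonoverlapping partition), and phase 2 is replaced by an assumed stand-in: a gene joins its best-matching cluster plus every other cluster whose centroid it correlates with above a threshold. The paper uses a pattern discovery technique for phase 2, so the correlation rule, the `threshold` parameter, and all function names here are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Set

Profile = List[float]


def pearson(x: Profile, y: Profile) -> float:
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def centroid(profiles: List[Profile]) -> Profile:
    """Element-wise mean of a non-empty list of profiles."""
    return [sum(col) / len(col) for col in zip(*profiles)]


def recluster(data: Dict[str, Profile], partition: List[Set[str]],
              threshold: float = 0.6, max_iter: int = 10) -> List[Set[str]]:
    """Phase 2 stand-in: iteratively redetermine (possibly overlapping) memberships."""
    clusters = [set(c) for c in partition]
    for _ in range(max_iter):
        centroids = [centroid([data[g] for g in sorted(c)]) for c in clusters]
        updated: List[Set[str]] = [set() for _ in clusters]
        for gene, profile in data.items():
            scores = [pearson(profile, c) for c in centroids]
            best = max(range(len(scores)), key=scores.__getitem__)
            for i, score in enumerate(scores):
                # Keep the best cluster, and add any other cluster above threshold.
                if i == best or score >= threshold:
                    updated[i].add(gene)
        updated = [c for c in updated if c]  # drop clusters that lost all genes
        if updated == clusters:
            break
        clusters = updated
    return clusters


data = {
    "a1": [1.0, 1.0, -1.0, -1.0], "a2": [2.0, 2.0, -2.0, -2.0],
    "b1": [1.0, -1.0, 1.0, -1.0], "b2": [2.0, -2.0, 2.0, -2.0],
    "x": [2.0, 0.0, 0.0, -2.0],  # mixes both patterns
}
initial = [{"a1", "a2", "x"}, {"b1", "b2"}]  # phase 1 output, assumed given
overlapping = recluster(data, initial)
```

In this toy dataset the gene `x` carries both expression patterns, so the sketch keeps it in its original cluster and also adds it to the second one, while the other genes stay where phase 1 put them.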

International Conference on Tools with Artificial Intelligence | 2003

Discovering clusters in gene expression data using evolutionary approach

Patrick C. H. Ma; Keith C. C. Chan

The combined interpretation of gene expression data and gene sequences offers a valuable approach to investigate the intricate relationships involving gene transcriptional regulation. The highly interactive expression data produced by microarray hybridization experiments allow us to find the clusters of coexpressed genes. By analyzing the upstream regions of the identified coexpressed genes, we can discover the regulatory patterns characterized by transcription factor binding sites, which govern the process of transcriptional regulation. This paper presents a generic clustering algorithm that uses a hybrid GA approach to discover clusters in gene expression data. The advantage of this method is that a large search space can be effectively explored by utilizing evolutionary algorithm techniques. Moreover, it is able to discover underlying patterns in noisy gene expression data for meaningful data groupings, and statistically significant patterns hidden in each cluster can be extracted at the same time. Since the proposed method can handle both continuous- and discrete-valued data, it can be used with different microarray and biomedical data. To test its effectiveness, we have used it on real expression data. The experimental results reveal meaningful groupings and uncover many known transcription factor binding sites.

Computational Intelligence in Bioinformatics and Computational Biology | 2006

A Fuzzy Data Mining Technique for the Reconstruction of Gene Regulatory Networks from Time Series Expression Data

Patrick C. H. Ma; Keith C. C. Chan

To infer the overall structures of gene regulatory networks (GRNs), it is important to identify, for each gene in a GRN, which other genes can affect its expression and how they can affect it. Many existing approaches to reconstructing GRNs are developed to generate hypotheses about the presence or absence of interactions between genes so that laboratory tests can be carried out afterwards for verification. Since they are not intended to be used to predict if a gene has any interactions with other genes from an unseen sample (i.e., expression data that is not in the original database), statistical verification of the reliability of the discovered gene interactions is difficult. To better infer the structures of GRNs, we propose an effective fuzzy data mining technique in this paper. By transforming quantitative expression values into linguistic terms, the proposed technique is able to mine noisy, high-dimensional time series expression data for interesting fuzzy sequential associations between genes. It is able to determine not only if a gene is dependent on another but also if a gene is supposed to be activated or inhibited. In addition, it can predict how a gene in an unseen sample would be affected by other genes in it. For evaluation, the proposed technique has been tested using real expression data, and experimental results show that the use of a fuzzy-logic-based technique in gene expression data analysis can be very effective.

Journal of Computational Biology | 2010

Discovering Interesting Motif-Sets for Multi-Class Protein Sequence Classification

Patrick C. H. Ma; Keith C. C. Chan

In this article, we propose an effective data mining technique for multi-class protein sequence classification. The technique, which can discover discriminative motif-sets for classification, performs its tasks in two phases. In Phase 1, it makes use of a popular motif discovery algorithm called MEME (Multiple Expectation Maximization for Motif Elicitation) to discover a set of highly conserved motifs in each protein family of training sequences. The highly conserved motif-sets discovered in each family may overlap with each other and may therefore not be unique enough to allow them to be used for classification. Phase 2, therefore, makes use of a pattern discovery approach to discover the interesting motif-sets in each protein family that are useful for classification with a single classifier. Based on these motif-sets, the functional family of each independent testing sequence can then be determined. For experimentation, the proposed technique has been tested with different sets of protein sequences. Experimental results show that it outperforms other existing protein sequence classifiers and can effectively classify proteins into their corresponding functional families. In addition, the motif-sets discovered during the training process have been found to be biologically meaningful.

Collaboration

Top Co-Authors

Keith C. C. Chan

Hong Kong Polytechnic University

University of Science and Technology

David K. Y. Chiu

University of Guelph

Dongguk University

Hanyang University

Dongguk University

Dongguk University

Electronics and Telecommunications Research Institute

Southwestern University of Finance and Economics

Zhejiang University
