Nguyen Xuan Vinh
University of Melbourne
Publication
Featured research published by Nguyen Xuan Vinh.
international conference on machine learning | 2009
Nguyen Xuan Vinh; Julien Epps; James Bailey
Information theoretic based measures form a fundamental class of similarity measures for comparing clusterings, beside the class of pair-counting based and set-matching based measures. In this paper, we discuss the necessity of correction for chance for information theoretic based measures for clusterings comparison. We observe that the baseline for such measures, i.e. average value between random partitions of a data set, does not take on a constant value, and tends to have larger variation when the ratio between the number of data points and the number of clusters is small. This effect is similar in some other non-information theoretic based measures such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information value between a pair of clusterings, and then propose the adjusted version for several popular information theoretic based measures. Some examples are given to demonstrate the need and usefulness of the adjusted measures.
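The correction for chance described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the expected mutual information uses the hypergeometric model of randomness named in the abstract, and normalizing by the larger of the two entropies is only one of several possible variants.

```python
from collections import Counter
from math import exp, lgamma, log


def entropy(labels):
    """Shannon entropy (in nats) of a clustering given as a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information (in nats) between two clusterings of the same points."""
    n = len(u)
    a, b = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum((nij / n) * log(n * nij / (a[i] * b[j]))
               for (i, j), nij in joint.items())


def expected_mutual_information(u, v):
    """Expected MI between random clusterings with the same cluster sizes,
    assuming the hypergeometric (fixed-marginals) model of randomness."""
    n = len(u)
    sizes_u, sizes_v = Counter(u).values(), Counter(v).values()
    emi = 0.0
    for ai in sizes_u:
        for bj in sizes_v:
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # log-probability of observing an overlap of size nij
                log_p = (lgamma(ai + 1) + lgamma(bj + 1)
                         + lgamma(n - ai + 1) + lgamma(n - bj + 1)
                         - lgamma(n + 1) - lgamma(nij + 1)
                         - lgamma(ai - nij + 1) - lgamma(bj - nij + 1)
                         - lgamma(n - ai - bj + nij + 1))
                emi += (nij / n) * log(n * nij / (ai * bj)) * exp(log_p)
    return emi


def adjusted_mutual_information(u, v):
    """(MI - E[MI]) / (max(H(U), H(V)) - E[MI]); the max normalizer is an
    assumption made for this sketch."""
    mi = mutual_information(u, v)
    emi = expected_mutual_information(u, v)
    denom = max(entropy(u), entropy(v)) - emi
    if abs(denom) < 1e-12:
        return 1.0  # degenerate case (e.g. a single cluster); a convention
    return (mi - emi) / denom
```

With this sketch, two identical clusterings score 1, while two statistically independent clusterings score at or below 0 instead of receiving a positive unadjusted value.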
Bioinformatics | 2011
Nguyen Xuan Vinh; Madhu Chetty; Ross L. Coppel; Pramod P. Wangikar
Motivation: Dynamic Bayesian networks (DBN) are widely applied in modeling various biological networks including the gene regulatory network (GRN). Due to the NP-hard nature of learning static Bayesian network structure, most methods for learning DBN also employ either local search such as hill climbing, or a meta stochastic global optimization framework such as genetic algorithm or simulated annealing. Results: This article presents GlobalMIT, a toolbox for learning the globally optimal DBN structure from gene expression data. We propose using a recently introduced information theoretic-based scoring metric named mutual information test (MIT). With MIT, the task of learning the globally optimal DBN is efficiently achieved in polynomial time. Availability: The toolbox, implemented in Matlab and C++, is available at http://code.google.com/p/globalmit. Contact: [email protected]; [email protected] Supplementary information: Supplementary data is available at Bioinformatics online.
international conference on data mining | 2010
Nguyen Xuan Vinh; Julien Epps
Traditional clustering has focused on creating a single good clustering solution, while modern, high dimensional data can often be interpreted, and hence clustered, in different ways. Alternative clustering aims at creating multiple clustering solutions that are both of high quality and distinctive from each other. Methods for alternative clustering can be divided into objective-function-oriented and data-transformation-oriented approaches. This paper presents a novel information theoretic-based, objective-function-oriented approach to generate alternative clusterings, in either an unsupervised or semi-supervised manner. We employ the conditional entropy measure for quantifying both clustering quality and distinctiveness, resulting in an analytically consistent combined criterion. Our approach employs a computationally efficient nonparametric entropy estimator, which does not impose any assumption on the probability distributions. We propose a partitional clustering algorithm, named minCEntropy, to concurrently optimize both clustering quality and distinctiveness. minCEntropy requires setting only some rather intuitive parameters, and performs competitively with existing methods for alternative clustering.
BMC Bioinformatics | 2013
Ahsan Raja Chowdhury; Madhu Chetty; Nguyen Xuan Vinh
Background: In any gene regulatory network (GRN), the complex interactions occurring amongst transcription factors and target genes can be either instantaneous or time-delayed. However, many existing modeling approaches currently applied for inferring GRNs are unable to represent both these interactions simultaneously. As a result, all these approaches cannot detect important interactions of the other type. S-System model, a differential equation based approach which has been increasingly applied for modeling GRNs, also suffers from this limitation. In fact, all S-System based existing modeling approaches have been designed to capture only instantaneous interactions, and are unable to infer time-delayed interactions. Results: In this paper, we propose a novel Time-Delayed S-System (TDSS) model which uses a set of delay differential equations to represent the system dynamics. The ability to incorporate time-delay parameters in the proposed S-System model enables simultaneous modeling of both instantaneous and time-delayed interactions. Furthermore, the delay parameters are not limited to just positive integer values (corresponding to time stamps in the data), but can also take fractional values. Moreover, we also propose a new criterion for model evaluation exploiting the sparse and scale-free nature of GRNs to effectively narrow down the search space, which not only reduces the computation time significantly but also improves model accuracy. The evaluation criterion systematically adapts the max-min in-degrees and also systematically balances the effect of network accuracy and complexity during optimization. Conclusion: The four well-known performance measures applied to the experimental studies on synthetic networks with various time-delayed regulations clearly demonstrate that the proposed method can capture both instantaneous and delayed interactions correctly with high precision. The experiments carried out on two well-known real-life networks, namely IRMA and SOS DNA repair network in Escherichia coli show a significant improvement compared with other state-of-the-art approaches for GRN modeling.
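For orientation, the standard S-System form and one plausible way of adding delays can be written as follows. The first equation is the conventional S-System; the second is only an illustrative assumption about how delay parameters might enter, and the symbol \tau_{ij} is introduced here and is not taken from the abstract.

```latex
% Standard S-System for N genes (instantaneous interactions only)
\frac{dX_i}{dt} = \alpha_i \prod_{j=1}^{N} X_j(t)^{g_{ij}}
                - \beta_i \prod_{j=1}^{N} X_j(t)^{h_{ij}},
\qquad i = 1, \dots, N

% Illustrative time-delayed variant (assumed form): gene j acts on gene i
% after a delay \tau_{ij} \ge 0, which may be fractional
\frac{dX_i}{dt} = \alpha_i \prod_{j=1}^{N} X_j(t - \tau_{ij})^{g_{ij}}
                - \beta_i \prod_{j=1}^{N} X_j(t - \tau_{ij})^{h_{ij}}
```

Setting every \tau_{ij} to zero recovers the instantaneous model, so a single system of this shape can represent both interaction types.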
Pattern Recognition | 2016
Nguyen Xuan Vinh; Shuo Zhou; Jeffrey Chan; James Bailey
Mutual information (MI) based approaches are a popular paradigm for feature selection. Most previous methods have made use of low-dimensional MI quantities that are only effective at detecting low-order dependencies between variables. Several works have considered the use of higher dimensional mutual information, but the theoretical underpinning of these approaches is not yet comprehensive. To fill this gap, in this paper, we systematically investigate the issues of employing high-order dependencies for mutual information based feature selection. We first identify a set of assumptions under which the original high-dimensional mutual information based criterion can be decomposed into a set of low-dimensional MI quantities. By relaxing these assumptions, we arrive at a principled approach for constructing higher dimensional MI based feature selection methods that takes into account higher order feature interactions. Our extensive experimental evaluation on real data sets provides concrete evidence that methodological inclusion of high-order dependencies improves MI based feature selection. Highlights: We shed light on the assumptions behind many popular mutual information based feature selection approaches. We propose a theoretically sound approach to extend current mutual information based feature selection methods to handle high-order dependencies. We provide concrete experimental evidence that high-order dependencies improve mutual information based feature selection.
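A toy example shows why low-dimensional MI quantities can miss high-order dependencies. The sketch below is not the method of the paper: it uses a hypothetical XOR data set in which each feature alone carries no information about the class, while the two features jointly determine it.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log(n * c / (px[a] * py[b]))
               for (a, b), c in pxy.items())


# Hypothetical data: the class label is the XOR of two binary features.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
y = [u ^ v for u, v in zip(a, b)]

mi_a = mutual_information(a, y)                  # feature a alone
mi_b = mutual_information(b, y)                  # feature b alone
mi_ab = mutual_information(list(zip(a, b)), y)   # both features jointly

print(mi_a, mi_b, mi_ab)
```

A criterion built only from pairwise terms such as I(a; y) and I(b; y) would rank both features as useless here, whereas the joint quantity I(a, b; y) equals the full class entropy.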
congress on evolutionary computation | 2012
Ahsan Raja Chowdhury; Madhu Chetty; Nguyen Xuan Vinh
With the advent of microarray technology, researchers are able to determine cellular dynamics for thousands of genes simultaneously, thereby enabling reverse engineering of the gene regulatory network (GRN) from high-throughput time-series gene expression data. Amongst the various currently available models for inferring GRN, the S-System formalism is often considered an excellent compromise between accuracy and mathematical tractability. In this paper, a novel approach for inferring GRN based on the decoupled S-System model, incorporating the new concept of adaptive regulatory genes cardinality, is proposed. Parameter learning for the S-System is carried out in an evolving manner using a versatile and robust Trigonometric Evolutionary Algorithm. The applicability and efficiency of the proposed method are studied using a well-known and widely studied synthetic network with various levels of noise, and excellent performance is observed. Further, investigation of a 5-gene in vivo synthetic biological network of Saccharomyces cerevisiae called IRMA succeeded in detecting a higher number of correct regulations compared to other approaches reported earlier.
bioinformatics and bioengineering | 2009
Nguyen Xuan Vinh; Julien Epps
Estimating the true number of clusters in a data set is one of the major challenges in cluster analysis. Yet in certain domains, knowing the true number of clusters is of high importance. For example, in medical research, detecting the true number of groups and sub-groups of cancer would be of utmost importance for their effective treatment. In this paper we propose a novel method to estimate the number of clusters in a microarray data set based on the consensus clustering approach. Although the main objective of consensus clustering is to discover a robust and high quality cluster structure in a data set, closer inspection of the set of clusterings obtained can often give valuable information about the appropriate number of clusters present. More specifically, the set of clusterings obtained when the specified number of clusters coincides with the true number of clusters tends to be less diverse. To quantify this diversity we develop a novel index, namely the Consensus Index (CI), which is built upon a suitable clustering similarity measure such as the well known Adjusted Rand Index (ARI) or our recently developed, information theoretic based index, namely the Adjusted Mutual Information (AMI). Our experiments on both synthetic and real microarray data sets indicate that the CI is a useful indicator for determining the appropriate number of clusters.
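One way to read the Consensus Index is as the average pairwise similarity within a set of clusterings that were all produced with the same number of clusters. The sketch below assumes that reading and uses the Adjusted Rand Index as the similarity measure; the function names are hypothetical, and how the set of clusterings is generated (for example by resampling) is left out.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(u, v):
    """Adjusted Rand Index between two clusterings given as label lists."""
    n = len(u)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(u, v)).values())
    sum_a = sum(comb(c, 2) for c in Counter(u).values())
    sum_b = sum(comb(c, 2) for c in Counter(v).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case (e.g. one cluster each); a convention
    return (sum_ij - expected) / (max_index - expected)


def consensus_index(clusterings, similarity=adjusted_rand_index):
    """Mean pairwise similarity over a set of clusterings (assumed form);
    a higher value means a less diverse set."""
    pairs = list(combinations(clusterings, 2))
    return sum(similarity(u, v) for u, v in pairs) / len(pairs)


# Hypothetical sets of clusterings of four points, all with two clusters.
stable = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]    # same partition, relabeled
unstable = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]]  # mutually inconsistent

print(consensus_index(stable), consensus_index(unstable))
```

Under this reading, the index would be computed once per candidate number of clusters, and the candidate with the highest value would be taken as the estimate.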
knowledge discovery and data mining | 2016
Shuo Zhou; Nguyen Xuan Vinh; James Bailey; Yunzhe Jia; Ian Davidson
Tensors are a natural representation for multidimensional data. In recent years, CANDECOMP/PARAFAC (CP) decomposition, one of the most popular tools for analyzing multi-way data, has been extensively studied and widely applied. However, today's datasets are often dynamically changing over time. Tracking the CP decomposition for such dynamic tensors is a crucial but challenging task, due to the large scale of the tensor and the velocity of new data arriving. Traditional techniques, such as Alternating Least Squares (ALS), cannot be directly applied to this problem because of their poor scalability in terms of time and memory. Additionally, existing online approaches have only partially addressed this problem and can only be deployed on third-order tensors. To fill this gap, we propose an efficient online algorithm that can incrementally track the CP decompositions of dynamic tensors with an arbitrary number of dimensions. In terms of effectiveness, our algorithm demonstrates comparable results with the most accurate algorithm, ALS, whilst being computationally much more efficient. Specifically, on small and moderate datasets, our approach is tens to hundreds of times faster than ALS, while for large-scale datasets, the speedup can be more than 3,000 times. Compared to other state-of-the-art online approaches, our method shows not only significantly better decomposition quality, but also better performance in terms of stability, efficiency and scalability.
BMC Systems Biology | 2012
Nizamul Morshed; Madhu Chetty; Nguyen Xuan Vinh
Background: Understanding gene interactions is a fundamental question in systems biology. Currently, modeling of gene regulations using the Bayesian Network (BN) formalism assumes that genes interact either instantaneously or with a certain amount of time delay. However, in reality, biological regulations, both instantaneous and time-delayed, occur simultaneously. A framework that can detect and model both types of interactions simultaneously would represent gene regulatory networks more accurately. Results: In this paper, we introduce a framework based on the Bayesian Network (BN) formalism that can represent both instantaneous and time-delayed interactions between genes simultaneously. A novel scoring metric having firm mathematical underpinnings is also proposed that, unlike other recent methods, can score both interactions concurrently and takes into account the reality that multiple regulators can regulate a gene jointly, rather than in an isolated pair-wise manner. Further, a gene regulatory network (GRN) inference method employing an evolutionary search that makes use of the framework and the scoring metric is also presented. Conclusion: By taking into consideration the biological fact that both instantaneous and time-delayed regulations can occur among genes, our approach models gene interactions with greater accuracy. The proposed framework is efficient and can be used to infer gene networks having multiple orders of instantaneous and time-delayed regulations simultaneously. Experiments are carried out using three different synthetic networks (with three different mechanisms for generating synthetic data) as well as real life networks of Saccharomyces cerevisiae, E. coli and cyanobacteria gene expression data. The results show the effectiveness of our approach.
Data Mining and Knowledge Discovery | 2016
Nguyen Xuan Vinh; Jeffrey Chan; Simone Romano; James Bailey; Christopher Leckie; Kotagiri Ramamohanarao; Jian Pei
We address the problem of outlying aspects mining: given a query object and a reference multidimensional data set, how can we discover what aspects (i.e., subsets of features or subspaces) make the query object most outlying? Outlying aspects mining can be used to explain any data point of interest, which itself might be an inlier or outlier. In this paper, we investigate several open challenges faced by existing outlying aspects mining techniques and propose novel solutions, including (a) how to design effective scoring functions that are unbiased with respect to dimensionality yet computationally efficient, and (b) how to efficiently search through the exponentially large search space of all possible subspaces. We formalize the concept of dimensionality unbiasedness, a desirable property of outlyingness measures. We then characterize existing scoring measures as well as our novel proposed ones in terms of efficiency, dimensionality unbiasedness and interpretability. Finally, we evaluate the effectiveness of different methods for outlying aspects discovery and demonstrate the utility of our proposed approach on both large real and synthetic data sets.