Thuy Thi Nguyen
Vietnam National University, Ho Chi Minh City
Publication
Featured research published by Thuy Thi Nguyen.
BMC Genomics | 2015
Thanh-Tung Nguyen; Joshua Zhexue Huang; Qingyao Wu; Thuy Thi Nguyen; Mark Junjie Li
Background: Single-nucleotide polymorphism (SNP) selection and identification are the most important tasks in genome-wide association data analysis. The problem is difficult because genome-wide association data are very high dimensional and a large portion of the SNPs in the data are irrelevant to the disease. Advanced machine learning methods have been successfully used in genome-wide association studies (GWAS) to identify genetic variants that have relatively large effects in some common, complex diseases. Among them, the most successful one is Random Forests (RF). Despite performing well in terms of prediction accuracy on some data sets of moderate size, RF still struggles in GWAS to select informative SNPs and build accurate prediction models. In this paper, we propose a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection in GWAS. The method first applies p-value assessment to find a cut-off point that separates informative and irrelevant SNPs into two groups. The informative SNP group is further divided into two sub-groups: highly informative and weakly informative SNPs. When sampling the SNP subspace for building the trees of the forest, only SNPs from these two sub-groups are taken into account. The feature subspaces used to split a node in a tree always contain highly informative SNPs.
Results: This approach enables one to generate more accurate trees with a lower prediction error, while possibly avoiding overfitting. It allows one to detect interactions of multiple SNPs with the diseases, and to reduce the dimensionality and the amount of genome-wide association data needed for learning the RF model. Extensive experiments on two genome-wide SNP data sets (Parkinson case-control data comprising 408,803 SNPs and Alzheimer case-control data comprising 380,157 SNPs) and 10 gene data sets have demonstrated that the proposed model significantly reduced prediction errors and outperformed most existing state-of-the-art random forests. The top 25 SNPs identified by the proposed model in the Parkinson data set included four interesting genes associated with neurological disorders.
Conclusion: The presented approach has been shown to be effective in selecting informative sub-groups of SNPs potentially associated with diseases that traditional statistical approaches might fail to detect. The new RF works well for data where the number of case-control objects is much smaller than the number of SNPs, which is a typical problem in gene data and GWAS. Experimental results demonstrated the effectiveness of the proposed RF model, which outperformed state-of-the-art RFs including Breiman's RF, GRRF and wsRF.
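The two-stage subspace sampling described in this abstract could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the two p-value thresholds (`cutoff`, `strong_cutoff`) and the `n_strong` parameter are hypothetical, and the abstract does not say how ts-RF computes its p-values or sets its cut-off points.

```python
import random

def split_snp_groups(p_values, cutoff, strong_cutoff):
    """Partition SNP indices into (highly informative, weakly informative).

    SNPs with p-value above `cutoff` are treated as irrelevant and dropped.
    Both thresholds are hypothetical inputs for this sketch.
    """
    strong = [i for i, p in enumerate(p_values) if p <= strong_cutoff]
    weak = [i for i, p in enumerate(p_values) if strong_cutoff < p <= cutoff]
    return strong, weak

def sample_subspace(strong, weak, mtry, n_strong, rng):
    """Draw up to `mtry` SNP indices for one node split.

    At least `n_strong` of them (when available) come from the highly
    informative group, so every subspace contains strong SNPs.
    """
    k_strong = min(n_strong, len(strong), mtry)
    k_weak = min(mtry - k_strong, len(weak))
    return rng.sample(strong, k_strong) + rng.sample(weak, k_weak)

# Usage with made-up p-values for six SNPs.
p_values = [0.001, 0.2, 0.03, 0.0005, 0.04, 0.9]
strong, weak = split_snp_groups(p_values, cutoff=0.05, strong_cutoff=0.01)
subspace = sample_subspace(strong, weak, mtry=3, n_strong=1, rng=random.Random(0))
```

Irrelevant SNPs never enter a subspace in this sketch, which mirrors the statement that only the two informative sub-groups are sampled.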
Machine Learning | 2015
Thanh-Tung Nguyen; Joshua Zhexue Huang; Thuy Thi Nguyen
Quantile regression forests (QRF), a tree-based ensemble method for estimating conditional quantiles, has been proven to perform well in terms of prediction accuracy, especially for range prediction. However, the model may be biased and can perform poorly on high-dimensional data (thousands of features). In this paper, we propose a new method, called bcQRF, that applies bias correction to QRF for range prediction. In bcQRF, a new feature weighting subspace sampling method is used to build the first-level QRF model. The residual term of the first-level QRF model is then used as the response feature to train the second-level QRF model for bias correction. The two-level models are used to compute bias-corrected predictions. Extensive experiments on both synthetic and real-world data sets have demonstrated that the bcQRF method significantly reduced prediction errors and outperformed most existing regression random forests. The new method performed especially well on high-dimensional data.
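The two-level residual idea can be illustrated with a minimal sketch. Note the substitutions: a tiny nearest-neighbour averaging regressor (`KnnMean`, a hypothetical stand-in) replaces the quantile regression forest, it predicts a conditional mean rather than quantiles, and residuals are computed in-sample for brevity. It is not the bcQRF implementation.

```python
class KnnMean:
    """Stand-in regressor: mean response of the k nearest 1-D training points."""

    def __init__(self, k):
        self.k = k

    def fit(self, xs, ys):
        self.xs, self.ys = list(xs), list(ys)
        return self

    def predict(self, x):
        order = sorted(range(len(self.xs)), key=lambda i: abs(self.xs[i] - x))
        nearest = order[: self.k]
        return sum(self.ys[i] for i in nearest) / len(nearest)

def fit_two_level(xs, ys, k=3):
    """Fit a first-level model, then a second-level model on its residuals."""
    level1 = KnnMean(k).fit(xs, ys)
    residuals = [y - level1.predict(x) for x, y in zip(xs, ys)]
    level2 = KnnMean(k).fit(xs, residuals)
    return level1, level2

def predict_corrected(level1, level2, x):
    """Bias-corrected prediction: first-level output plus predicted residual."""
    return level1.predict(x) + level2.predict(x)

# Usage on a convex toy function, where neighbour averaging overestimates.
xs = list(range(11))
ys = [x * x for x in xs]
level1, level2 = fit_two_level(xs, ys)
raw = level1.predict(5)
corrected = predict_corrected(level1, level2, 5)
```

On this toy data the first-level model is biased upward at interior points, and adding the predicted residual pulls the estimate back toward the true value.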
The Scientific World Journal | 2015
Thanh-Tung Nguyen; Joshua Zhexue Huang; Thuy Thi Nguyen
Random forests (RFs) have been widely used as a powerful classification method. However, because of the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This gives RFs poor accuracy when working with high-dimensional data. In addition, RFs are biased in the feature selection process, where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets, including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in both accuracy and AUC.
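As an illustration of what a feature weighting sampling step might look like, the sketch below draws distinct feature indices with probability increasing in a per-feature weight, using the Efraimidis-Spirakis key method. The function name and the weights are assumptions for this example; the abstract does not state how xRF derives its weights or combines the two feature subsets.

```python
import random

def weighted_feature_sample(weights, k, rng):
    """Draw k distinct feature indices, favouring larger weights.

    Each feature gets the key u ** (1 / w) with u uniform in [0, 1);
    the k largest keys win. All weights must be strictly positive.
    """
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return [i for _, i in keys[:k]]

# Usage: feature 0 carries ten times the weight of the others (made-up values).
rng = random.Random(0)
weights = [10.0, 1.0, 1.0, 1.0]
subspace = weighted_feature_sample(weights, k=3, rng=rng)
```

Sampling without replacement keeps each subspace free of duplicates while still letting weakly weighted features appear occasionally, which preserves some randomness across trees.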
Knowledge and Systems Engineering | 2015
Phan Thi Thu Hong; Tran Thi Thanh Hai; Le Thi Lan; Vo Ta Hoang; Vu Hai; Thuy Thi Nguyen
This paper presents a system for automated classification of rice varieties for rice seed production using computer vision and image processing techniques. Rice seeds of different varieties are so similar in color, shape and texture that classifying them with high accuracy is challenging. We investigated various feature extraction techniques for efficient rice seed image representation. We analyzed the performance of powerful classifiers on the extracted features to find the most robust one. Images of six different rice seed varieties from northern Vietnam were acquired and analyzed. Our experiments have demonstrated that the average accuracy of our classification system can reach 90.54% using the Random Forest method with a simple feature extraction technique. This result can be used for developing a computer-aided machine vision system for automated assessment of rice seed purity.
Environmental Science: Processes & Impacts | 2015
Phuong Hoang Tran; Tha Thanh Thi Luong; Thuy Thi Nguyen; Huy Quang Nguyen; Hop Van Duong; Byung Hong Kim
Iron-oxidizing bacterial consortia can be enriched in microbial fuel cells (MFCs) operated with ferrous iron as the sole electron donor. In this study, we investigated the possibility of using such lithotrophic iron-oxidizing MFC (LIO-MFC) systems as biosensors to monitor iron and manganese in water samples. When operated with anolytes containing ferrous iron as the sole electron donor, the tested LIO-MFCs generated electrical currents in response to the presence of Fe²⁺ in the anolytes. For Fe²⁺ concentrations in the range of 3-20 mM, a linear correlation between the current and the Fe²⁺ concentration could be achieved (r² = 0.98). The LIO-MFCs also responded to the presence of Mn²⁺ in the anolytes, but only when the Mn²⁺ concentration was less than 3 mM. The presence of other metal ions such as Ni²⁺ or Pb²⁺ in the anolytes reduced the Fe²⁺-associated electricity generation of the LIO-MFCs to varying degrees. Organic compounds, when present at a non-excessive level together with Fe²⁺ in the anolytes, did not affect the generation of electricity, although the compounds might serve as alternative electron donors for the anode bacteria. The performance of the LIO-MFCs was also affected to different degrees by operational parameters, including surrounding temperature, pH of the sample, buffer strength and external resistance. The results demonstrated the potential of LIO-MFCs as biosensors for sensing Fe²⁺ in water samples with significant specificity. However, the system should be operated according to an optimal procedure to ensure reliable performance.
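A linear calibration of this kind is typically used by fitting current against known concentrations and then inverting the fitted line for an unknown sample. The sketch below shows that arithmetic with ordinary least squares. The readings are invented for illustration and are not data from the study; the function names and the current units are likewise assumptions. Only the 3-20 mM linear range is taken from the abstract.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept, plus r squared."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

def estimate_concentration(current, slope, intercept, low=3.0, high=20.0):
    """Invert the calibration line; reject values outside the linear range (mM)."""
    concentration = (current - intercept) / slope
    if not low <= concentration <= high:
        raise ValueError("estimate falls outside the calibrated linear range")
    return concentration

# Hypothetical calibration points: concentration in mM, current in arbitrary units.
concentrations = [3.0, 5.0, 10.0, 15.0, 20.0]
currents = [0.1 * c + 0.05 for c in concentrations]
slope, intercept, r_squared = fit_line(concentrations, currents)
sample_mM = estimate_concentration(1.05, slope, intercept)
```

Raising an error outside the calibrated range is a design choice for this sketch: the abstract reports linearity only between 3 and 20 mM, so extrapolated estimates would not be supported.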
Pacific-Asia Conference on Knowledge Discovery and Data Mining | 2015
Thanh-Tung Nguyen; He Zhao; Joshua Zhexue Huang; Thuy Thi Nguyen; Mark Junjie Li
Random Forests (RF) models have been proven to perform well in both classification and regression. However, with the randomizing mechanism in both bagging samples and feature selection, the performance of RF can deteriorate when applied to high-dimensional data. In this paper, we propose a new approach to feature sampling for RF to deal with high-dimensional data. We first use p-values to assess feature importance and find a cut-off between informative and less informative features. The set of informative features is then further partitioned into two groups, highly informative and informative features, using some statistical measures. When sampling the feature subspace for learning RFs, features from the three groups are taken into account. The new subspace sampling method maintains the diversity and the randomness of the forest and enables one to generate trees with a lower prediction error. In addition, quantile regression is employed to obtain predictions in the regression problem for robustness to outliers. The experimental results demonstrated that the proposed approach for learning random forests significantly reduced prediction errors and outperformed most existing random forests when dealing with high-dimensional data.
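The robustness argument for quantile-based predictions can be seen in a small example. The sketch below computes a weighted quantile over a set of training responses, which is the general form a quantile regression forest prediction takes, and compares the median with the mean when one response is an outlier. The function name, the responses and the equal weights are made up for illustration.

```python
def weighted_quantile(responses, weights, q):
    """Smallest response whose cumulative normalised weight reaches q."""
    pairs = sorted(zip(responses, weights))
    total = sum(weights)
    cumulative = 0.0
    for response, weight in pairs:
        cumulative += weight
        if cumulative / total >= q:
            return response
    return pairs[-1][0]

# Five training responses reaching the same leaf; the last one is an outlier.
responses = [1.0, 2.0, 3.0, 4.0, 100.0]
weights = [1.0] * len(responses)
median_prediction = weighted_quantile(responses, weights, 0.5)
mean_prediction = sum(responses) / len(responses)
```

The single outlier drags the mean far from the bulk of the data, while the median prediction stays among the typical responses.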
Pattern Recognition Letters | 2018
Van-Hung Le; Hai Vu; Thuy Thi Nguyen; Thi-Lan Le; Thanh-Hai Tran
Estimating parameters of a geometrical model from 3-D point cloud data is an important problem in computer vision. Random sample consensus (RANSAC) and its variations have been proposed for the estimation of the model parameters. However, RANSAC is computationally expensive, and the problem is challenging when the measured 3-D data contain noise and outliers. This paper presents an efficient sampling technique for RANSAC, in which geometrical constraints are utilized for selecting good samples for a robust estimation. The constraints are based on two predefined criteria. First, the samples must be consistent with the estimated model; second, the selected samples must satisfy explicit geometrical constraints of the objects of interest. The proposed approach is wrapped as a robust estimator, named GCSAC (Geometrical Constraint SAmple Consensus), for estimating a cylindrical object from a 3-D point cloud. Extensive experiments on various data sets show that our method outperforms the other robust estimators tested (e.g. MLESAC) in terms of both the precision of the estimated model and computational time. The implementations and evaluation datasets used in this paper are made publicly available.
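The general pattern of screening RANSAC samples with a geometric constraint before scoring them can be sketched as below. This is a deliberately simplified stand-in, not GCSAC itself: it fits a 2-D line instead of a cylinder, and the constraint (the two sampled points must be at least `min_sep` apart) is a hypothetical example chosen for illustration. All function and parameter names are assumptions.

```python
import math
import random

def line_through(p, q):
    """Line a*x + b*y + c = 0 through p and q, with (a, b) a unit normal."""
    a, b = q[1] - p[1], p[0] - q[0]
    norm = math.hypot(a, b)
    a, b = a / norm, b / norm
    return a, b, -(a * p[0] + b * p[1])

def constrained_ransac(points, n_iter, tol, min_sep, rng):
    """RANSAC line fit that skips samples violating a separation constraint.

    Assumes min_sep > 0 so that degenerate (coincident) samples are rejected.
    """
    best_model, best_inliers = None, []
    for _ in range(n_iter):
        p, q = rng.sample(points, 2)
        # Geometric screening: discard the sample before the costly scoring step.
        if math.hypot(q[0] - p[0], q[1] - p[1]) < min_sep:
            continue
        a, b, c = line_through(p, q)
        inliers = [pt for pt in points if abs(a * pt[0] + b * pt[1] + c) <= tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b, c), inliers
    return best_model, best_inliers

# Usage: 20 points on y = 2x + 1 plus four gross outliers.
points = [(float(x), 2.0 * x + 1.0) for x in range(20)]
points += [(1.0, 30.0), (5.0, -20.0), (12.0, 50.0), (17.0, 0.0)]
model, inliers = constrained_ransac(points, n_iter=200, tol=0.1, min_sep=3.0,
                                    rng=random.Random(0))
```

Rejecting a sample before counting inliers saves one pass over the data per rejected sample, which is where a constraint of this kind can reduce computation.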
Knowledge and Systems Engineering | 2017
Duc-Hau Le; Van-Huy Pham; Thuy Thi Nguyen
Many studies have shown associations between microRNAs and human diseases. A number of computational methods have been proposed to predict such associations by ranking candidate microRNAs according to their relevance to a disease. Among them, network-based methods usually rely on microRNA functional similarity networks, which are constructed from microRNA-target interactions. Therefore, the prediction performance of these methods is highly dependent on the quality of such interactions, which are usually predicted by computational methods. Meanwhile, machine learning-based methods usually formulate disease miRNA prediction as a classification problem, where novel associations between disease and miRNA are predicted based on known disease-miRNA associations. However, those methods are mainly based on single binary classifiers, which limits their prediction performance. In this study, we propose a new method, namely RFMDA, to predict disease-associated miRNAs. Our method is based on Random Forest (RF), an ensemble technique in which the final classifier is constructed from a multitude of decision trees, to perform the prediction. In order to compare with previous methods, we use the same procedure to build training samples, where positive training samples are known disease-miRNA associations. In addition, the features of each sample measure functional similarities between miRNAs or phenotypical similarities between phenotypes. Simulation results showed that RFMDA outperformed previous learning-based methods, including two binary classifiers (i.e., Naïve Bayes and two-class Support Vector Machines) and one semi-supervised classifier (i.e., Regularized Least Squares). Moreover, using the trained model, we can predict novel miRNAs associated with some diseases such as breast cancer, colorectal cancer and hepatocellular carcinoma.
Knowledge and Systems Engineering | 2015
Van-Hung Le; Hai Vu; Thuy Thi Nguyen; Thi-Lan Le; Thi-Thanh-Hai Tran; Michiel Vlaminck; Wilfried Philips; Peter Veelaert
Finding an object in a 3D scene is an important problem in robotics, especially in assistive systems for visually impaired people. In most systems, the first and most important step is detecting an object in a complex environment. In this paper, we propose a method for finding an object using geometrical constraints on depth images from a Kinect. The main advantage of the approach is that it is invariant to lighting conditions and to the color and texture of the objects. Our approach does not require a training phase, so it can reduce the time spent preparing data and learning a model. The objects of interest, such as coffee mugs, bowls and boxes, have a simple geometrical structure and rest on a table. Overall, our approach is faster and more accurate than methods that use 2D features on depth images to train an object model.
Advanced Data Analysis and Classification | 2018
Qiang Wang; Thanh-Tung Nguyen; Joshua Z. Huang; Thuy Thi Nguyen
In this paper, we propose a new random forest (RF) algorithm for classifying high-dimensional data, using a subspace feature sampling method and feature value searching. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. A greedy technique is used to handle high-cardinality categorical features for efficient node splitting when building decision trees in the forest. This allows trees to handle very high cardinality while reducing the computational time needed to build the RF model. Extensive experiments on high-dimensional real data sets, including standard machine learning data sets and image data sets, have been conducted. The results demonstrated that the proposed approach for learning RFs significantly reduced prediction errors and outperformed most existing RFs when dealing with high-dimensional data.