Thanh-Nghi Do
Can Tho University
Publications
Featured research published by Thanh-Nghi Do.
IEEE Pacific Visualization Symposium | 2008
Niklas Elmqvist; Thanh-Nghi Do; Howard Goodell; Nathalie Henry; Jean-Daniel Fekete
We present the zoomable adjacency matrix explorer (ZAME), a visualization tool for exploring graphs at a scale of millions of nodes and edges. ZAME is based on an adjacency matrix graph representation aggregated at multiple scales. It allows analysts to explore a graph at many levels, zooming and panning with interactive performance from an overview to the most detailed views. Several components work together in the ZAME tool to make this possible. Efficient matrix ordering algorithms group related elements. Individual data cases are aggregated into higher-order meta-representations. Aggregates are arranged into a pyramid hierarchy that allows for on-demand paging to GPU shader programs to support smooth multiscale browsing. Using ZAME, we are able to explore the entire French Wikipedia - over 500,000 articles and 6,000,000 links - with interactive performance on standard consumer-level computer hardware.
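A minimal sketch of the multiscale aggregation idea described above, assuming a small dense NumPy matrix for illustration (ZAME itself works on far larger, GPU-resident representations, and the names below are our own): each pyramid level halves the matrix resolution by summing 2x2 blocks of the level below.

```python
import numpy as np

def aggregation_pyramid(adj, levels):
    """Build a pyramid of aggregated adjacency matrices: a cell at
    level k summarizes a 2^k x 2^k block of the original matrix."""
    pyramid = [adj]
    for _ in range(levels):
        m = pyramid[-1]
        n = m.shape[0] // 2 * 2                 # drop an odd trailing row/col
        m = m[:n, :n]
        # sum each 2x2 block into one aggregate cell of the coarser level
        pyramid.append(m.reshape(n // 2, 2, n // 2, 2).sum(axis=(1, 3)))
    return pyramid

# toy example: an 8x8 random adjacency matrix aggregated over 3 levels
rng = np.random.default_rng(0)
adj = (rng.random((8, 8)) < 0.2).astype(float)
for level, m in enumerate(aggregation_pyramid(adj, 3)):
    print(f"level {level}: {m.shape}")
```

Zooming then amounts to paging the appropriate level (and sub-region) of such a pyramid to the renderer on demand.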
International Conference on Research, Innovation and Vision for the Future | 2006
Thanh-Nghi Do; François Poulet
The incremental, parallel and distributed Support Vector Machine (SVM) algorithm proposed in this paper, using linear or nonlinear kernels, aims at classifying very large datasets on standard personal computers (PCs). SVM and kernel-related methods have been shown to build accurate models, but the learning task usually requires solving a quadratic program, so learning from large datasets demands large memory capacity and long training times. We extend the recent Least Squares SVM (LS-SVM) proposed by Suykens and Vandewalle to build an incremental, parallel and distributed SVM algorithm. The new algorithm is very fast and can handle very large datasets in linear or nonlinear classification tasks on PCs. As an example of its effectiveness, it performs the linear classification of one billion datapoints in a 20-dimensional input space into two classes in a few minutes on ten PCs (3 GHz Pentium IV, 512 MB RAM).
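The key property behind incremental training, sketched below under a deliberately simplified formulation (a ridge-regression-style linear LS-SVM with targets in {-1, +1}; the class layout and names are ours, not the paper's code): training reduces to a (d+1)x(d+1) linear system, and the sufficient statistics X'X and X'y can be accumulated chunk by chunk, so memory use is independent of the number of rows.

```python
import numpy as np

class IncrementalLinearLSSVM:
    """Row-incremental linear LS-SVM-style classifier (sketch)."""

    def __init__(self, dim, c=1.0):
        self.c = c
        self.xtx = np.zeros((dim + 1, dim + 1))   # accumulated X'X
        self.xty = np.zeros(dim + 1)              # accumulated X'y
        self.w = None

    def partial_fit(self, X, y):                  # y in {-1, +1}
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # bias column
        self.xtx += Xb.T @ Xb
        self.xty += Xb.T @ y
        return self

    def finalize(self):
        d1 = self.xtx.shape[0]
        # solve (I/c + X'X) w = X'y once all chunks are absorbed
        self.w = np.linalg.solve(np.eye(d1) / self.c + self.xtx, self.xty)
        return self

    def predict(self, X):
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        return np.sign(Xb @ self.w)
```

Because the accumulated statistics are just sums, partitions processed on different machines can be combined by addition, which is what makes the parallel and distributed variants straightforward.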
Advanced Data Mining and Applications | 2008
Thanh-Nghi Do; Van Hoa Nguyen; François Poulet
We present a new parallel and incremental Support Vector Machine (SVM) algorithm for the classification of very large datasets on graphics processing units (GPUs). SVM and kernel-related methods have been shown to build accurate models, but the learning task usually requires solving a quadratic program, so this task for large datasets demands large memory capacity and long training times. We extend the recent Least Squares SVM (LS-SVM) proposed by Suykens and Vandewalle to build an incremental and parallel algorithm. The new algorithm uses graphics processors to gain high performance at low cost. Numerical test results on the UCI and Delve dataset repositories showed that our parallel incremental algorithm using GPUs is about 70 times faster than a CPU implementation and often significantly faster (over 1,000 times) than state-of-the-art algorithms such as LibSVM, SVM-perf and CB-SVM.
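The per-chunk work in such an incremental scheme is dominated by the dense products X'X and X'y, which is exactly the kind of linear algebra GPUs accelerate. A hedged sketch using CuPy, a modern NumPy-compatible GPU library that post-dates this 2008 paper (it only illustrates the idea; the original used the GPU programming model of its era):

```python
import cupy as cp   # NumPy-compatible arrays that live on the GPU

def accumulate_gram_gpu(chunks):
    """Accumulate X'X and X'y on the GPU, one data chunk at a time."""
    xtx, xty = None, None
    for X, y in chunks:                  # each chunk: (NumPy X, NumPy y)
        Xg, yg = cp.asarray(X), cp.asarray(y)   # host -> device transfer
        g, v = Xg.T @ Xg, Xg.T @ yg             # dense products on the GPU
        xtx = g if xtx is None else xtx + g
        xty = v if xty is None else xty + v
    return cp.asnumpy(xtx), cp.asnumpy(xty)     # device -> host
```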
Knowledge Discovery and Data Mining | 2008
Philippe Lenca; Stéphane Lallich; Thanh-Nghi Do; Nguyen-Khang Pham
In data mining, large differences in prior class probabilities, known as the class imbalance problem, have been reported to hinder the performance of classifiers such as decision trees. Dealing with imbalanced and cost-sensitive data has been recognized as one of the 10 most challenging problems in data mining research. In decision tree learning, many measures are based on the concept of Shannon's entropy. A major characteristic of these entropies is that they take their maximal value when the distribution of the modalities of the class variable is uniform. To deal with the class imbalance problem, we proposed an off-centered entropy which takes its maximum value at a distribution fixed by the user. This distribution can be the a priori distribution of the class variable modalities or a distribution taking into account the costs of misclassification. Other authors have proposed an asymmetric entropy. In this paper we present the concepts of the three entropies and compare their effectiveness on 20 imbalanced datasets. All our experiments are based on the C4.5 decision tree algorithm, in which only the entropy function is modified. The results are promising and show the value of off-centered entropies for dealing with the class imbalance problem.
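For the two-class case with positive-class frequency p and a user-fixed target frequency theta, one standard construction (hedged: consistent with the off-centered entropy literature, but see the paper for the exact definition) remaps p piecewise-linearly so that theta lands on 1/2, then applies Shannon's entropy, shifting the maximum from the uniform distribution to theta:

```latex
% Shannon entropy H(p) = -p \log_2 p - (1-p)\log_2(1-p) is maximal at p = 1/2.
% Off-centering for a target frequency \theta \in (0, 1):
\[
  \pi(p) =
  \begin{cases}
    \dfrac{p}{2\theta} & \text{if } p \le \theta,\\[1.5ex]
    \dfrac{p + 1 - 2\theta}{2(1 - \theta)} & \text{if } p > \theta,
  \end{cases}
  \qquad
  \eta_\theta(p) = H\!\left(\pi(p)\right),
\]
% so that \eta_\theta takes its maximum value 1 at p = \theta.
```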
International Conference on Machine Learning and Applications | 2007
Thanh-Nghi Do; Jean-Daniel Fekete
Boosting of least-squares support vector machine (LS-SVM) algorithms can classify large datasets on standard personal computers (PCs). We extend the LS-SVM proposed by Suykens and Vandewalle in several ways to efficiently classify large datasets. We developed a row-incremental version for datasets with billions of data points and up to 10,000 dimensions. By adding a Tikhonov regularization term and using the Sherman-Morrison-Woodbury formula, we developed a column-incremental LS-SVM to process datasets with a small number of data points but very high dimensionality. Finally, by applying boosting to these incremental LS-SVM algorithms, we developed classification algorithms for massive, very-high-dimensional datasets, and we also applied these ideas to build boosting of other efficient SVM algorithms proposed by Mangasarian, including Lagrange SVM (LSVM), proximal SVM (PSVM) and Newton SVM (NSVM). Numerical test results on the UCI, RCV1-binary, Reuters-21578, Forest cover type and KDD Cup 1999 datasets showed that our algorithms are often significantly faster and/or more accurate than the state-of-the-art algorithms LibSVM, SVM-perf and CB-SVM.
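The column-incremental trick rests on a push-through identity that follows from the Sherman-Morrison-Woodbury formula: for a Tikhonov-regularized system, (lam*I_d + X'X)^-1 X'y = X' (lam*I_n + XX')^-1 y, so with few rows but many columns one can solve an n x n system instead of a d x d one. A minimal sketch under our own simplified ridge formulation (not the paper's code):

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Solve (lam*I + X'X) w = X'y directly: d x d system."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

def ridge_dual(X, y, lam):
    """Same solution via the Woodbury identity: the system is only
    n x n, so it stays cheap when d >> n (few rows, many columns)."""
    n = X.shape[0]
    alpha = np.linalg.solve(lam * np.eye(n) + X @ X.T, y)
    return X.T @ alpha

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5000))      # n=50 rows, d=5000 columns
y = rng.choice([-1.0, 1.0], size=50)
w1, w2 = ridge_primal(X, y, 1e-2), ridge_dual(X, y, 1e-2)
print(np.allclose(w1, w2, atol=1e-6))    # True: identical solutions
```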
EGC (best of volume) | 2010
Thanh-Nghi Do; Philippe Lenca; Stéphane Lallich; Nguyen-Khang Pham
The random forests method is one of the most successful ensemble methods. However, random forests do not perform well when dealing with very-high-dimensional data in the presence of dependencies. In this case one can expect many combinations between the variables, and unfortunately the usual random forests method does not effectively exploit this situation. We here investigate a new approach for supervised classification with a huge number of numerical attributes. We propose a random oblique decision trees method: it randomly chooses a subset of predictive attributes and uses an SVM as the split function over these attributes. We compare, on 25 datasets, the effectiveness, with classical measures (e.g. precision, recall, F1-measure and accuracy), of random forests of random oblique decision trees against SVMs and random forests of C4.5. Our proposal has significantly better performance on very-high-dimensional datasets, with slightly better results on lower-dimensional datasets.
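A minimal sketch of one node of such a tree, using scikit-learn's LinearSVC as a stand-in split function (the helper name, subset size and SVM variant are our assumptions; the paper's implementation differs in details):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

def oblique_split(X, y, n_attrs, rng):
    """One node of a random oblique decision tree (sketch): draw a
    random subset of attributes, train a linear SVM on it, and use
    the signed decision value to route examples left or right."""
    attrs = rng.choice(X.shape[1], size=n_attrs, replace=False)
    svm = LinearSVC(C=1.0).fit(X[:, attrs], y)
    left = svm.decision_function(X[:, attrs]) < 0.0
    return attrs, svm, left              # recurse on X[left] and X[~left]

X, y = make_classification(n_samples=300, n_features=1000, random_state=0)
attrs, svm, left = oblique_split(X, y, n_attrs=32, rng=np.random.default_rng(0))
print(left.sum(), (~left).sum())         # sizes of the two child nodes
```

A full tree recurses on the two sides with a fresh random attribute subset at every node, which is what lets the forest capture combinations of variables that axis-parallel splits miss.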
Vietnam Journal of Computer Science | 2014
Thanh-Nghi Do
The new parallel multiclass stochastic gradient descent algorithms aim at classifying millions of images with very-high-dimensional signatures into thousands of classes. We extend stochastic gradient descent for support vector machines (SVM-SGD) in several ways to develop a new multiclass SVM-SGD for efficiently classifying large image datasets into many classes. We propose (1) a balanced training algorithm for learning binary SVM-SGD classifiers, and (2) a parallel training process for the classifiers on several multi-core computers or a grid. The evaluation on the 1,000 classes of ImageNet ILSVRC 2010 shows that our algorithm is 270 times faster than the state-of-the-art linear classifier LIBLINEAR.
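A hedged sketch of the binary building block: a Pegasos-style SGD update on the hinge loss, with the balanced-sampling scheme shown (positives and negatives drawn with equal probability regardless of class sizes) being our reading of point (1) rather than the paper's exact procedure:

```python
import numpy as np

def svm_sgd_balanced(X_pos, X_neg, lam=1e-4, epochs=5, seed=0):
    """Binary linear SVM trained by SGD on the hinge loss, with
    balanced sampling of the two classes (sketch)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X_pos.shape[1])
    steps = epochs * (len(X_pos) + len(X_neg))
    for t in range(1, steps + 1):
        if rng.random() < 0.5:                     # balanced draw
            x, y = X_pos[rng.integers(len(X_pos))], 1.0
        else:
            x, y = X_neg[rng.integers(len(X_neg))], -1.0
        eta = 1.0 / (lam * t)                      # decaying step size
        w *= 1.0 - eta * lam                       # regularization shrink
        if y * (w @ x) < 1.0:                      # hinge loss is active
            w += eta * y * x
    return w
```

In the multiclass setting one such classifier is trained per class, one-versus-all; since the binary problems are independent, they parallelize trivially across cores and machines, which is point (2).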
International Conference on Enterprise Information Systems | 2004
François Poulet; Thanh-Nghi Do
In this paper, we present new support vector machine (SVM) algorithms that can be used to classify very large datasets on standard personal computers. The algorithms extend three recent SVM algorithms: least squares SVM classification, the finite Newton method for classification, and incremental proximal SVM classification. The extension consists of building incremental, parallel and distributed SVMs for classification. Our three new algorithms are very fast and can handle very large datasets. As an example of their effectiveness, the new algorithms classify one billion points in a 10-dimensional input space into two classes in a few minutes on ten personal computers (800 MHz Pentium III, 256 MB RAM, Linux).
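For the linear case, the distributed extension rests on the fact that the sufficient statistics are additive across data partitions: X'X equals the sum of the per-partition X_k'X_k. A minimal sketch under our own simplified proximal-SVM-style formulation (function names hypothetical):

```python
import numpy as np

def local_stats(X, y):
    """Statistics each PC computes on its own data partition."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # bias column
    return Xb.T @ Xb, Xb.T @ y

def combine_and_solve(stats, c=1.0):
    """Master node: per-partition Gram matrices simply add up, so the
    global model is solved once from the summed statistics."""
    xtx = sum(s[0] for s in stats)
    xty = sum(s[1] for s in stats)
    return np.linalg.solve(np.eye(len(xty)) / c + xtx, xty)
```

Each PC thus ships only a small (d+1) x (d+1) matrix and a (d+1)-vector to the master, never its raw data, which is why billions of points spread over ten machines remain tractable.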
Vietnam Journal of Computer Science | 2015
Thanh-Nghi Do; Philippe Lenca; Stéphane Lallich
Classifying fingerprint images may require an important feature-extraction step. The scale-invariant feature transform (SIFT), which extracts local descriptors from images, is robust to image scale and rotation and also to changes in illumination, noise, etc. It allows an image to be represented as a convenient bag of visual words. This representation leads to a very large number of dimensions. In this case, random forests of oblique decision trees are very efficient for a small number of classes. However, in fingerprint classification there are as many classes as individuals. A multi-class version of random forests of oblique decision trees is thus proposed. Numerical tests on seven real datasets (up to 5,000 dimensions and 389 classes) show that our proposal has very high accuracy and outperforms state-of-the-art algorithms.
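A hedged sketch of the bag-of-visual-words pipeline the abstract describes, using OpenCV's SIFT and scikit-learn's k-means as stand-ins (the paper's exact codebook construction and parameters may differ; `images` is assumed to be a list of grayscale uint8 arrays):

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bag_of_visual_words(images, n_words=500):
    """Represent images as histograms over a SIFT codebook (sketch)."""
    sift = cv2.SIFT_create()
    descs = []
    for img in images:
        _, d = sift.detectAndCompute(img, None)   # local SIFT descriptors
        if d is not None:
            descs.append(d)
    # cluster all descriptors into n_words visual words (the codebook)
    codebook = KMeans(n_clusters=n_words, n_init=4).fit(np.vstack(descs))
    hists = []
    for d in descs:
        words = codebook.predict(d)               # assign each descriptor
        h = np.bincount(words, minlength=n_words).astype(float)
        hists.append(h / h.sum())                 # normalized histogram
    return np.array(hists)
```

The resulting histograms are the very-high-dimensional signatures on which the multi-class forest of oblique decision trees is then trained.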
Discovery Science | 2004
Thanh-Nghi Do; François Poulet
Understanding the result produced by a data-mining algorithm is as important as its accuracy. Unfortunately, support vector machine (SVM) algorithms act as a “black box”: they provide only the support vectors used to classify the data, even if with good accuracy. This paper presents a cooperative approach using SVM algorithms and visualization methods to gain insight into the model construction task with SVM algorithms. We show how the user can interactively use cooperative tools to support the construction of SVM models and to interpret them. A pre-processing step is also used for dealing with large datasets. The experimental results on Delve, Statlog, UCI and bio-medical datasets show that our cooperative tool is comparable to the automatic LibSVM algorithm, while giving the user a better understanding of the obtained model.
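A minimal illustration of the kind of cooperative view the paper advocates, not its actual tool: plot the data with the learned linear decision boundary, its margins, and the support vectors highlighted, so the user can see what the “black box” model found (synthetic 2-D data for readability):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm

X, y = datasets.make_blobs(n_samples=200, centers=2, random_state=0)
clf = svm.SVC(kernel="linear", C=1.0).fit(X, y)

# decision boundary w.x + b = 0 and its margins at +/- 1
w, b = clf.coef_[0], clf.intercept_[0]
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
plt.scatter(X[:, 0], X[:, 1], c=y, s=15)
plt.scatter(*clf.support_vectors_.T, s=90,
            facecolors="none", edgecolors="k", label="support vectors")
for offset in (-1.0, 0.0, 1.0):
    plt.plot(xs, (offset - b - w[0] * xs) / w[1], "k--" if offset else "k-")
plt.legend()
plt.show()
```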