Nguyen Hua Phung
Ho Chi Minh City University of Technology
Publications
Featured research published by Nguyen Hua Phung.
National Foundation for Science and Technology Development Conference on Information and Computer Science | 2015
Vo Thi Ngoc Chau; Nguyen Hua Phung; Vo Thi Ngoc Tran
Data clustering is a popular task in educational data mining, used to group similar students by aspects such as study performance, behavior, and skill. Related works have employed many well-known clustering algorithms, such as k-means, expectation-maximization, and spectral clustering. None of them has taken into consideration the incompleteness of the educational data gathered in an academic credit system. If only a few records have missing values, they can be ignored in the mining task. However, when there are a large number of missing values, ignoring them may leave too little data and make the mining task ineffective. Hence, we define a robust and effective algorithmic framework for incomplete educational data clustering using the nearest prototype strategy. Within the framework, we propose two novel incomplete educational data clustering algorithms, K_nps and S_nps, based on the k-means algorithm and the self-organizing map, respectively. Experimental results show that the clusters from our proposed algorithms have better quality than those from existing approaches.
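As a rough illustration of the nearest prototype idea, the Python sketch below runs k-means on records with missing values. It is not the authors' K_nps algorithm: the partial-distance rule, the fill-from-nearest-centroid step, the sample grades, and the names `partial_distance` and `kmeans_incomplete` are all assumptions made for this example.

```python
import math

def partial_distance(row, centroid):
    """Euclidean distance over the features observed in `row` (None = missing)."""
    pairs = [(x, c) for x, c in zip(row, centroid) if x is not None]
    if not pairs:
        return float("inf")
    # Rescale so rows with few observed features stay comparable to complete rows.
    scale = len(row) / len(pairs)
    return math.sqrt(scale * sum((x - c) ** 2 for x, c in pairs))

def kmeans_incomplete(rows, init_centroids, iterations=10):
    """k-means variant (assumed, for illustration) that tolerates missing values."""
    centroids = [list(c) for c in init_centroids]
    labels, filled = [0] * len(rows), []
    for _ in range(iterations):
        # Assignment step: nearest prototype using observed features only.
        labels = [
            min(range(len(centroids)), key=lambda k: partial_distance(r, centroids[k]))
            for r in rows
        ]
        # Fill step: copy missing values from the assigned prototype.
        filled = [
            [c if x is None else x for x, c in zip(r, centroids[lab])]
            for r, lab in zip(rows, labels)
        ]
        # Update step: recompute each prototype from its filled members.
        for k in range(len(centroids)):
            members = [f for f, lab in zip(filled, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids, filled

# Hypothetical grade records; None marks a course not yet taken.
records = [[1.0, 1.0], [1.2, None], [8.0, 8.0], [None, 8.2]]
labels, centroids, filled = kmeans_incomplete(records, [[1.0, 1.0], [8.0, 8.0]])
```

Filling from the assigned prototype keeps every record in the clustering instead of dropping incomplete ones, which is the situation the abstract describes.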
International Conference on Data Mining | 2014
Nguyen Truc Mai Anh; Vo Thi Ngoc Chau; Nguyen Hua Phung
Educational data classification is an educational data mining task that classifies students based on their study performance. Although many data classification techniques and methods are available nowadays, educational data classification faces many challenges that emerge in an academic credit system. One challenge often encountered when trying to identify in-trouble students early is data incompleteness. Hence, we aim at a robust approach to this inevitable and challenging problem. Unlike existing works on incomplete data handling, our work explores the semantics of incomplete data in the education domain on the application side and the two-phase characteristics of the classification task on the technical side. An empirical study on real educational data sets with different percentages of incomplete data shows that robust approaches that handle incomplete data based on its semantics in relation to class information can enhance the effectiveness of educational data classifiers.
Multi-disciplinary Trends in Artificial Intelligence | 2016
Vo Thi Ngoc Chau; Nguyen Hua Phung
Educational data mining aims to provide useful knowledge hidden in educational data to better support educational decision making. However, a large set of educational data is not always ready for a data mining task, owing to the peculiarities of the academic system as well as the data collection time. In our work, we focus on a study status prediction task at the program level, where the data are collected and processed once a year within the time frame of the program of interest in an academic credit system. When little labeled educational data is available, the effectiveness of the task might suffer; the task should therefore be treated as a semi-supervised learning process, rather than a conventional supervised one, so that a larger set of unlabeled data can be exploited. In particular, we define a random forest-based self-training algorithm, named minSemi-RF, for the study status prediction task at the program level. The minSemi-RF algorithm combines the Tri-training and Self-training styles in such a way that a random forest-based self-training algorithm becomes a parameter-free variant of the Tri-training algorithm. The algorithm produces a final classifier that inherits the advantages of a random forest model. Experimental results on real data sets show that our algorithm is effective and practical for early in-trouble student detection in an academic credit system, compared with some existing semi-supervised learning methods.
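The general self-training loop that such an algorithm builds on can be sketched in Python as follows. This is plain self-training, not minSemi-RF: a nearest-centroid classifier stands in for the random forest to keep the example dependency-free, and the confidence measure, the `threshold` parameter, the labels, and all function names are assumptions.

```python
import math

def fit_centroids(X, y):
    """Stand-in base learner (assumed): one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, v in enumerate(row):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_with_confidence(centroids, row):
    """Return (label, confidence); 0.5 means a tie, 1.0 means certain."""
    dists = sorted((math.dist(row, c), label) for label, c in centroids.items())
    best_d, best_label = dists[0]
    if len(dists) == 1:
        return best_label, 1.0
    total = best_d + dists[1][0]
    return best_label, (0.5 if total == 0 else dists[1][0] / total)

def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10):
    """Repeatedly move confidently predicted rows into the labeled set."""
    X, y, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        model = fit_centroids(X, y)
        scored = [(row, *predict_with_confidence(model, row)) for row in pool]
        confident = [(row, label) for row, label, conf in scored if conf >= threshold]
        if not confident:
            break  # nothing left that the model is sure about
        for row, label in confident:
            X.append(row)
            y.append(label)
        pool = [row for row, _, conf in scored if conf < threshold]
    return fit_centroids(X, y), pool

# Hypothetical two-feature student records with made-up status labels.
model, leftover = self_train(
    [[0.0, 0.0], [10.0, 10.0]], ["ok", "risk"],
    [[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]],
)
```

The fixed `threshold` is exactly the kind of parameter the abstract says minSemi-RF avoids; the sketch only shows the baseline loop being improved upon.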
The 2013 RIVF International Conference on Computing & Communication Technologies - Research, Innovation, and Vision for Future (RIVF) | 2013
Vo Thi Ngoc Chau; Nguyen Hua Phung
Educational data mining is emerging in the data mining research arena. Despite being an applied field of data mining techniques and methods, educational data mining is full of challenges that have not been completely resolved. In particular, data classification in an academic credit system is a very tough task: on the technical side it must deal with class imbalance and missing data, and on the practical side it must cope with the flexibility of the education system, which leads to heterogeneous data. In this paper, we present our approach, which combines a hybrid resampling scheme with a random forest, for the imbalanced multi-class educational data classification task based on students' performance. The proposed approach has not previously been available in educational data mining. Moreover, our empirical study shows it to be effective for predicting students' final study status and usable in a knowledge-driven educational decision support system.
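A hybrid resampling step of the kind mentioned here can be sketched in Python as random undersampling of large classes plus random oversampling of small ones. The abstract does not specify the authors' scheme, so the mean-class-size target, the status labels, and the function name `hybrid_resample` are assumptions.

```python
import random
from collections import defaultdict

def hybrid_resample(X, y, target=None, seed=0):
    """Bring every class to `target` rows: shrink big classes, grow small ones."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    if target is None:
        # Assumed default: the mean class size, rounded.
        target = round(len(y) / len(by_class))
    X_out, y_out = [], []
    for label, rows in by_class.items():
        if len(rows) > target:
            # Undersample: pick `target` rows without replacement.
            chosen = rng.sample(rows, target)
        else:
            # Oversample: keep all rows and duplicate random ones.
            chosen = rows + [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(chosen)
        y_out.extend([label] * target)
    return X_out, y_out

# Hypothetical imbalanced data set: 9 / 2 / 1 rows per class.
X = [[i] for i in range(12)]
y = ["graduating"] * 9 + ["study-stop"] * 2 + ["first-warned"] * 1
X_bal, y_bal = hybrid_resample(X, y)
```

The balanced output would then be passed to a classifier such as a random forest; resampling should be applied to the training split only, so that evaluation still reflects the real class distribution.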
International Conference on Future Data and Security Engineering | 2017
Vo Thi Ngoc Chau; Nguyen Hua Phung
This paper investigates an educational data classification task at the program level. The task concentrates on predicting the final study status of each student from the second year to the fourth year of their study path, so that in-trouble students can be identified as early as possible. However, the task faces two main problems. The first is the existence of incomplete data when an early prediction is made; the second is the lack of labeled data for a supervised learning process. To overcome these difficulties, our work proposes a robust semi-supervised learning method with sparse data handling in either a sequential or an iterative approach. The sparse data handling process performs k-nearest neighbors-based data imputation, and the semi-supervised learning process, with a random forest model as its base learner, exploits the larger set of unlabeled data available in the task. These two processes can be conducted in sequence or integrated with each other for robustness and effectiveness in educational data classification. The experimental results show that our resulting robust random forest-based self-training algorithm, with the iterative approach to sparse data handling, outperforms the other algorithms that use sequential and traditional approaches. This algorithm provides a more effective classifier as a practical solution on educational data over time.
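The k-nearest neighbors-based imputation step can be illustrated with a short Python sketch. It is a generic version, not the authors' procedure: the distance over shared features, the averaging of donor values, the sample grades, and the name `knn_impute` are assumptions.

```python
import math

def knn_impute(rows, k=2):
    """Fill each missing value (None) with the mean of that feature
    over the k nearest rows that have the feature observed."""
    def distance(a, b):
        shared = [(x, z) for x, z in zip(a, b) if x is not None and z is not None]
        if not shared:
            return float("inf")
        # Root mean squared difference over the features both rows have.
        return math.sqrt(sum((x - z) ** 2 for x, z in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d != float("inf")]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Hypothetical course grades; the first student has not taken course 3 yet.
grades = [[7.0, 8.0, None], [7.5, 8.5, 9.0], [6.5, 7.5, 8.0], [2.0, 3.0, 1.0]]
completed = knn_impute(grades, k=2)
```

In the sequential approach described above, a step like this would run once before training; in the iterative approach it would be repeated inside the semi-supervised loop as newly labeled rows become available.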
2017 4th NAFOSTED Conference on Information and Computer Science | 2017
Vo Thi Ngoc Chau; Nguyen Hua Phung
Educational data mining has received much attention worldwide because of its significance in the education domain. Among the many educational data mining tasks, early in-trouble student prediction is a popular one. This task focuses on identifying students who are at risk in their study as soon as possible, before the end of the permitted study period. For early detection, data shortage is a challenge at both the instance level and the set level. At the instance level, only incomplete data can be gathered for each student in the early study period; at the set level, few labeled data can be collected because the final study status is not yet known. A solution to the task in such a context is therefore required. In this paper, we propose a robust random forest-based Tri-training algorithm that overcomes this data shortage challenge. In particular, an incomplete data handling method is integrated into the iterative mechanism of the semi-supervised learning process of the original Tri-training algorithm, making the algorithm more robust. In addition, a new combination of the Tri-training algorithm and a random forest model is examined so that each classifier of the Tri-training model can give more accurate predictions. As a result, the proposed algorithm is an effective solution to the early in-trouble student prediction task. Its effectiveness is confirmed by better experimental results on real data sets in comparison with existing methods that use a preprocessing approach.
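The core labeling rule of Tri-training, in which an unlabeled row is handed to one classifier when the other two agree on its label, can be sketched in Python as below. This is a single simplified round: it omits the error-rate checks of the original Tri-training algorithm and the incomplete data handling the paper adds, and it uses a 1-nearest-neighbor stand-in instead of a random forest. All names and data are assumptions.

```python
import math
import random

def nearest_label(train, row):
    """Stand-in base learner (assumed): label of the closest training row."""
    return min(train, key=lambda item: math.dist(item[0], row))[1]

def tri_training_round(labeled, unlabeled, seed=0):
    """One simplified Tri-training round over (row, label) pairs."""
    rng = random.Random(seed)
    # Each of the three learners starts from its own bootstrap sample.
    samples = [[rng.choice(labeled) for _ in labeled] for _ in range(3)]
    additions = [[], [], []]
    for row in unlabeled:
        votes = [nearest_label(s, row) for s in samples]
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            # Learner i receives the row only if the other two learners agree.
            if votes[j] == votes[k]:
                additions[i].append((row, votes[j]))
    return [s + a for s, a in zip(samples, additions)]

# Hypothetical labeled and unlabeled student records.
labeled = [([0.0, 0.0], "ok"), ([1.0, 0.5], "ok"),
           ([9.0, 9.0], "risk"), ([10.0, 9.5], "risk")]
unlabeled = [[0.5, 0.5], [9.5, 9.0], [5.0, 4.0]]
training_sets = tri_training_round(labeled, unlabeled, seed=1)
```

Because each learner is taught by the agreement of its two peers rather than by its own confidence, no confidence threshold has to be chosen.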
Archive | 2016
Hoang Thi Hong Van; Vo Thi Ngoc Chau; Nguyen Hua Phung
In educational data mining, frequent patterns and association rules are popular means of gaining insight into the characteristics of students and their study. Nonetheless, the frequent patterns and rules discovered in existing works are simple and carry no temporal information along the students’ study paths. Indeed, many sequential pattern and rule mining techniques consider only a sequence of ordered events with no explicit time. To obtain sequential rules with explicit timestamps from temporal educational databases that contain timestamp-extended sequences, our work defines a tree-based rule mining algorithm that works on the frequent sequences generated and organized in a prefix tree enhanced with explicit timestamps. Experimental results on real educational datasets show that the proposed algorithm provides more informative sequential rules with explicit timestamps. It is also more efficient than the brute-force list-based algorithm, because it optimizes the manipulations on the prefix tree for sequential rules with explicit timestamps.
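To make the idea of a prefix tree with explicit timestamps concrete, the Python sketch below stores each event together with its time offset from the start of the sequence and reads rules off parent-child pairs. It is a simplification, not the authors' algorithm: it only considers whole prefixes rather than arbitrary subsequences, and the thresholds, event names, and function names are assumptions.

```python
def build_prefix_tree(sequences):
    """sequences: lists of (timestamp, event) pairs sorted by timestamp.
    Each tree node is keyed by (offset from first timestamp, event)."""
    root = {"count": len(sequences), "children": {}}
    for seq in sequences:
        if not seq:
            continue
        start = seq[0][0]
        node = root
        for t, event in seq:
            key = (t - start, event)
            node = node["children"].setdefault(key, {"count": 0, "children": {}})
            node["count"] += 1
    return root

def timed_rules(node, prefix=(), min_conf=0.6, min_count=2):
    """Yield (antecedent, consequent, confidence) for each tree edge whose
    child is frequent enough and follows its prefix often enough."""
    found = []
    for key, child in node["children"].items():
        if prefix and child["count"] >= min_count:
            conf = child["count"] / node["count"]
            if conf >= min_conf:
                found.append((prefix, key, conf))
        found.extend(timed_rules(child, prefix + (key,), min_conf, min_count))
    return found

# Hypothetical study paths; timestamps are semester numbers.
paths = [
    [(1, "fail_math"), (2, "retake_math"), (3, "warned")],
    [(1, "fail_math"), (2, "retake_math"), (3, "pass")],
    [(1, "fail_math"), (2, "retake_math"), (3, "warned")],
    [(1, "pass_math"), (2, "pass_phys")],
]
rules = timed_rules(build_prefix_tree(paths))
```

Each rule reads as "if the antecedent events occurred at these offsets, the consequent event occurs at its offset with the given confidence", which is the extra temporal information that untimed sequential rules lack.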
International Conference on Advanced Computing | 2015
Lu Thi Kim Phung; Vo Thi Ngoc Chau; Nguyen Hua Phung
Early detection of in-trouble students in an academic credit system is an emerging problem in the educational data mining research arena. This problem has been treated as a multi-class educational data classification task. Although many existing supervised learning algorithms can provide acceptable classification models, the interpretability of these models needs to be investigated before they can be applied in practice. Random forests have been examined and appear to be an appropriate solution for effectively classifying students for early in-trouble student detection in a credit system. However, random forests are black-box ensemble models that cannot explain the reasoning behind their predictions. Therefore, in this paper, we define a rule extraction algorithm named ExtractingRuleRF to derive an interpretable, refined classification rule set from a random forest for a multi-class data classification task. The proposed algorithm follows a greedy approach with two phases: rule refinement and rule extraction. In the first phase, we prepare a ranked, weighted rule set that is more interpretable than the input random forest and has equivalent classification power, by retaining the forest's classification scheme. In the second phase, the rule extraction process returns the best rules for the highest accuracy and/or full coverage, based on the priority of each ranked rule. The theoretical analysis of the algorithm and experimental results on real educational data sets show that ExtractingRuleRF can produce a more effective and interpretable rule-based classification model than its corresponding random forest. This result makes our knowledge-based educational decision support, with its interpretable classification rules, more practical.
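The second phase, picking rules greedily from a ranked list until the training rows are covered, might look like the Python sketch below. It is not ExtractingRuleRF itself: the rules are written by hand rather than derived from a random forest, and the rule format, weights, feature names, and function names are assumptions.

```python
INF = float("inf")

def matches(rule, row):
    """A rule fires when every listed feature falls in its [low, high) range."""
    return all(low <= row[feature] < high for feature, (low, high) in rule["when"].items())

def extract_rules(ranked_rules, rows):
    """Walk the rules in rank order and keep a rule only if it covers
    at least one row that no earlier kept rule covers."""
    uncovered = set(range(len(rows)))
    kept = []
    for rule in ranked_rules:
        newly_covered = {i for i in uncovered if matches(rule, rows[i])}
        if newly_covered:
            kept.append(rule)
            uncovered -= newly_covered
        if not uncovered:
            break  # full coverage reached
    return kept, uncovered

def classify(kept, row, default=None):
    """Predict with the first kept rule that fires."""
    for rule in kept:
        if matches(rule, row):
            return rule["then"]
    return default

# Hypothetical weighted rules and student records.
rules = [
    {"when": {"gpa": (3.0, INF)}, "then": "graduating", "weight": 0.9},
    {"when": {"gpa": (0.0, 2.0)}, "then": "at-risk", "weight": 0.8},
    {"when": {"gpa": (3.2, INF), "credits": (50, INF)}, "then": "graduating", "weight": 0.7},
    {"when": {"gpa": (2.0, 3.0)}, "then": "studying", "weight": 0.6},
]
students = [{"gpa": 3.5, "credits": 60}, {"gpa": 1.2, "credits": 20}, {"gpa": 2.4, "credits": 45}]
ranked = sorted(rules, key=lambda r: r["weight"], reverse=True)
kept, uncovered = extract_rules(ranked, students)
```

Dropping rules that add no new coverage is what keeps the final rule set short enough to read, at the cost of ignoring the accuracy criterion that the abstract also mentions.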
International Conference on Inventive Communication and Computational Technologies | 2018
Vo Thi Ngoc Chau; Nguyen Hua Phung
VNU Journal of Science: Computer Science and Communication Engineering | 2018
Vo Thi Ngoc Chau; Nguyen Hua Phung