Mai-Vu Tran
Vietnam National University, Hanoi
Publication
Featured research published by Mai-Vu Tran.
PLOS ONE | 2013
Nigel Collier; Mai-Vu Tran; Hoang-Quynh Le; Quang-Thuy Ha; Anika Oellrich; Dietrich Rebholz-Schuhmann
The identification of phenotype descriptions in the scientific literature, case reports and patient records is a rewarding task for bio-medical text mining. Any progress will support knowledge discovery and linkage to other resources. However, because of their wide variation, a number of challenges remain in their identification and semantic normalisation before they can be fully exploited for research purposes. This paper presents novel techniques for identifying potential complex phenotype mentions by exploiting a hybrid model based on machine learning, rules and dictionary matching. A systematic study is made of how to combine sequence labels from these modules, as well as of the merits of various ontological resources. We evaluated our approach on a subset of Medline abstracts cited by the Online Mendelian Inheritance in Man database related to auto-immune diseases. Using partial matching, the best micro-averaged F-score for phenotypes and five other entity classes was 79.9%. A best performance of 75.3% was achieved for phenotype candidates using all semantic resources. We observed the advantage of using SVM-based learn-to-rank for sequence label combination over maximum entropy and a priority list approach. The results indicate that the identification of simple entity types such as chemicals and genes is robustly supported by single semantic resources, whereas phenotypes require combinations. Altogether, we conclude that our approach coped well with the compositional structure of phenotypes in the auto-immune domain.
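The abstract above reports a micro-averaged F-score under partial matching. As a minimal sketch of what that metric usually means, the Python below pools true positives, false positives and false negatives over all entity classes before computing precision and recall. The function names, the per-class count format and the overlap rule for "partial matching" are assumptions for illustration, not the paper's evaluation script.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # half-open character offsets (start, end); hypothetical format


def spans_overlap(a: Span, b: Span) -> bool:
    """One common reading of partial matching: two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Micro-averaged F-score from per-class (tp, fp, fn) counts.

    Counts are pooled across classes first, so frequent classes
    weigh more than rare ones.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    example = {"phenotype": (8, 2, 4), "gene": (10, 0, 0)}
    print(round(micro_f1(example), 4))
```

Under this sketch a predicted span would count as a true positive when it overlaps a gold span of the same class; a stricter exact-match evaluation would require identical boundaries.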
Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi) | 2014
Nigel Collier; Mai-Vu Tran; Ferdinand Paster
Current research in fully supervised biomedical named entity recognition (bioNER) is often conducted in a setting of low sample sizes. Whilst experimental results show strong performance in-domain, it has been recognised that quality suffers when models are applied to heterogeneous text collections. However, the causal factors have until now been uncertain. In this paper we describe a controlled experiment into near-domain bias for two Medline corpora on hereditary diseases. Five strategies are employed for mitigating the impact of near-domain transference: simple transference, pooling, stacking, class re-labeling and feature augmentation. We measure their effect on F-score performance against an in-domain baseline. Stacking and feature augmentation mitigate F-score loss but do not necessarily result in superior performance except for selected classes. Simple pooling of data across domains failed to exploit size effects for most classes. We conclude that we can expect lower performance and higher annotation costs if we do not adequately compensate for the distributional dissimilarities of domains during learning.
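Of the five strategies listed above, feature augmentation is the easiest to show in a few lines. The sketch below follows one widely used form of the idea, in which every feature is copied into a shared "general" version and a domain-specific version so that the learner can decide which weights transfer across corpora. Whether the paper uses exactly this variant is an assumption, and the function name, the `general::` prefix and the feature strings are hypothetical.

```python
from typing import Dict


def augment_features(features: Dict[str, float], domain: str) -> Dict[str, float]:
    """Duplicate each feature into a shared copy and a domain-specific copy.

    A model trained on the augmented features can put weight on the shared
    copy when a feature behaves the same in both corpora, and on the
    domain copy when it does not.
    """
    augmented: Dict[str, float] = {}
    for name, value in features.items():
        augmented[f"general::{name}"] = value
        augmented[f"{domain}::{name}"] = value
    return augmented


if __name__ == "__main__":
    # Invented token features and corpus name, for illustration only.
    token_features = {"word=lupus": 1.0, "suffix=us": 1.0}
    print(augment_features(token_features, "corpusA"))
```

Pooling, by contrast, would simply concatenate the two training sets without marking which corpus each instance came from.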
Knowledge and Systems Engineering | 2012
Mai-Vu Tran; Minh Hoang Nguyen; Sy-Quan Nguyen; Minh-Tien Nguyen; Xuan-Hieu Phan
Event extraction is a complex and interesting topic in information extraction that covers methods for extracting events from free text or web data. The output of event extraction systems can be used in several fields, such as risk analysis systems, online monitoring systems or decision support tools. In this paper, we introduce a method that combines lexico-semantic rules and machine learning to extract events from Vietnamese news. Furthermore, we describe VnLoc, an online event monitoring system built on the proposed method to extract events in Vietnamese. In the experiments, we evaluated the method using precision, recall and F1 measures. At this stage, we investigated three types of event: FIRE, CRIME and TRANSPORT ACCIDENT.
International Conference on Asian Language Processing | 2011
Hoang-Quynh Le; Mai-Vu Tran; Nhat-Nam Bui; Nguyen-Cuong Phan; Quang-Thuy Ha
Personal names are among the most frequently searched items in web search engines, and a person entity is always associated with numerous properties. In this paper, we propose an integrated model for Vietnamese that simultaneously recognizes a person entity and extracts the values of a pre-defined set of properties related to that person. We also design a rich feature set using various kinds of knowledge resources and apply conditional random fields (CRFs), a well-known machine learning method, to improve the results. The results show that our method is suitable for Vietnamese, with average results of 84% precision, 82.56% recall and 83.39% F-measure. Moreover, processing time is good, and the results also show the effectiveness of our feature set.
Asia-Pacific Services Computing Conference | 2011
Huyen-Trang Pham; Tien-Thanh Vu; Mai-Vu Tran; Quang-Thuy Ha
Feature-based opinion mining is an interesting opinion mining problem in which feature words/phrases are discovered at the sentence level. However, customers usually use different words/phrases to refer to the same feature in reviews. To produce a meaningful summary, synonymous feature words/phrases in a domain need to be grouped under the same feature. This paper proposes a solution for grouping synonymous features in Vietnamese customer reviews based on semi-supervised SVM-kNN classification and HAC clustering. Experimental results on reviews in the mobile phone domain demonstrate that the proposed method is promising for the task. Purity and Accuracy are 0.68 and 0.65, respectively.
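The Purity figure reported above is a standard clustering measure: each cluster is credited with its most frequent gold label, and the credits are summed and divided by the number of items. A minimal sketch follows, assuming cluster assignments and gold feature labels are given as parallel lists; the function name and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable], gold_labels: Sequence[Hashable]) -> float:
    """Fraction of items that carry the majority gold label of their cluster."""
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have the same length")
    if not cluster_ids:
        return 0.0
    per_cluster: Dict[Hashable, Counter] = {}
    for cluster, label in zip(cluster_ids, gold_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(cluster_ids)


if __name__ == "__main__":
    # Toy example: five feature phrases grouped into two clusters.
    clusters = [0, 0, 0, 1, 1]
    gold = ["screen", "screen", "battery", "battery", "battery"]
    print(purity(clusters, gold))
```

Purity rises trivially as the number of clusters grows, which is one reason it is usually reported alongside a second measure such as accuracy.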
International Conference on Asian Language Processing | 2010
Mai-Vu Tran; Xuan-Tu Tran; Huy-Long Uong
The Internet is a vast but complicated information resource, and recommendation systems help users find the information they need by providing personalized suggestions. This research area is receiving more and more attention from researchers, and such systems are used on well-known websites such as eBay and Amazon. In this paper, we propose a recommendation system for Vietnamese electronic newspapers that uses content-based filtering techniques combined with users' interests as recorded in their profiles. These interests are determined by inferring a set of common hidden topics from the documents that users preferred. Experimental results show that the approach is feasible and has potential for real-world deployment.
International Conference on Asian Language Processing | 2011
Duc-Trong Le; Mai-Vu Tran; Tri-Thanh Nguyen; Quang-Thuy Ha
The co-reference resolution task still poses many challenges due to the complexity of the Vietnamese language and the lack of standard Vietnamese linguistic resources. Based on the mention-pair model of Rahman and Ng (2009) and the characteristics of Vietnamese, this paper proposes a model using support vector machines (SVM) to resolve co-reference in Vietnamese documents. The corpus used to evaluate the proposed model was constructed from 200 articles in the cultural and social categories of the vnexpress.net newspaper website. In initial experiments, the proposed model achieved 76.51% accuracy, compared with 73.79% for the baseline model with similar features.
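In a mention-pair model such as the one named above, co-reference is cast as binary classification over pairs of mentions. The sketch below shows only the instance-generation step, in a simplified all-pairs form: every earlier mention is paired with every later one and labeled 1 if both refer to the same entity. The function name, the tuple format and the sample mentions are hypothetical, and the paper may select training pairs differently.

```python
from itertools import combinations
from typing import List, Tuple

Mention = Tuple[str, int]  # (surface text, gold entity id), in document order


def mention_pairs(mentions: List[Mention]) -> List[Tuple[str, str, int]]:
    """Build (antecedent, anaphor, label) training instances from ordered mentions."""
    return [
        (first[0], second[0], int(first[1] == second[1]))
        for first, second in combinations(mentions, 2)
    ]


if __name__ == "__main__":
    # Invented mentions, for illustration only.
    doc = [("Nam", 1), ("he", 1), ("Hanoi", 2)]
    for pair in mention_pairs(doc):
        print(pair)
```

Each pair would then be turned into a feature vector and passed to a binary classifier such as an SVM; the pairwise decisions are finally merged into co-reference chains.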
ICCSAMA | 2015
Ngoc Trinh Vu; Van-Hien Tran; Thi-Huyen-Trang Doan; Hoang-Quynh Le; Mai-Vu Tran
Building a labeled corpus with sufficient data and good coverage while keeping cost, effort and time under control is a popular research topic in natural language processing. The automatic or semi-automatic construction of training data has become a concern of the research community. For this reason, we consider the problem of building a corpus for phenotype entity recognition, with class-specific feature detectors learned from unlabeled data, based on over 10,260 unique terms (more than 15,000 synonyms) describing human phenotypic features in the Human Phenotype Ontology (HPO) and about 9,000 unique terms (about 24,000 synonyms) of mouse abnormal phenotype descriptions in the Mammalian Phenotype Ontology. The corpus was evaluated on three corpora, the Khordad corpus, Phenominer 2012 and Phenominer 2013, using Maximum Entropy and Beam Search methods. The performance is good for the three corpora, with F-scores of 31.71% and 35.77% for the Phenominer 2012 and Phenominer 2013 corpora and 78.36% for the Khordad corpus.
International Conference on Computational Linguistics | 2012
Nigel Collier; Mai-Vu Tran; Hoang-Quynh Le; Anika Oellrich; Ai Kawazoe; Martin Hall-May; Dietrich Rebholz-Schuhmann
Pacific Asia Conference on Language, Information and Computation | 2012
Mai-Vu Tran; Duc-Trong Le; Xuan-Tu Tran; Tien-Tung Nguyen