Xuan Hieu Phan
Japan Advanced Institute of Science and Technology
Publications
Featured research published by Xuan Hieu Phan.
IEEE Transactions on Knowledge and Data Engineering | 2011
Xuan Hieu Phan; Cam-Tu Nguyen; Dieu-Thu Le; Le-Minh Nguyen; Susumu Horiguchi; Quang-Thuy Ha
This paper introduces a hidden topic-based framework for processing short and sparse documents (e.g., search result snippets, product descriptions, book/movie summaries, and advertising messages) on the Web. The framework focuses on solving two main challenges posed by these kinds of documents: 1) data sparseness and 2) synonyms/homonyms. The former leads to a lack of shared words and contexts among documents, while the latter poses major linguistic obstacles in natural language processing (NLP) and information retrieval (IR). The underlying idea of the framework is that common hidden topics discovered from large external data sets (universal data sets), when included, make short documents less sparse and more topic-oriented. Furthermore, hidden topics from universal data sets help handle unseen data better. The proposed framework can also be applied to different natural languages and data domains. We carefully evaluated the framework in two experiments on two important online applications (Web search result classification and matching/ranking for contextual advertising) with large-scale universal data sets, and achieved significant results.
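The enrichment idea can be illustrated with a minimal sketch: snippets are expanded with pseudo-tokens for hidden topics inferred from an external collection, so sparse documents gain shared features. The topic-word table and the snippet below are hypothetical examples, not the paper's actual LDA output.

```python
# Sketch of hidden-topic enrichment for a short snippet, assuming a
# topic-word table already inferred (e.g. by LDA) from a universal dataset.
# TOPIC_WORDS and the snippet are illustrative, not from the paper's data.

TOPIC_WORDS = {
    "finance": {"bank", "loan", "interest", "credit"},
    "nature":  {"river", "bank", "fish", "water"},
}

def infer_topics(tokens, topic_words, min_overlap=2):
    """Return topics whose word sets overlap the snippet enough."""
    return sorted(t for t, words in topic_words.items()
                  if len(words & set(tokens)) >= min_overlap)

def enrich(snippet, topic_words):
    """Append pseudo-tokens 'topic=<name>' so sparse snippets share features."""
    tokens = snippet.lower().split()
    return tokens + ["topic=" + t for t in infer_topics(tokens, topic_words)]

enriched = enrich("fish swim near the river bank", TOPIC_WORDS)
# The enriched token list now carries "topic=nature", disambiguating "bank"
```

Downstream classifiers then see the topic pseudo-tokens as ordinary features, which is the topic-oriented representation the paper describes.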
ACM Transactions on Asian Language Information Processing | 2009
Cam-Tu Nguyen; Xuan Hieu Phan; Susumu Horiguchi; Thu-Trang Nguyen; Quang-Thuy Ha
Web search clustering is a solution for reorganizing search results (also called “snippets”) in a way that is more convenient for browsing. There are three key requirements for such post-retrieval clustering systems: (1) the clustering algorithm should group similar documents together; (2) clusters should be labeled with descriptive phrases; and (3) the clustering system should provide high-quality clustering without downloading whole Web pages. This article introduces a novel framework for clustering Web search results in Vietnamese that targets these three issues. The main motivation is that by enriching short snippets with hidden topics drawn from huge document collections on the Internet, we are able to cluster and label such snippets effectively in a topic-oriented manner without processing whole Web pages. Our approach is based on recent successful topic analysis models, such as Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation. The underlying idea of the framework is that we collect a very large external data collection called a “universal dataset,” and then build a clustering system on both the original snippets and a rich set of hidden topics discovered from that collection. This can be seen as a richer representation of the snippets to be clustered. We carry out a careful evaluation of our method and show that it yields impressive clustering quality.
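A toy sketch of the topic-oriented clustering step: each snippet is assigned the hidden topic with the largest word overlap, and the topic name doubles as a descriptive cluster label. The topic table and snippets are hypothetical stand-ins for distributions a real topic model would provide.

```python
from collections import defaultdict

# Toy sketch of topic-oriented snippet clustering. TOPICS approximates a
# topic-word table learned from a universal dataset; real systems would use
# probabilistic topic distributions rather than word sets.
TOPICS = {
    "football": {"goal", "match", "player", "league"},
    "cooking":  {"recipe", "oven", "flour", "bake"},
}

def cluster_snippets(snippets, topics):
    """Group snippets by their best-overlapping topic; the topic name
    serves as a human-readable cluster label."""
    clusters = defaultdict(list)
    for s in snippets:
        toks = set(s.lower().split())
        label = max(topics, key=lambda t: len(topics[t] & toks))
        clusters[label].append(s)
    return dict(clusters)

clusters = cluster_snippets(
    ["the player scored a late goal",
     "bake the flour mixture in a hot oven"], TOPICS)
```

This addresses requirements (1) and (2) at once: similar snippets share a topic, and the topic supplies the cluster's descriptive phrase.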
IEICE Transactions on Information and Systems | 2006
Xuan Hieu Phan; Le-Minh Nguyen; Susumu Horiguchi
Cross-document personal name resolution is the process of identifying whether a common personal name mentioned in different documents refers to the same individual. Most previous approaches rely on lexical matching, such as the occurrence of common words surrounding the entity name, to measure the similarity between documents, and then cluster the documents according to their referents. Despite certain successes, measuring similarity by lexical comparison ignores important linguistic phenomena at the semantic level, such as synonymy and paraphrase. This paper presents a semantics-based approach to resolving personal names across documents that makes the most of both lexical evidence and semantic clues. In our method, the similarity values between documents are determined by estimating the semantic relatedness between words. Further, semantic labels attached to sentences allow us to highlight the common personal facts that are potentially shared among documents. An evaluation on three Web datasets demonstrates that our method achieves better performance than previous work.
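The difference between lexical and semantics-aware similarity can be sketched as follows. The paper estimates semantic relatedness between words; here that is approximated crudely by a hand-made synonym map, purely for illustration.

```python
# Sketch of document similarity with synonym awareness. The paper's
# semantic relatedness estimates are approximated here by a hypothetical
# synonym map; a pure lexical match would miss physician ~ doctor.
SYNONYMS = {"physician": "doctor", "automobile": "car"}

def normalize(tokens):
    """Map each token to a canonical form so synonyms compare equal."""
    return {SYNONYMS.get(t, t) for t in tokens}

def jaccard(a, b):
    """Jaccard similarity over synonym-normalized token sets."""
    a, b = normalize(a), normalize(b)
    return len(a & b) / len(a | b)

sim = jaccard(["john", "is", "a", "physician"],
              ["john", "works", "as", "a", "doctor"])
```

Without normalization the two contexts share only {john, a}; with it, "physician" and "doctor" also match, raising the similarity used for clustering referents.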
Meeting of the Association for Computational Linguistics | 2006
Le-Minh Nguyen; Akira Shimazu; Xuan Hieu Phan
We present a learning framework for structured support vector models in which boosting and bagging methods are used to construct ensemble models. We also propose a selection method based on a switching model among the outputs of individual classifiers for natural language parsing problems. The switching model uses subtrees mined from the corpus and a boosting-based algorithm to select the most appropriate output. Applying the proposed framework to the domain of semantic parsing shows advantages over the original large margin methods.
Advanced Information Networking and Applications | 2008
Zhiwei Zhang; Xuan Hieu Phan; Susumu Horiguchi
Text categorization is an important research area in information retrieval. To save storage space and achieve better accuracy, efficient and effective feature selection methods that reduce the data before analysis are highly desirable. Previous research on feature selection usually relies on a single measure, such as information gain. In this paper, we propose a new feature selection method that combines hidden topic analysis with entropy-based feature ranking. Experiments on the well-known Reuters-21578 and Ohsumed datasets show that our method achieves better classification accuracy while reducing the feature dimension dramatically.
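The entropy-based ranking half of the method can be sketched directly: a feature concentrated in one class has low entropy over the class labels and is ranked as more discriminative. The toy documents below are hypothetical.

```python
import math
from collections import Counter

# Sketch of entropy-based feature ranking for text categorization.
# A feature that appears in only one class has entropy 0 (most useful);
# one spread evenly across classes has maximal entropy (least useful).

def class_entropy(feature, docs):
    """Entropy of the class distribution among docs containing `feature`."""
    counts = Counter(label for tokens, label in docs if feature in tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical labeled documents: (token set, class label)
docs = [({"goal", "match"}, "sport"), ({"goal", "team"}, "sport"),
        ({"the", "match"}, "news"),  ({"the", "vote"}, "news")]

# "goal" occurs only in sport docs -> entropy 0; "match" is split -> entropy 1
ranking = sorted({"goal", "match"}, key=lambda f: class_entropy(f, docs))
```

Keeping only the lowest-entropy features shrinks the dimension before training, which is the reduction the abstract describes; the hidden-topic analysis would supply additional topic-level features to rank alongside words.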
IEICE Transactions on Information and Systems | 2007
Xuan Hieu Phan; Le-Minh Nguyen; Yasushi Inoguchi; Susumu Horiguchi
Conditional random fields (CRFs) have been successfully applied to various tasks of predicting and labeling structured data, such as natural language tagging & parsing, image segmentation & object recognition, and protein secondary structure prediction. The key advantages of CRFs are the ability to encode a variety of overlapping, non-independent features from empirical data, as well as the capability of performing global normalization and optimization. However, estimating parameters for CRFs is very time-consuming due to the intensive forward-backward computation needed to evaluate the likelihood function and its gradient during training. This paper presents high-performance training of CRFs on massively parallel processing systems that allows us to handle huge datasets with hundreds of thousands of data sequences and millions of features. We performed experiments on an important natural language processing task (text chunking) on large-scale corpora and achieved significant results in terms of both reduced computational time and improved prediction accuracy.
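The forward pass that dominates CRF training cost can be sketched in a few lines: the partition function sums over all label paths, computed in O(T·L²) by dynamic programming per sequence (it is this per-sequence work that the paper distributes across processors). The scores below are hypothetical; a real implementation works in log-space with scaling for numerical stability.

```python
import math

# Minimal forward pass for a linear-chain model: computes log Z, the
# normalizer a CRF must evaluate for every training sequence.
# emit[t][y]  = score of label y at position t (hypothetical values)
# trans[y][y'] = score of transitioning from label y to y'

def log_partition(emit, trans):
    """Dynamic-programming sum over all label paths (O(T * L^2))."""
    alpha = list(emit[0])                      # scores after position 0
    for t in range(1, len(emit)):
        alpha = [emit[t][y] +
                 math.log(sum(math.exp(alpha[yp] + trans[yp][y])
                              for yp in range(len(alpha))))
                 for y in range(len(alpha))]
    return math.log(sum(math.exp(a) for a in alpha))

emit = [[0.0, 0.0], [0.0, 0.0]]    # two positions, two labels, uniform scores
trans = [[0.0, 0.0], [0.0, 0.0]]
logZ = log_partition(emit, trans)  # four equally scored paths -> log 4
```

Because each sequence's forward-backward pass is independent, distributing sequences across nodes and summing gradients is the natural parallelization strategy for the scale of data the paper targets.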
International Journal of Business Intelligence and Data Mining | 2005
Xuan Hieu Phan; Susumu Horiguchi; Tu Bao Ho
Extracting data from the Web is an important information extraction task. Most existing approaches rely on wrappers, which require human knowledge and user interaction during extraction. This paper proposes conditional models as an alternative solution to this task. Leveraging the strengths of conditional models such as maximum entropy and maximum entropy Markov models, our method offers three major advantages: full automation, the ability to incorporate various non-independent, overlapping features of different hypertext representations, and the ability to deal with missing and disordered data fields. Experimental results on a wide range of e-commerce websites with different layouts show that our method achieves a satisfactory trade-off between automation and accuracy, and provides a practical approach to automated data extraction from the Web.
PLOS ONE | 2010
Vitali Sintchenko; Stephen Anthony; Xuan Hieu Phan; Frank Lin; Enrico Coiera
Background: Computational discovery is playing an ever-greater role in supporting the processes of knowledge synthesis. A significant proportion of the more than 18 million manuscripts indexed in the PubMed database describe infectious disease syndromes and various infectious agents. This study is the first attempt to integrate online repositories of text-based publications and microbial genome databases in order to explore the dynamics of relationships between pathogens and infectious diseases. Methodology/Principal Findings: Herein we demonstrate how the knowledge space of infectious diseases can be computationally represented, quantified, and tracked over time. The knowledge space is explored by mapping the infectious disease literature, examining the dynamics of literature deposition, zooming in from the pathogen to the genome level, and searching for new associations. Syndromic signatures for different pathogens can be created to enable a new and clinically focused reclassification of the microbial world. Examples of syndrome and pathogen networks illustrate how multilevel network representations of the relationships between infectious syndromes, pathogens, and pathogen genomes can illuminate unexpected biological similarities in disease pathogenesis and epidemiology. Conclusions/Significance: This new approach based on text and data mining can support the discovery of previously hidden associations between diseases and microbial pathogens, enable clinically relevant reclassification of pathogenic microorganisms, and accelerate translational research.
Conference on Information and Knowledge Management | 2010
Cam-Tu Nguyen; Natsuda Kaothanthong; Xuan Hieu Phan; Takeshi Tokuyama
Image annotation automatically associates semantic labels with images in order to provide a more convenient way of indexing and searching images on the Web. This paper proposes a novel method for image annotation based on feature-word and word-topic distributions. The introduction of topics enables us to efficiently take word associations, such as {ocean, fish, coral}, into account in image annotation. Feature-word distributions are used to define weights in the computation of topic distributions for annotation. By doing so, topic models from text mining can be applied directly in our method. Experiments show that our method obtains promising improvements over the state-of-the-art method, Supervised Multiclass Labeling (SML).
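The scoring idea can be sketched as a mixture: mixing word-topic distributions with a per-image topic distribution lets correlated words (ocean, fish, coral) reinforce one another. All distributions below are hypothetical placeholders for what the feature-word and topic models would produce.

```python
# Sketch of topic-based annotation scoring:
#   p(word | image) = sum_t p(word | topic t) * p(topic t | image)
# The distributions here are hypothetical, for illustration only.

WORD_GIVEN_TOPIC = {
    "sea":  {"ocean": 0.5, "fish": 0.3, "coral": 0.2},
    "city": {"street": 0.6, "car": 0.4},
}

def annotation_scores(topic_dist, word_given_topic):
    """Mix word-topic distributions by the image's topic distribution."""
    scores = {}
    for topic, p_t in topic_dist.items():
        for word, p_w in word_given_topic[topic].items():
            scores[word] = scores.get(word, 0.0) + p_w * p_t
    return scores

# Hypothetical image whose visual features point mostly to the "sea" topic
scores = annotation_scores({"sea": 0.8, "city": 0.2}, WORD_GIVEN_TOPIC)
best = max(scores, key=scores.get)
```

Ranking words by these mixture scores yields topically coherent label sets rather than independently chosen words, which is the benefit the abstract claims over per-label classifiers.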
Web Intelligence | 2008
Dieu-Thu Le; Cam-Tu Nguyen; Quang-Thuy Ha; Xuan Hieu Phan; Susumu Horiguchi
In online contextual advertising, ad messages are displayed according to the content of the target Web page. This raises a problem for the information retrieval community: how to select the most relevant ad messages given the content of a page. To deal with this problem, we propose a framework that takes advantage of large-scale external datasets. The framework provides a mechanism for discovering semantic relations between Web pages and ad messages by analyzing their topics. This helps overcome the mismatch caused by unimportant words and by the difference in vocabularies between Web pages and ad messages. The framework has been evaluated through a number of experiments and shows a significant improvement in accuracy over word/lexicon-based matching and ranking methods.
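The matching step can be sketched as ranking ads by similarity of topic distributions rather than raw word overlap, so an ad can match a page even when their vocabularies are disjoint. The topic vectors below are hypothetical.

```python
import math

# Sketch of topic-level ad matching: pages and ads are compared by cosine
# similarity of their topic distributions (hypothetical 3-topic vectors),
# sidestepping vocabulary mismatch between page text and ad copy.

def cosine(u, v):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u)) *
            math.sqrt(sum(b * b for b in v)))
    return dot / norm

page = [0.7, 0.2, 0.1]   # topic distribution inferred for the target page
ads = {"travel_ad": [0.8, 0.1, 0.1],
       "tech_ad":   [0.1, 0.1, 0.8]}
ranked = sorted(ads, key=lambda a: cosine(page, ads[a]), reverse=True)
```

The ad whose topic profile most resembles the page's is served first, which is the topic-level relevance signal the framework adds on top of lexical matching.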