
Publication


Featured research published by Viet Ha-Thuc.


Conference on Information and Knowledge Management (CIKM) | 2008

Topic models and a revisit of text-related applications

Viet Ha-Thuc; Padmini Srinivasan

Topic models such as the aspect model and LDA have been shown to be a promising approach to text modeling. Unlike many earlier models that restrict each document to a single topic, topic models support the important idea that each document may be relevant to multiple topics, which makes them significantly more expressive in modeling text documents. However, we observe two limitations in topic models. The first is scalability: it is extremely expensive to run the models on large corpora. The second is the inability to model the key concept of relevance, which prevents the models from being directly applied to goals such as text classification and relevance feedback for query modification, where items relevant to topics (classes and queries) are provided upfront. The first aim of this paper is to sketch solutions to these limitations. To alleviate the scalability problem, we introduce a one-scan topic model requiring only a single pass over a corpus for inference. To overcome the second limitation, we propose relevance-based topic models that retain the advantages of previous models while taking the concept of relevance into account. The second aim, based on the proposed models, is to revisit a wide range of well-known but still open text-related tasks and to outline our vision of how approaches to these tasks could be improved by topic models.
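
The expressiveness claim rests on the standard mixed-membership decomposition shared by the aspect model and LDA; a minimal sketch in our own notation (not taken from the paper):

    p(w \mid d) = \sum_{z=1}^{K} p(w \mid z)\, p(z \mid d)

Each document d carries its own topic distribution p(z | d), so its words may be drawn from several of the K topics; single-topic models are the special case where p(z | d) puts all of its mass on one topic.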


Web Search and Data Mining (WSDM) | 2011

Large-scale hierarchical text classification without labelled data

Viet Ha-Thuc; Jean-Michel Renders

Traditional machine learning approaches to text classification typically require labelled data for training classifiers. However, when applied to large-scale classification involving thousands of categories, creating such labelled data is extremely expensive, since the data is typically labelled manually by humans. Motivated by this, we propose a novel approach to large-scale hierarchical text classification that does not require any labelled data. We explore a perspective in which the meaning of a category is defined not by human-labelled documents but by its description and, more importantly, its relationships with other categories (e.g., its ancestors and descendants). Specifically, we take advantage of this ontological knowledge in all phases of the process: when retrieving pseudo-labelled documents, when iteratively training the category models, and when categorizing test documents. Our experiments, based on a taxonomy containing 1,131 categories that is widely adopted in the news industry as a standard for the NewsML framework, demonstrate the effectiveness of our approach in each of these phases, both qualitatively and quantitatively. In particular, using only the simple ontological knowledge defined in the category hierarchy, we can automatically build a large-scale hierarchical classifier with a reasonable performance of 67% in terms of the hierarchy-based F1 measure.
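
The abstract outlines a bootstrapping loop driven purely by the ontology. A minimal Python sketch of that control flow; the types and helpers here (Category, search, fit_models, top_documents) are hypothetical stand-ins, not the authors' code:

    # Label-free hierarchical classification, sketched from the abstract:
    # a category's meaning comes from its description and its neighbours
    # in the hierarchy, which seed pseudo-labels that are then refined.
    from dataclasses import dataclass, field

    @dataclass
    class Category:
        name: str
        description: str
        parent: "Category | None" = None
        children: list["Category"] = field(default_factory=list)

    def seed_query(cat: Category) -> str:
        # Build a retrieval query from the category itself plus its
        # ancestors and descendants (hypothetical use of the ontology).
        parts = [cat.name, cat.description]
        if cat.parent is not None:
            parts.append(cat.parent.name)
        parts.extend(child.name for child in cat.children)
        return " ".join(parts)

    def train_without_labels(categories, corpus, search, fit_models, n_iters=5):
        # Phase 1: retrieve pseudo-labelled documents per category.
        pseudo = {c.name: search(corpus, seed_query(c)) for c in categories}
        models = None
        for _ in range(n_iters):
            # Phase 2: iteratively (re)train category models on pseudo-labels.
            models = fit_models(pseudo)
            # Re-label the corpus with the current models before the
            # next round, keeping assignments tied to the hierarchy.
            pseudo = {c.name: models.top_documents(corpus, c) for c in categories}
        return models  # Phase 3: use the models to categorize test documents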


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2009

A relevance-based topic model for news event tracking

Viet Ha-Thuc; Yelena Mejova; Christopher G. Harris; Padmini Srinivasan

Event tracking is the task of discovering temporal patterns of popular events in text streams. Existing approaches to event tracking have two limitations: poor scalability and an inability to rule out non-relevant portions of text streams. In this study, we propose a novel approach that tackles both limitations. To demonstrate the approach, we track news events across a collection of weblogs spanning a two-month time period.


IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies (RIVF) | 2008

A quality-threshold data summarization algorithm

Viet Ha-Thuc; Duc-Cuong Nguyen; Padmini Srinivasan

As database sizes increase, semantic data summarization techniques have been developed so that data mining algorithms can be run on the summarized set for the sake of efficiency. Clustering algorithms such as K-Means have been widely used as semantic summarization methods, with the cluster centers serving as the summarized set. The goal of semantic summarization is to provide a summarized view of the original dataset such that the summarization ratio is maximized while the error (i.e., information loss) is minimized. This paper presents a new clustering-based data summarization algorithm in which the quality of the summarized set can be controlled. The algorithm partitions a dataset into clusters until the distortion of each cluster is below a given threshold, thus guaranteeing that the summarized set incurs at most a fixed amount of information loss. Based on the threshold, the number of clusters is determined automatically. Unlike traditional K-Means, the proposed algorithm adjusts the initial centers based on the information about the data space discovered so far, significantly alleviating the local-optimum effect. Our experiments show that our algorithm generates higher-quality clusters than K-Means does and that it guarantees an error bound, an essential criterion for data summarization.
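
The threshold-driven control flow is concrete enough to sketch. A minimal Python version, assuming a bisecting 2-means split; the authors' actual center-adjustment rule is not given in the abstract, so this only illustrates the quality-threshold loop:

    import numpy as np

    def distortion(points: np.ndarray) -> float:
        # Mean squared distance of a cluster's points to its centroid.
        center = points.mean(axis=0)
        return float(((points - center) ** 2).sum(axis=1).mean())

    def two_means(points: np.ndarray, n_iters: int = 20) -> list[np.ndarray]:
        # Plain 2-means, used here as the splitting step.
        rng = np.random.default_rng(0)
        centers = points[rng.choice(len(points), size=2, replace=False)].copy()
        for _ in range(n_iters):
            dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = dists.argmin(axis=1)
            for k in range(2):
                if (labels == k).any():
                    centers[k] = points[labels == k].mean(axis=0)
        return [points[labels == k] for k in range(2) if (labels == k).any()]

    def summarize(points: np.ndarray, threshold: float) -> np.ndarray:
        # Split clusters until each one's distortion is below the
        # threshold; the number of clusters falls out automatically,
        # and the centroids form the summarized set.
        points = np.asarray(points, dtype=float)
        queue, centers = [points], []
        while queue:
            cluster = queue.pop()
            if distortion(cluster) <= threshold or len(cluster) < 2:
                centers.append(cluster.mean(axis=0))
                continue
            parts = two_means(cluster)
            if len(parts) < 2:  # degenerate split: stop refining
                centers.append(cluster.mean(axis=0))
            else:
                queue.extend(parts)
        return np.vstack(centers)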


Joint WICOW/AIRWeb Workshop on Web Quality | 2011

Spam detection in online classified advertisements

Hung Tran; Thomas Hornbeck; Viet Ha-Thuc; James F. Cremer; Padmini Srinivasan

Online classified advertisements have become an essential part of the advertising market. Popular online classified advertisement sites such as Craigslist, eBay Classifieds, and Oodle attract huge numbers of posts and visits. Because of its high commercial potential, the online classified advertisement domain is a target for spammers, and spam has become one of the biggest issues hindering the further development of online advertising. Spam detection in online advertisements is therefore a crucial problem. However, previous approaches to Web spam detection in other domains do not work well in the advertisement domain. We propose a novel spam detection approach that takes into account the particular characteristics of this domain. Specifically, we propose a novel set of features that strongly discriminate between spam and legitimate advertisement posts. Our experiments on a dataset derived from Craigslist advertisements demonstrate the effectiveness of our approach. In particular, the approach improves the F1 score by 55% over a baseline that uses traditional features alone.
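
The abstract does not list the proposed features, so the ones below are hypothetical stand-ins for domain-specific signals; only the overall shape of the pipeline (hand-crafted features feeding a supervised classifier) is taken from the text:

    import re
    from sklearn.linear_model import LogisticRegression

    def ad_features(post: dict) -> list[float]:
        # Hypothetical ad-specific signals, for illustration only.
        text = post["title"] + " " + post["body"]
        return [
            float(len(text)),                                      # post length
            float(text.count("$")),                                # price mentions
            float(len(re.findall(r"https?://", text))),            # embedded links
            float(bool(re.search(r"\d{3}[-.\s]?\d{4}", text))),    # phone number
            sum(ch.isupper() for ch in text) / max(len(text), 1),  # shouting ratio
        ]

    def train(posts: list[dict], labels: list[int]) -> LogisticRegression:
        X = [ad_features(p) for p in posts]
        return LogisticRegression(max_iter=1000).fit(X, labels)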


Asia Information Retrieval Symposium (AIRS) | 2009

A Latent Dirichlet Framework for Relevance Modeling

Viet Ha-Thuc; Padmini Srinivasan

Relevance-based language models operate by estimating the probabilities of observing words in documents relevant (or pseudo-relevant) to a topic. However, these models assume that if a document is relevant to a topic, then all tokens in the document are relevant to that topic. This can limit model robustness and effectiveness. In this study, we propose a Latent Dirichlet relevance model that relaxes this assumption. Our approach derives from current research on Latent Dirichlet Allocation (LDA) topic models. LDA has been extensively explored, especially for discovering a set of topics in a corpus. LDA itself, however, has a limitation that is also addressed in our work: the topics LDA generates from a corpus are synthetic, i.e., they do not necessarily correspond to topics identified by humans for the same corpus. In contrast, our model explicitly considers the relevance relationships between documents and given topics (queries). Thus, unlike standard LDA, our model is directly applicable to goals such as relevance feedback for query modification and text classification, where topics (classes and queries) are provided upfront. Although the focus of our paper is on improving relevance-based language models, in effect our approach bridges relevance-based language models and LDA, addressing limitations of both.
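
For context, classic relevance-based language models estimate the relevance model from (pseudo-)relevant documents D_R roughly as

    p(w \mid R) \approx \sum_{d \in D_R} p(w \mid d)\, p(d \mid R)

which implicitly treats every token of a relevant document as drawn from the relevance topic. A hedged sketch of the relaxation described here, in our own notation rather than the paper's exact formulation: give each token a latent assignment z, so a relevant document mixes the relevance topic with background topics,

    p(w \mid d) = p(z = \mathrm{rel} \mid d)\, p(w \mid R) + \sum_{k=1}^{K} p(z = k \mid d)\, p(w \mid \theta_k)

and only tokens inferred to have z = rel contribute to the estimate of p(w | R).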


IEEE International Conference on Semantic Computing | 2010

News Event Modeling and Tracking in the Social Web with Ontological Guidance

Viet Ha-Thuc; Yelena Mejova; Christopher G. Harris; Padmini Srinivasan

News event modeling and tracking in the social web is the task of discovering which news events individuals in social communities are most interested in, how much discussion these events generate, and how these discussions evolve over time. The task can provide informative summaries of what has happened in the real world, yield important knowledge about which events matter most from the crowd's perspective, and reveal their temporal evolutionary trends. Latent Dirichlet Allocation (LDA) has been used intensively for modeling and tracking events (or topics) in text streams. However, the event models discovered by this bottom-up approach have limitations, such as a lack of semantic correspondence to real-world events, and they do not scale well to large datasets. This paper proposes a novel latent Dirichlet framework for event modeling and tracking. Our approach uses ontological knowledge about events that exist in the real world to guide the modeling and tracking processes. Therefore, event models extracted from the social web by our approach are always meaningful and semantically match real-world events. Practically, our approach requires only a single scan over the dataset to model and track events, and hence scales well with dataset size.
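
A minimal Python sketch of single-scan, ontology-guided tracking; the seed-word anchoring and the overlap-based scoring rule are hypothetical illustrations of the idea (one pass, models anchored to known real-world events), not the authors' actual inference procedure:

    from collections import Counter, defaultdict

    def track_events(stream, event_seeds, min_overlap=2):
        # event_seeds: {event_name: set of seed words from the ontology}
        word_counts = {e: Counter() for e in event_seeds}  # event language models
        intensity = defaultdict(Counter)                   # event -> date -> count
        for doc in stream:                                 # single pass over the data
            tokens = set(doc["text"].lower().split())
            for event, seeds in event_seeds.items():
                if len(tokens & seeds) >= min_overlap:     # ontology anchor
                    word_counts[event].update(tokens)      # refine the event model
                    intensity[event][doc["date"]] += 1     # track discussion volume
        return word_counts, intensity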


ICWSM Workshop | 2009

Event intensity tracking in weblog collections

Viet Ha-Thuc; Yelena Mejova; Christopher G. Harris; Padmini Srinivasan


Text Retrieval Conference (TREC) | 2008

University of Iowa at TREC 2008 Legal and Relevance Feedback Tracks

Brian Almquist; Yelena Mejova; Viet Ha-Thuc; Padmini Srinivasan


Text Retrieval Conference (TREC) | 2007

Exploring the Legal Discovery and Enterprise Tracks at the University of Iowa

Brian Almquist; Viet Ha-Thuc; Aditya Kumar Sehgal; Robert Arens; Padmini Srinivasan

Collaboration


Dive into Viet Ha-Thuc's collaborations.

Top Co-Authors

Yelena Mejova

Qatar Computing Research Institute
