Network


Kuiyu Chang's latest external collaborations at the country level.

Hotspot


Dive into the research topics where Kuiyu Chang is active.

Publication


Featured research published by Kuiyu Chang.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007

Analyzing feature trajectories for event detection

Qi He; Kuiyu Chang; Ee-Peng Lim

We consider the problem of analyzing word trajectories in both time and frequency domains, with the specific goal of identifying important and less-reported, periodic and aperiodic words. A set of words with identical trends can be grouped together to reconstruct an event in a completely unsupervised manner. The document frequency of each word across time is treated like a time series, where each element is the document frequency-inverse document frequency (DFIDF) score at one time point. In this paper, we 1) first applied spectral analysis to categorize features for different event characteristics: important and less-reported, periodic and aperiodic; 2) modeled aperiodic features with Gaussian density and periodic features with Gaussian mixture densities, and subsequently detected each feature's burst by the truncated Gaussian approach; 3) proposed an unsupervised greedy event detection algorithm to detect both aperiodic and periodic events. All of the above methods can be applied to time series data in general. We extensively evaluated our methods on the 1-year Reuters News Corpus [3] and showed that they were able to uncover meaningful aperiodic and periodic events.
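
The DFIDF and spectral analysis steps can be illustrated concretely. Below is a minimal Python sketch, assuming daily document counts are available; the DFIDF formula and the periodogram-based dominant power spectrum and dominant period follow the description above, but the exact normalization and the thresholds the paper uses to separate important, less-reported, periodic, and aperiodic features may differ.

```python
import numpy as np

def dfidf_series(doc_freq_per_day, total_docs_per_day):
    """DFIDF(t) = DF(t)/N(t) * log(sum(N)/sum(DF)) -- one value per time point.
    A simplified reading of the DFIDF score described above."""
    df = np.asarray(doc_freq_per_day, dtype=float)
    n = np.asarray(total_docs_per_day, dtype=float)
    idf = np.log(n.sum() / max(df.sum(), 1.0))
    return (df / np.maximum(n, 1.0)) * idf

def spectral_signature(series):
    """Dominant power spectrum (strength) and dominant period (periodicity)
    from the periodogram of the DFIDF time series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                      # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0)
    k = int(np.argmax(power[1:]) + 1)     # skip the zero-frequency bin
    return power[k], 1.0 / freqs[k]       # (dominant power, dominant period)

# Toy example: a word that peaks once (aperiodic) in a 365-day stream.
df = np.zeros(365); df[100:108] = [5, 40, 80, 60, 30, 10, 4, 1]
n = np.full(365, 1000.0)
dps, period = spectral_signature(dfidf_series(df, n))
print(f"DPS={dps:.3f}, dominant period={period:.1f} days")
# Roughly: a high DPS points to an important feature, and a dominant period
# spanning most of the stream points to an aperiodic (one-off) feature.
```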


SIAM International Conference on Data Mining | 2007

Bursty feature representation for clustering text streams

Qi He; Kuiyu Chang; Ee-Peng Lim; Jun Zhang

Text representation plays a crucial role in classical text mining, where the primary focus was on static text. Nevertheless, well-studied static text representations, including TFIDF, are not optimized for non-stationary streams of information such as news, discussion board messages, and blogs. We therefore introduce a new temporal representation for text streams based on bursty features. Our bursty text representation differs significantly from traditional schemes in that it 1) dynamically represents documents over time, 2) amplifies a feature in proportion to its burstiness at any point in time, and 3) is topic independent. Our bursty text representation model was evaluated against a classical bag-of-words text representation on the task of clustering TDT3 topical text streams. It was shown to consistently yield more cohesive clusters in terms of cluster purity and cluster/class entropies. This new temporal bursty text representation can be extended to most text mining tasks involving a temporal dimension, such as modeling of online blog pages.
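
The abstract does not spell out the exact weighting formula, so the sketch below is only a hypothetical illustration of the core idea: a term's weight in a document is amplified in proportion to its burstiness at the document's publication time. The `burstiness` lookup is an assumed precomputed input (e.g., from a burst detection model), not something the paper defines this way.

```python
from collections import Counter

def bursty_vector(doc_tokens, pub_time, burstiness, base_weight=1.0):
    """Hypothetical bursty representation: each term's count is amplified in
    proportion to its burstiness score at the document's publication time.
    Terms that are not bursty at time t keep the base weight."""
    counts = Counter(doc_tokens)
    return {
        term: tf * (base_weight + burstiness.get((term, pub_time), 0.0))
        for term, tf in counts.items()
    }

# Toy usage: "election" is bursty on day 42, so it dominates the vector.
burst = {("election", 42): 3.0}
doc = ["election", "results", "election", "night"]
print(bursty_vector(doc, pub_time=42, burstiness=burst))
# {'election': 8.0, 'results': 1.0, 'night': 1.0}
```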


IEEE Transactions on Knowledge and Data Engineering | 2014

Identifying Features in Opinion Mining via Intrinsic and Extrinsic Domain Relevance

Zhen Hai; Kuiyu Chang; Jung-jae Kim; Christopher C. Yang

The vast majority of existing approaches to opinion feature extraction rely on mining patterns only from a single review corpus, ignoring the nontrivial disparities in word distributional characteristics of opinion features across different corpora. In this paper, we propose a novel method to identify opinion features from online reviews by exploiting the difference in opinion feature statistics across two corpora, one domain-specific corpus (i.e., the given review corpus) and one domain-independent corpus (i.e., the contrasting corpus). We capture this disparity via a measure called domain relevance (DR), which characterizes the relevance of a term to a text collection. We first extract a list of candidate opinion features from the domain review corpus by defining a set of syntactic dependence rules. For each extracted candidate feature, we then estimate its intrinsic-domain relevance (IDR) and extrinsic-domain relevance (EDR) scores on the domain-dependent and domain-independent corpora, respectively. Candidate features that are less generic (EDR score less than a threshold) and more domain-specific (IDR score greater than another threshold) are then confirmed as opinion features. We call this interval thresholding approach the intrinsic and extrinsic domain relevance (IEDR) criterion. Experimental results on two real-world review domains show the proposed IEDR approach to outperform several other well-established methods in identifying opinion features.
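
The interval thresholding step lends itself to a short sketch. The code below assumes IDR and EDR scores have already been computed for each candidate feature; the feature names, scores, and threshold values are made up for illustration.

```python
def select_opinion_features(candidates, idr_scores, edr_scores,
                            idr_threshold, edr_threshold):
    """IEDR-style interval thresholding as described above: keep candidate
    features that are domain-specific (IDR above a threshold) and not generic
    (EDR below another threshold). Threshold values are tuned per domain."""
    return [
        f for f in candidates
        if idr_scores.get(f, 0.0) > idr_threshold
        and edr_scores.get(f, float("inf")) < edr_threshold
    ]

# Toy usage with made-up scores for a camera-review domain.
candidates = ["battery life", "thing", "lens", "day"]
idr = {"battery life": 0.9, "thing": 0.2, "lens": 0.8, "day": 0.1}
edr = {"battery life": 0.1, "thing": 0.9, "lens": 0.2, "day": 0.95}
print(select_opinion_features(candidates, idr, edr,
                              idr_threshold=0.5, edr_threshold=0.4))
# ['battery life', 'lens']
```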


International Conference on Computational Linguistics | 2011

Implicit feature identification via co-occurrence association rule mining

Zhen Hai; Kuiyu Chang; Jung-jae Kim

In sentiment analysis, identifying features associated with an opinion can help produce a finer-grained understanding of online reviews. The vast majority of existing approaches focus on explicit feature identification; few attempts have been made to identify implicit features in reviews. In this paper, we propose a novel two-phase co-occurrence association rule mining approach to identifying implicit features. Specifically, in the first phase of rule generation, for each opinion word occurring in an explicit sentence in the corpus, we mine a significant set of association rules of the form [opinion-word, explicit-feature] from a co-occurrence matrix. In the second phase of rule application, we first cluster the rule consequents (explicit features) to generate more robust rules for each opinion word mentioned above. Given a new opinion word with no explicit feature, we then search the matched list of robust rules, among which the rule having the feature cluster with the highest frequency weight is fired, and accordingly, we assign the representative word of that cluster as the final identified implicit feature. Experimental results show considerable improvements of our approach over other related methods, including baseline dictionary lookups, statistical semantic association models, and bipartite reinforcement clustering.
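
A stripped-down version of the two-phase idea can be sketched as follows; it keeps the co-occurrence counting and rule firing but omits the consequent-clustering step described above, and all sentence data are toy examples.

```python
from collections import defaultdict

def mine_rules(explicit_sentences, min_count=2):
    """Phase 1 (simplified): count co-occurrences of (opinion word, explicit
    feature) in explicit sentences and keep frequent pairs as rules
    opinion_word -> feature.  Consequent clustering is omitted for brevity."""
    cooc = defaultdict(int)
    for opinion_words, features in explicit_sentences:
        for o in opinion_words:
            for f in features:
                cooc[(o, f)] += 1
    rules = defaultdict(dict)
    for (o, f), c in cooc.items():
        if c >= min_count:
            rules[o][f] = c
    return rules

def identify_implicit_feature(opinion_word, rules):
    """Phase 2 (simplified): for a sentence with an opinion word but no
    explicit feature, fire the matching rule with the highest weight."""
    consequents = rules.get(opinion_word)
    if not consequents:
        return None
    return max(consequents, key=consequents.get)

# Toy corpus of explicit sentences: (opinion words, explicit features).
corpus = [
    (["expensive"], ["price"]),
    (["expensive"], ["price"]),
    (["heavy"], ["weight"]),
    (["heavy"], ["weight"]),
    (["heavy"], ["battery"]),
]
rules = mine_rules(corpus)
print(identify_implicit_feature("expensive", rules))  # -> 'price'
print(identify_implicit_feature("heavy", rules))      # -> 'weight'
```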


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2010

Keep It Simple with Time: A Reexamination of Probabilistic Topic Detection Models

Qi He; Kuiyu Chang; Ee-Peng Lim; Arindam Banerjee

Topic detection (TD) is a fundamental research issue in the Topic Detection and Tracking (TDT) community with practical implications; TD helps analysts separate the wheat from the chaff among thousands of incoming news streams. In this paper, we propose a simple and effective topic detection model called the temporal Discriminative Probabilistic Model (DPM), which is shown to be theoretically equivalent to the classic vector space model with feature selection and temporally discriminative weights. We compare DPM to its various probabilistic cousins, ranging from mixture models like von Mises-Fisher (vMF) to mixed membership models like Latent Dirichlet Allocation (LDA). Benchmark results on the TDT3 data set show that sophisticated models such as vMF and LDA do not necessarily lead to better results; in the case of LDA, notably worse performance was obtained under variational inference, likely due to the very large number of LDA model parameters involved in document-level topic detection. On the contrary, a relatively simple time-aware probabilistic model such as DPM suffices for both offline and online topic detection tasks, making DPM a theoretically elegant and effective model for practical topic detection.
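
The abstract only states that DPM is equivalent to a vector space model with temporally discriminative weights, without giving the weight formula. The sketch below shows one plausible form of such a weight (a term's document frequency in the current time window contrasted against the whole stream); the function name and formula are illustrative assumptions, not the paper's actual DPM formulation.

```python
import math

def temporally_discriminative_weight(term, window_docs, all_docs, eps=1e-9):
    """One plausible temporally discriminative weight (not the exact DPM
    formula): how much more often a term appears in the current time window
    than in the stream as a whole.  Documents are sets of terms."""
    df_window = sum(term in d for d in window_docs) / max(len(window_docs), 1)
    df_all = sum(term in d for d in all_docs) / max(len(all_docs), 1)
    return math.log((df_window + eps) / (df_all + eps))

# Toy usage: "earthquake" spikes in today's window relative to the full stream.
window = [{"earthquake", "rescue"}, {"earthquake", "aid"}, {"sports"}]
stream = window + [{"sports"}, {"election"}, {"weather"}] * 20
print(round(temporally_discriminative_weight("earthquake", window, stream), 3))
```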


International Conference on Data Mining | 2007

Using Burstiness to Improve Clustering of Topics in News Streams

Qi He; Kuiyu Chang; Ee-Peng Lim

Specialists who analyze online news have a hard time separating the wheat from the chaff. Moreover, automatic data-mining techniques like clustering of news streams into topical groups can fully recover the underlying true class labels of the data if and only if all classes are well separated. In reality, especially for news streams, this is clearly not the case. The question to ask is thus: if we cannot recover the full C classes by clustering, what is the largest number K < C of clusters we can find that best resemble the K underlying classes? Using the intuition that bursty topics are more likely to correspond to important events that are of interest to analysts, we propose several new bursty vector space models (B-VSM) for representing a news document. B-VSM takes into account the burstiness (across the full corpus and whole duration) of each constituent word in a document at the time of publication. We benchmarked our B-VSM against the classical TFIDF-VSM on the task of clustering a collection of news stream articles with known topic labels. Experimental results show that B-VSM was able to find the burstiest clusters/topics. Further, it also significantly improved the recall and precision for the top K clusters/topics.
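
Scoring the top K clusters against known topic labels, as done above, can be sketched with a simple majority-class mapping; this is a common evaluation protocol, and the exact matching criterion used in the paper may differ.

```python
from collections import Counter

def cluster_precision_recall(cluster_labels, true_labels):
    """Map each discovered cluster to its majority class, then report
    precision = majority count / cluster size and
    recall = majority count / class size for that cluster."""
    class_sizes = Counter(true_labels)
    clusters = {}
    for c, y in zip(cluster_labels, true_labels):
        clusters.setdefault(c, []).append(y)
    scores = {}
    for c, members in clusters.items():
        majority_class, hits = Counter(members).most_common(1)[0]
        scores[c] = {
            "class": majority_class,
            "precision": hits / len(members),
            "recall": hits / class_sizes[majority_class],
        }
    return scores

# Toy usage: 2 clusters over 6 documents with 2 true topics.
print(cluster_precision_recall([0, 0, 0, 1, 1, 1],
                               ["quake", "quake", "vote",
                                "vote", "vote", "quake"]))
```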


Conference on Information and Knowledge Management | 2013

Uncovering collusive spammers in Chinese review websites

Chang Xu; Jie Zhang; Kuiyu Chang; Chong Long

With the rapid development of China's e-commerce in recent years and the ongoing evolution of adversarial spamming tactics, increasingly sophisticated spamming activities may be carried out on Chinese review websites. Empirical analysis of recently crawled product reviews from a popular Chinese e-commerce website reveals the failure of many state-of-the-art spam indicators at detecting collusive spammers. Two novel methods are then proposed: 1) a KNN-based method that considers the pairwise similarity of two reviewers based on their group-level relational information and selects the k most similar reviewers for voting; 2) a more general graph-based classification method that jointly classifies a set of reviewers based on their pairwise transaction correlations. Experimental results show that both of our methods promisingly outperform the indicator-only classifiers in various settings.
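
The KNN-based method can be sketched as majority voting over a precomputed pairwise reviewer similarity matrix; how that similarity is derived from group-level relational information is not reproduced here, and the matrix and labels below are toy values.

```python
import numpy as np

def knn_vote(similarity, labels, query_idx, k=5):
    """KNN-style classification of a reviewer from a precomputed pairwise
    similarity matrix: the k most similar labeled reviewers vote on whether
    the query reviewer is a collusive spammer (1) or genuine (0)."""
    sims = similarity[query_idx].copy()
    sims[query_idx] = -np.inf                 # exclude the reviewer itself
    labeled = [i for i in range(len(labels)) if labels[i] is not None]
    neighbors = sorted(labeled, key=lambda i: sims[i], reverse=True)[:k]
    votes = sum(labels[i] for i in neighbors)
    return int(votes * 2 > len(neighbors))    # majority vote

# Toy usage: 4 labeled reviewers, 1 unlabeled query (index 4).
sim = np.array([
    [1.0, 0.9, 0.1, 0.2, 0.8],
    [0.9, 1.0, 0.2, 0.1, 0.7],
    [0.1, 0.2, 1.0, 0.9, 0.2],
    [0.2, 0.1, 0.9, 1.0, 0.1],
    [0.8, 0.7, 0.2, 0.1, 1.0],
])
labels = [1, 1, 0, 0, None]                   # 1 = spammer, 0 = genuine
print(knn_vote(sim, labels, query_idx=4, k=3))  # -> 1
```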


International Conference on Data Mining | 2010

Micro-blogging Sentiment Detection by Collaborative Online Learning

Guangxia Li; Steven C. H. Hoi; Kuiyu Chang; Ramesh Jain

We study the online micro-blog sentiment detection problem, which aims to determine whether a micro-blog post expresses emotions. This problem is challenging because a micro-blog post is very short and individuals have distinct ways of expressing emotions. A single classification model trained on the entire corpus may fail to capture characteristics unique to each user. On the other hand, a personalized model for each user may be inaccurate due to the scarcity of training data, especially at the very beginning, when users have posted only a few entries. To overcome these challenges, we propose learning a global model over all micro-bloggers, which is then leveraged to continuously refine the individual models in a collaborative online learning manner. We evaluate our algorithm on a real-life micro-blog dataset collected from the popular micro-blogging site Twitter. Results show that our algorithm is effective and efficient for timely sentiment detection in real micro-blogging applications.
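
The global-plus-personal decomposition can be illustrated with a small online learner in which every user's model is a shared global weight vector plus a user-specific offset, both updated online. This is a generic SGD sketch of the idea under that assumption, not the specific collaborative update rule used in the paper.

```python
import numpy as np

class GlobalPersonalOnlineClassifier:
    """Sketch: each user's classifier is the shared global weight vector plus
    a per-user offset; both are updated online with a logistic-loss SGD step."""

    def __init__(self, dim, lr=0.1):
        self.w_global = np.zeros(dim)
        self.w_user = {}          # user id -> personal offset vector
        self.lr = lr
        self.dim = dim

    def _weights(self, user):
        return self.w_global + self.w_user.setdefault(user, np.zeros(self.dim))

    def predict(self, user, x):
        return 1 if self._weights(user) @ x > 0 else 0

    def update(self, user, x, y):
        """Online step: push both global and personal weights toward the
        correct label (y in {0, 1})."""
        p = 1.0 / (1.0 + np.exp(-(self._weights(user) @ x)))
        grad = (p - y) * x
        self.w_global -= self.lr * grad
        self.w_user[user] -= self.lr * grad

# Toy usage on 2-dimensional "posts" from two users.
clf = GlobalPersonalOnlineClassifier(dim=2)
stream = [("alice", np.array([1.0, 0.0]), 1), ("bob", np.array([0.0, 1.0]), 0)]
for user, x, y in stream * 20:
    clf.update(user, x, y)
print(clf.predict("alice", np.array([1.0, 0.0])))  # -> 1
```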


SIAM International Conference on Data Mining | 2010

Two-view Transductive Support Vector Machines

Guangxia Li; Steven C. H. Hoi; Kuiyu Chang

Obtaining high-quality and up-to-date labeled data can be difficult in many real-world machine learning applications, especially for Internet classification tasks like review spam detection, which changes at a very brisk pace. For some problems, there may exist multiple perspectives, so-called views, of each data sample. For example, in text classification, the typical view contains a large number of raw content features such as term frequency, while a second view may contain a small but highly informative set of domain-specific features. We thus propose a novel two-view transductive SVM that takes advantage of both the abundant amount of unlabeled data and their multiple representations to improve the performance of classifiers. The idea is fairly simple: train a classifier on each of the two views of both labeled and unlabeled data, and impose a global constraint that each classifier assigns the same class label to each labeled and unlabeled sample. We applied our two-view transductive SVM to the WebKB course dataset and a real-life review spam classification dataset. Experimental results show that our proposed approach performs up to 5% better than a single-view learning algorithm, especially when the amount of labeled data is small. The other advantage of our two-view approach is its significantly improved stability, which is especially useful for noisy real-world data.
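
A rough feel for the two-view consensus idea is given below: train one linear SVM per view on the labeled data, pseudo-label the unlabeled samples on which the two views agree, and retrain both views on the enlarged set. This co-training-style sketch (using scikit-learn's LinearSVC) approximates the consensus constraint but is not the paper's joint transductive optimization.

```python
import numpy as np
from sklearn.svm import LinearSVC

def two_view_pseudo_label(Xl_v1, Xl_v2, y, Xu_v1, Xu_v2):
    """Simplified two-view consensus: fit one linear SVM per view on labeled
    data, pseudo-label unlabeled samples where the two views agree, then
    refit both views on labeled + pseudo-labeled data.
    Assumes binary labels y in {0, 1}."""
    c1 = LinearSVC().fit(Xl_v1, y)
    c2 = LinearSVC().fit(Xl_v2, y)
    s1, s2 = c1.decision_function(Xu_v1), c2.decision_function(Xu_v2)
    agree = np.sign(s1) == np.sign(s2)          # consensus between the views
    pseudo = (s1[agree] > 0).astype(int)
    X1 = np.vstack([Xl_v1, Xu_v1[agree]])
    X2 = np.vstack([Xl_v2, Xu_v2[agree]])
    ya = np.concatenate([y, pseudo])
    return LinearSVC().fit(X1, ya), LinearSVC().fit(X2, ya)

# Toy usage with synthetic two-view data (view 2 is a noisy copy of view 1).
rng = np.random.default_rng(0)
Xl1 = rng.normal(size=(20, 5)); Xl1[:10, 0] += 2.0; Xl1[10:, 0] -= 2.0
Xl2 = Xl1 + rng.normal(scale=0.1, size=(20, 5))
y = np.array([1] * 10 + [0] * 10)
Xu1 = rng.normal(size=(100, 5)); Xu2 = Xu1 + rng.normal(scale=0.1, size=(100, 5))
m1, m2 = two_view_pseudo_label(Xl1, Xl2, y, Xu1, Xu2)
print(m1.predict(Xu1[:5]), m2.predict(Xu2[:5]))
```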


IEEE Transactions on Knowledge and Data Engineering | 2012

Multiview Semi-Supervised Learning with Consensus

Guangxia Li; Kuiyu Chang; Steven C. H. Hoi

Obtaining high-quality and up-to-date labeled data can be difficult in many real-world machine learning applications. Semi-supervised learning aims to improve the performance of a classifier trained with a limited amount of labeled data by utilizing the unlabeled data. This paper demonstrates a way to improve the transductive SVM, an existing semi-supervised learning algorithm, by employing a multiview learning paradigm. Multiview learning is based on the fact that, for some problems, there may exist multiple perspectives, so-called views, of each data sample. For example, in text classification, the typical view contains a large number of raw content features such as term frequency, while a second view may contain a small but highly informative set of domain-specific features. We propose a novel two-view transductive SVM that takes advantage of both the abundant amount of unlabeled data and their multiple representations to improve classification results. The idea is straightforward: train a classifier on each of the two views of both labeled and unlabeled data, and impose a global constraint requiring each classifier to assign the same class label to each labeled and unlabeled sample. We also incorporate manifold regularization, a graph-based semi-supervised learning method, into our framework. The proposed two-view transductive SVM was evaluated on both synthetic and real-life data sets. Experimental results show that our algorithm performs up to 10 percent better than a single-view learning approach, especially when the amount of labeled data is small. The other advantage of our two-view semi-supervised learning approach is its significantly improved stability, which is especially useful when dealing with noisy data in real-world applications.
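
The manifold regularization component mentioned above relies on a graph Laplacian built over labeled and unlabeled samples. The sketch below constructs a k-nearest-neighbor graph and evaluates the penalty f^T L f, which is small when similar samples receive similar outputs; the graph construction details (k, edge weighting) are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Graph Laplacian L = D - W for a symmetrized k-nearest-neighbor graph
    over all (labeled and unlabeled) samples."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(d2[i])[1:k + 1]            # skip self
        W[i, neighbors] = np.exp(-d2[i, neighbors])
    W = np.maximum(W, W.T)                                 # symmetrize
    return np.diag(W.sum(axis=1)) - W

def manifold_penalty(f, L):
    """Manifold regularization term f^T L f for a vector of model outputs f."""
    return float(f @ L @ f)

# Toy usage: outputs that are consistent within two well-separated clusters
# incur a much smaller penalty than random outputs.
X = np.vstack([np.random.default_rng(1).normal(size=(10, 2)),
               np.random.default_rng(2).normal(loc=5.0, size=(10, 2))])
L = knn_graph_laplacian(X, k=3)
f_smooth = np.array([0.0] * 10 + [1.0] * 10)
f_noisy = np.random.default_rng(3).normal(size=20)
print(round(manifold_penalty(f_smooth, L), 4),
      round(manifold_penalty(f_noisy, L), 4))
```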

Collaboration


Dive into Kuiyu Chang's collaborations.

Top Co-Authors

Ee-Peng Lim | Singapore Management University
Qi He | Nanyang Technological University
Aixin Sun | Nanyang Technological University
Guangxia Li | Nanyang Technological University
Siu Cheung Hui | Nanyang Technological University
Steven C. H. Hoi | Singapore Management University
Tam T. Nguyen | Nanyang Technological University
Anwitaman Datta | Nanyang Technological University
Zhen Hai | Nanyang Technological University
Wenting Liu | Nanyang Technological University