Aijun An | Researchain

Publication

Featured research published by Aijun An.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007

ARSA: a sentiment-aware model for predicting sales performance using blogs

Yang Liu; Xiangji Huang; Aijun An; Xiaohui Yu

Due to their high popularity, weblogs (or blogs for short) present a wealth of information that can be very helpful in assessing the general public's sentiments and opinions. In this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. Based on an analysis of the complex nature of sentiments, we propose Sentiment PLSA (S-PLSA), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. Training an S-PLSA model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. We then present ARSA, an autoregressive sentiment-aware model, to utilize the sentiment information captured by S-PLSA for predicting product sales performance. Extensive experiments were conducted on a movie data set. We compare ARSA with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. Experiments confirm the effectiveness and superiority of the proposed approach.

International Conference on Data Mining | 2008

Modeling and Predicting the Helpfulness of Online Reviews

Yang Liu; Xiangji Huang; Aijun An; Xiaohui Yu

Online reviews provide a valuable resource for potential customers to make purchase decisions. However, the sheer volume of available reviews as well as the large variations in review quality present a big impediment to the effective use of the reviews, as the most helpful reviews may be buried among a large number of low-quality reviews. The goal of this paper is to develop models and algorithms for predicting the helpfulness of reviews, which provides the basis for discovering the most helpful reviews for given products. We first show that the helpfulness of a review depends on three important factors: the reviewer's expertise, the writing style of the review, and the timeliness of the review. Based on the analysis of those factors, we present a nonlinear regression model for helpfulness prediction. Our empirical study on the IMDB movie reviews dataset demonstrates that the proposed approach is highly effective.

Engineering Applications of Artificial Intelligence | 1996

Discovering rules for water demand prediction: An enhanced rough-set approach

Aijun An; Ning Shan; Christine W. Chan; Nick Cercone; Wojciech Ziarko

Prediction of consumer demands is a prerequisite for optimal control of water distribution systems because minimum-cost pumping schedules can be computed if water demands are accurately estimated. This paper presents an enhanced rough-set method for generating prediction rules from a set of observed data. The proposed method extends the standard rough set model by making use of the statistical information inherent in the data to handle incomplete and ambiguous training samples. It also discusses some experimental results from using this method for discovering knowledge on water demand prediction.

IEEE Transactions on Knowledge and Data Engineering | 2012

Mining Online Reviews for Predicting Sales Performance: A Case Study in the Movie Domain

Xiaohui Yu; Yang Liu; Xiangji Huang; Aijun An

Posting reviews online has become an increasingly popular way for people to express opinions and sentiments toward the products bought or services received. Analyzing the large volume of online reviews available would produce useful actionable knowledge that could be of economic value to vendors and other interested parties. In this paper, we conduct a case study in the movie domain, and tackle the problem of mining reviews for predicting product sales performance. Our analysis shows that both the sentiments expressed in the reviews and the quality of the reviews have a significant impact on the future sales performance of the products in question. For the sentiment factor, we propose Sentiment PLSA (S-PLSA), in which a review is considered as a document generated by a number of hidden sentiment factors, in order to capture the complex nature of sentiments. Training an S-PLSA model enables us to obtain a succinct summary of the sentiment information embedded in the reviews. Based on S-PLSA, we propose ARSA, an Autoregressive Sentiment-Aware model for sales prediction. We then seek to further improve the accuracy of prediction by considering the quality factor, with a focus on predicting the quality of a review in the absence of user-supplied indicators, and present ARSQA, an Autoregressive Sentiment and Quality Aware model, to utilize sentiments and quality for predicting product sales performance. Extensive experiments conducted on a large movie data set confirm the effectiveness of the proposed approach.

Conference on Information and Knowledge Management | 2011

Discovering top-k teams of experts with/without a leader in social networks

Mehdi Kargar; Aijun An

We study the problem of discovering a team of experts from a social network. Given a project whose completion requires a set of skills, our goal is to find a set of experts that together have all of the required skills and also have the minimal communication cost among them. We propose two communication cost functions designed for two types of communication structures. We show that the problem of finding the team of experts that minimizes one of the proposed cost functions is NP-hard. Thus, an approximation algorithm with an approximation ratio of two is designed. We introduce the problem of finding a team of experts with a leader. The leader is responsible for monitoring and coordinating the project, and thus a different communication cost function is used in this problem. To solve this problem, an exact polynomial algorithm is proposed. We show that the total number of teams may be exponential with respect to the number of required skills. Thus, two procedures that produce top-k teams of experts with or without a leader in polynomial delay are proposed. Extensive experiments on real datasets demonstrate the effectiveness and scalability of the proposed methods.

Information Processing and Management | 2011

Combining integrated sampling with SVM ensembles for learning from imbalanced datasets

Yang Liu; Xiaohui Yu; Jimmy Xiangji Huang; Aijun An

Learning from imbalanced datasets is difficult. The insufficient information associated with the minority class impedes a clear understanding of the inherent structure of the dataset. Most existing classification methods tend not to perform well on minority class examples when the dataset is extremely imbalanced, because they aim to optimize the overall accuracy without considering the relative distribution of each class. In this paper, we study the performance of SVMs, which have gained great success in many real applications, in the imbalanced data context. Through empirical analysis, we show that SVMs may suffer from biased decision boundaries, and that their prediction performance drops dramatically when the data is highly skewed. We propose to combine an integrated sampling technique, which incorporates both over-sampling and under-sampling, with an ensemble of SVMs to improve the prediction performance. Extensive experiments show that our method outperforms individual SVMs as well as several other state-of-the-art classifiers.

Very Large Data Bases | 2011
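As a rough illustration of what combining over-sampling and under-sampling can look like, the sketch below balances two classes by randomly dropping majority examples and randomly duplicating minority examples. This is only one simple way to do it: the abstract does not describe the paper's actual sampling procedure, and the function name `integrated_sample`, the `target_size` parameter, and the use of plain random duplication are all assumptions made for this example.

```python
import random


def integrated_sample(majority, minority, target_size, seed=0):
    """Balance two classes to `target_size` examples each (illustrative only).

    Assumes len(minority) <= target_size <= len(majority).
    """
    if not (len(minority) <= target_size <= len(majority)):
        raise ValueError("target_size must lie between the two class sizes")
    rng = random.Random(seed)
    # Under-sampling: keep a random subset of the majority class.
    under = rng.sample(list(majority), target_size)
    # Over-sampling: keep every minority example, then add random duplicates.
    extra = [rng.choice(list(minority)) for _ in range(target_size - len(minority))]
    over = list(minority) + extra
    return under, over
```

Each balanced pair produced this way (for instance with different seeds) could then serve as the training set for one member of a classifier ensemble.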

Keyword search in graphs: finding r-cliques

Mehdi Kargar; Aijun An

Keyword search over a graph finds a substructure of the graph containing all or some of the input keywords. Most previous methods in this area find connected minimal trees that cover all the query keywords. Recently, it has been shown that finding subgraphs rather than trees can be more useful and informative for the users. However, the current tree or graph based methods may produce answers in which some content nodes (i.e., nodes that contain input keywords) are not very close to each other. In addition, when searching for answers, these methods may explore the whole graph rather than only the content nodes. This may lead to poor performance in execution time. To address the above problems, we propose the problem of finding r-cliques in graphs. An r-clique is a group of content nodes that cover all the input keywords and in which the distance between every pair of nodes is less than or equal to r. An exact algorithm is proposed that finds all r-cliques in the input graph. In addition, an approximation algorithm that produces r-cliques with 2-approximation in polynomial delay is proposed. Extensive performance studies using two large real data sets confirm the efficiency and accuracy of finding r-cliques in graphs.

Computational Intelligence | 2001

Rule Quality Measures for Rule Induction Systems: Description and Evaluation
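The r-clique definition in this abstract can be checked directly once pairwise distances are known. The sketch below is a membership test only, not the exact or approximation algorithm from the paper; the function name `is_r_clique` and the input layout (a keyword set per node, and shortest-path distances keyed by unordered node pairs) are assumptions chosen for illustration.

```python
from itertools import combinations


def is_r_clique(nodes, node_keywords, dist, query_keywords, r):
    """Check whether `nodes` form an r-clique for `query_keywords`.

    node_keywords: dict mapping node -> set of keywords the node contains.
    dist: dict mapping frozenset({u, v}) -> precomputed shortest-path distance.
    """
    query = set(query_keywords)
    # Condition 1: together, the nodes cover every input keyword.
    covered = set()
    for n in nodes:
        covered |= node_keywords[n] & query
    if covered != query:
        return False
    # Condition 2: every pair of nodes is within distance r.
    return all(dist[frozenset((u, v))] <= r for u, v in combinations(nodes, 2))
```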

Aijun An; Nick Cercone

A rule quality measure is important to a rule induction system for determining when to stop generalization or specialization. Such measures are also important to a rule‐based classification procedure for resolving conflicts among rules. We describe a number of statistical and empirical rule quality formulas and present an experimental comparison of these formulas on a number of standard machine learning datasets. We also present a meta‐learning method for generating a set of formula‐behavior rules from the experimental results which show the relationships between a formula's performance and the characteristics of a dataset. These formula‐behavior rules are combined into formula‐selection rules that can be used in a rule induction system to select a rule quality formula before rule induction. We report the experimental results showing the effects of formula‐selection on the predictive performance of a rule induction system.

Journal of the Association for Information Science and Technology | 2004

Dynamic web log session identification with statistical language models

Xiangji Huang; Fuchun Peng; Aijun An; Dale Schuurmans

We present a novel session identification method based on statistical language modeling. Unlike standard time-out methods, which use fixed time thresholds for session identification, we use an information theoretic approach that yields more robust results for identifying session boundaries. We evaluate our new approach by learning interesting association rules from the segmented session files. We then compare the performance of our approach to three standard session identification methods (the standard timeout method, the reference length method, and the maximal forward reference method) and find that our statistical language modeling approach generally yields superior results. However, as with every method, the performance of our technique varies with changing parameter settings. Therefore, we also analyze the influence of the two key factors in our language-modeling-based approach: the choice of smoothing technique and the language model order. We find that all standard smoothing techniques, save one, perform well, and that performance is robust to language model order.
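For context, the fixed-threshold time-out baseline that this abstract contrasts with can be sketched in a few lines. This shows the baseline only, not the language-model method the paper proposes; the function name `split_sessions` and the 1800-second default threshold are assumptions for the example.

```python
def split_sessions(timestamps, timeout_seconds=1800):
    """Group request timestamps (in seconds) into sessions.

    A new session starts whenever the gap since the previous request
    exceeds `timeout_seconds`.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= timeout_seconds:
            sessions[-1].append(t)  # gap is small: same session
        else:
            sessions.append([t])  # first request or gap too large: new session
    return sessions
```

Because the boundary depends only on a fixed gap, a long pause inside one task splits it in two, while two unrelated tasks performed back to back are merged; this is the rigidity the abstract refers to.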

IEEE Transactions on Knowledge and Data Engineering | 1999

Rule-induction and case-based reasoning: hybrid architectures appear advantageous

Nick Cercone; Aijun An; Christine W. Chan

Researchers have embraced a variety of machine learning (ML) techniques in their efforts to improve the quality of learning programs. The recent evolution of hybrid architectures for machine learning systems has resulted in several approaches that combine rule induction methods with case-based reasoning techniques to engender performance improvements over more traditional single-representation architectures. We briefly survey several major rule-induction and case-based reasoning ML systems. We then examine some interesting hybrid combinations of these systems and explain their strengths and weaknesses as learning systems. We present a balanced approach to constructing a hybrid architecture, along with arguments in favor of this balance and mechanisms for achieving a proper balance. Finally, we present some initial empirical results from testing our ideas and draw some conclusions based on those results.
