Anjie Fang | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Anjie Fang is active.

Explore More

Publication

Featured researches published by Anjie Fang.

international acm sigir conference on research and development in information retrieval | 2016

Using Word Embedding to Evaluate the Coherence of Topics from Twitter Data

Anjie Fang; Craig Macdonald; Iadh Ounis; Philip Habel

Scholars often seek to understand topics discussed on Twitter using topic modelling approaches. Several coherence metrics have been proposed for evaluating the coherence of the topics generated by these approaches, including the pre-calculated Pointwise Mutual Information (PMI) of word pairs and the Latent Semantic Analysis (LSA) word representation vectors. As Twitter data contains abbreviations and a number of peculiarities (e.g. hashtags), it can be challenging to train effective PMI data or LSA word representation. Recently, Word Embedding (WE) has emerged as a particularly effective approach for capturing the similarity among words. Hence, in this paper, we propose new Word Embedding-based topic coherence metrics. To determine the usefulness of these new metrics, we compare them with the previous PMI/LSA-based metrics. We also conduct a large-scale crowdsourced user study to determine whether the new Word Embedding-based metrics better align with human preferences. Using two Twitter datasets, our results show that the WE-based metrics can capture the coherence of topics in tweets more robustly and efficiently than the PMI/LSA-based ones.

european conference on information retrieval | 2017

Exploring Time-Sensitive Variational Bayesian Inference LDA for Social Media Data

Anjie Fang; Craig Macdonald; Iadh Ounis; Philip Habel; Xiao Yang

There is considerable interest among both researchers and the mass public in understanding the topics of discussion on social media as they occur over time. Scholars have thoroughly analysed sampling-based topic modelling approaches for various text corpora including social media; however, another LDA topic modelling implementation—Variational Bayesian (VB)—has not been well studied, despite its known efficiency and its adaptability to the volume and dynamics of social media data. In this paper, we examine the performance of the VB-based topic modelling approach for producing coherent topics, and further, we extend the VB approach by proposing a novel time-sensitive Variational Bayesian implementation, denoted as TVB. Our newly proposed TVB approach incorporates time so as to increase the quality of the generated topics. Using a Twitter dataset covering 8 events, our empirical results show that the coherence of the topics in our TVB model is improved by the integration of time. In particular, through a user study, we find that our TVB approach generates less mixed topics than state-of-the-art topic modelling approaches. Moreover, our proposed TVB approach can more accurately estimate topical trends, making it particularly suitable to assist end-users in tracking emerging topics on social media.

european conference on information retrieval | 2016

Topics in Tweets: A User Study of Topic Coherence Metrics for Twitter Data

Anjie Fang; Craig Macdonald; Iadh Ounis; Philip Habel

Twitter offers scholars new ways to understand the dynamics of public opinion and social discussions. However, in order to understand such discussions, it is necessary to identify coherent topics that have been discussed in the tweets. To assess the coherence of topics, several automatic topic coherence metrics have been designed for classical document corpora. However, it is unclear how suitable these metrics are for topic models generated from Twitter datasets. In this paper, we use crowdsourcing to obtain pairwise user preferences of topical coherences and to determine how closely each of the metrics align with human preferences. Moreover, we propose two new automatic coherence metrics that use Twitter as a separate background dataset to measure the coherence of topics. We show that our proposed Pointwise Mutual Information-based metric provides the highest levels of agreement with human preferences of topic coherence over two Twitter datasets.

international acm sigir conference on research and development in information retrieval | 2016

Examining the Coherence of the Top Ranked Tweet Topics

Anjie Fang; Craig Macdonald; Iadh Ounis; Philip Habel

Topic modelling approaches help scholars to examine the topics discussed in a corpus. Due to the popularity of Twitter, two distinct methods have been proposed to accommodate the brevity of tweets: the tweet pooling method and Twitter LDA. Both of these methods demonstrate a higher performance in producing more interpretable topics than the standard Latent Dirichlet Allocation (LDA) when applied on tweets. However, while various metrics have been proposed to estimate the coherence of the generated topics from tweets, the coherence of the top ranked topics, those that are most likely to be examined by users, has not been investigated. In addition, the effect of the number of generated topics K on the topic coherence scores has not been studied. In this paper, we conduct large-scale experiments using three topic modelling approaches over two Twitter datasets, and apply a state-of-the-art coherence metric to study the coherence of the top ranked topics and how K affects such coherence. Inspired by ranking metrics such as precision at n, we use coherence at n to assess the coherence of a topic model. To verify our results, we conduct a pairwise user study to obtain human preferences over topics. Our findings are threefold: we find evidence that Twitter LDA outperforms both LDA and the tweet pooling method because the top ranked topics it generates have more coherence; we demonstrate that a larger number of topics (K) helps to generate topics with more coherence; and finally, we show that coherence at n is more effective when evaluating the coherence of a topic model than the average coherence score.

european conference on information retrieval | 2018

On Refining Twitter Lists as Ground Truth Data for Multi-community User Classification

Ting Su; Anjie Fang; Richard McCreadie; Craig Macdonald; Iadh Ounis

To help scholars and businesses understand and analyse Twitter users, it is useful to have classifiers that can identify the communities that a given user belongs to, e.g. business or politics. Obtaining high quality training data is an important step towards producing an effective multi-community classifier. An efficient approach for creating such ground truth data is to extract users from existing public Twitter lists, where those lists represent different communities, e.g. a list of journalists. However, ground truth datasets obtained using such lists can be noisy, since not all users that belong to a community are good training examples for that community. In this paper, we conduct a thorough failure analysis of a ground truth dataset generated using Twitter lists. We discuss how some categories of users collected from these Twitter public lists could negatively affect the classification performance and therefore should not be used for training. Through experiments with 3 classifiers and 5 communities, we show that removing ambiguous users based on their tweets and profile can indeed result in a 10% increase in F1 performance.

european conference on information retrieval | 2018

On the Reproducibility and Generalisation of the Linear Transformation of Word Embeddings

Xiao Yang; Iadh Ounis; Richard McCreadie; Craig Macdonald; Anjie Fang

Linear transformation is a way to learn a linear relationship between two word embeddings, such that words in the two different embedding spaces can be semantically related. In this paper, we examine the reproducibility and generalisation of the linear transformation of word embeddings. Linear transformation is particularly useful when translating word embedding models in different languages, since it can capture the semantic relationships between two models. We first reproduce two linear transformation approaches, a recent one using orthogonal transformation and the original one using simple matrix transformation. Previous findings on a machine translation task are re-examined, validating that linear transformation is indeed an effective way to transform word embedding models in different languages. In particular, we show that the orthogonal transformation can better relate the different embedding models. Following the verification of previous findings, we then study the generalisation of linear transformation in a multi-language Twitter election classification task. We observe that the orthogonal transformation outperforms the matrix transformation. In particular, it significantly outperforms the random classifier by at least 10% under the F1 metric across English and Spanish datasets. In addition, we also provide best practices when using linear transformation for multi-language Twitter election classification.

conference on information and knowledge management | 2018

An Effective Approach for Modelling Time Features for Classifying Bursty Topics on Twitter

Anjie Fang; Iadh Ounis; Craig Macdonald; Philip Habel; Xiaoyu Xiong; Haitao Yu

Several previous approaches attempted to predict bursty topics on Twitter. Such approaches have usually reported that the time information (e.g. the topic popularity over time) of hashtag topics contribute the most to the prediction of bursty topics. In this paper, we propose a novel approach to use time features to predict bursty topics on Twitter. We model the popularity of topics as density curves described by the density function of a beta distribution with different parameters. We then propose various approaches to predict/classify the bursty topics by estimating the parameters of topics, using estimators such as Gradient Decent or Likelihood Maximization. In our experiments, we show that the estimated parameters of topics have a positive effect on classifying bursty topics. In particular, our estimators when combined together improve the bursty topic classification by 6.9 in terms of micro F1 compared to a baseline classifier using hashtag content features.

international acm sigir conference on research and development in information retrieval | 2015