Xiaomo Liu
Thomson Reuters
Publication
Featured research published by Xiaomo Liu.
conference on information and knowledge management | 2016
Xiaomo Liu; Quanzhi Li; Armineh Nourbakhsh; Rui Fang; Merine Thomas; Kajsa Anderson; Russ Kociuba; Mark Vedder; Steven Pomerville; Ramdev Wudali; Robert Martin; John Duprey; Arun Vachher; William M. Keenan; Sameena Shah
News professionals face the challenge of discovering news from increasingly diverse and unreliable information in the age of social media. More and more news events break on social media first and are picked up by news media subsequently. The recent Brussels attack is one such example. At Reuters, a global news agency, we have observed the need for a more effective tool that helps our journalists quickly discover news on social media, verify it, and then inform the public. In this paper, we describe Reuters Tracer, a system for sifting through the noise to detect news events on Twitter and assess their veracity. We disclose the architecture of our system and discuss the various design strategies that facilitate the implementation of machine learning models for noise filtering and event detection. These techniques have been implemented at large scale and have successfully discovered breaking news faster than traditional journalism.
conference on information and knowledge management | 2016
Quanzhi Li; Sameena Shah; Xiaomo Liu; Armineh Nourbakhsh; Rui Fang
Classifying tweets into topic categories is necessary and important for many applications, since tweets cover a variety of topics and users are only interested in certain topical areas. Many tweet classification approaches fail to achieve high accuracy due to the data sparseness issue. A tweet, as a special type of short text, has metadata in addition to its text that can be used to enrich its context, such as user name, mention, hashtag and embedded link. In this demonstration, we present TweetSift, an efficient and effective real-time tweet topic classifier. TweetSift exploits external tweet-specific entity knowledge to provide more topical context for a tweet, and integrates it with topic-enhanced word embeddings for topic classification. The demonstration will show how TweetSift works and how it is incorporated into our social media event detection system.
web intelligence | 2016
Quanzhi Li; Sameena Shah; Xiaomo Liu; Armineh Nourbakhsh; Rui Fang
Many classification tasks on short text, such as tweets, fail to achieve high accuracy due to data sparseness. One approach to solving this problem is to enrich the context of the data by using external data sources, or distributed language representations trained on huge amounts of data. In this paper, we present several tweet topic classification methods that exploit different types of data: tweet text, tweet text plus an entity knowledge base, word embeddings derived from tweet text, distributed representations of tweets, and topical word embeddings. The word embedding, topical word embedding and sentence representation models are generated without supervision from billions of words from tweets. To the best of our knowledge, this is the first study to apply distributed language representations to the tweet topic classification task.
international conference on data engineering | 2017
Quanzhi Li; Armineh Nourbakhsh; Sameena Shah; Xiaomo Liu
In this paper, we present a new approach for detecting novel events from social media, especially Twitter, in real time. An event is usually defined by who, what, where and when, and an event tweet usually contains terms corresponding to these aspects. To exploit this information, we propose a method that incorporates simple semantics by splitting the tweet term space into groups of terms that have the same type of meaning. These groups are called semantic categories (classes), and each reflects one or more event aspects. The semantic classes include named entity, mention, location, hashtag, verb, noun and embedded link. To group tweets about the same event into the same cluster, similarity is measured by calculating class-wise similarities and then aggregating them. Users of a real-time event detection system are usually only interested in novel (new) events, which are happening now or happened a short time ago. To fulfill this requirement, a temporal identification module is used to filter out event clusters that are about old stories. The clustering module also computes a novelty score for each event cluster, which reflects how novel the event is compared to previous events. We evaluated our event detection method using multiple quality metrics and a large-scale event corpus containing millions of tweets. The experimental results show that the proposed online event detection method achieves state-of-the-art performance. Our experiments also show that the temporal identification module can effectively detect old events.
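The class-wise similarity idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the per-class measure, the weights, or the aggregation rule, so the Jaccard overlap, the `CLASS_WEIGHTS` values, and the weighted average below are all assumptions chosen for illustration, not the authors' method.

```python
from typing import Dict, Set

# Semantic classes named in the abstract; the weights are hypothetical.
CLASS_WEIGHTS: Dict[str, float] = {
    "named_entity": 2.0,
    "location": 2.0,
    "mention": 1.0,
    "hashtag": 1.0,
    "verb": 1.0,
    "noun": 1.0,
    "link": 1.0,
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two term sets (assumed per-class measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classwise_similarity(
    t1: Dict[str, Set[str]],
    t2: Dict[str, Set[str]],
    weights: Dict[str, float] = CLASS_WEIGHTS,
) -> float:
    """Weighted average of per-class similarities.

    Classes that are empty in both tweets are skipped so that they
    do not pull the aggregate score toward zero.
    """
    total = 0.0
    weight_sum = 0.0
    for cls, w in weights.items():
        a = t1.get(cls, set())
        b = t2.get(cls, set())
        if not a and not b:
            continue
        total += w * jaccard(a, b)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Two tweets already split into semantic classes (toy data).
tweet_a = {"named_entity": {"brussels"}, "hashtag": {"breaking"}, "verb": {"hit"}}
tweet_b = {"named_entity": {"brussels"}, "hashtag": {"news"}, "verb": {"hit"}}
score = classwise_similarity(tweet_a, tweet_b)
```

In a clustering setting, a score like this would be compared against a threshold to decide whether an incoming tweet joins an existing event cluster or starts a new one; the threshold would be another tuned parameter.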
international conference on data mining | 2015
Armineh Nourbakhsh; Xiaomo Liu; Sameena Shah; Rui Fang; Mohammad M. Ghassemi; Quanzhi Li
Rumor events differ in how and where they originate, what topics they address, the emotions they invoke, and how they engage their audience. In this paper, we study various semantic aspects of rumors and analyze the motivational and functional roles they play. Using Twitter as a case study, we develop a framework to characterize rumors. Our characterization covers intrinsic and extrinsic factors at the tweet and event levels, as well as usage analysis. We determine the roles various user types play and analyze rumor propagation from both re-tweeting and burstiness perspectives.
conference on information and knowledge management | 2016
Quanzhi Li; Sameena Shah; Armineh Nourbakhsh; Xiaomo Liu; Rui Fang
In this paper, we present a new approach to recommending hashtags for tweets. It uses a learning-to-rank algorithm to incorporate features built from topic-enhanced word embeddings, tweet entity data, hashtag frequency, hashtag temporal data and tweet URL domain information. Experiments using millions of tweets and hashtags show that the proposed approach outperforms three baseline methods: the LDA topic, the tf.idf-based and the general word embedding approaches.
empirical methods in natural language processing | 2016
Rui Fang; Armineh Nourbakhsh; Xiaomo Liu; Sameena Shah; Quanzhi Li
Identifying witness accounts is important for rumor debunking, crisis management, and basically any task that involves eyes on the ground. The prevalence of social media has given citizen journalism scale and eyewitnesses prominence. However, the amount of noise on social media also makes it likely that witness accounts get buried too deep in the noise and are never discovered. In this paper, we explore automatic witness identification on Twitter during emergency events. We attempt to create a generalizable system that not only detects witness reports for unseen events, but also works on a true out-of-sample “real-time streaming set” that may or may not contain witness accounts. We attempt to detect the presence or surge of witness accounts, which is the first step in developing a model for detecting crisis-related events. We collect and annotate witness tweets for different types of events (earthquake, car accident, fire, cyclone, etc.), explore the related features, and build a classifier to identify witness tweets in real time. Our system significantly outperforms prior methods, with an average F-score of 89.7% on previously unseen events.
web intelligence | 2016
Quanzhi Li; Sameena Shah; Rui Fang; Armineh Nourbakhsh; Xiaomo Liu
Previous studies have used many manually identified features and word embeddings for tweet sentiment classification. In this paper, we propose a new approach that incorporates sentiment-specific word embeddings (SSWE) and a weighted text feature model (WTFM). WTFM produces features based on text negation, the tf.idf weighting scheme, and a Rocchio text classification method. Compared to other tweet sentiment feature generation approaches, WTFM is simple and easy to build, yet effective. Experiments show that the proposed approach outperforms two state-of-the-art tweet sentiment classification methods: SSWE and the National Research Council Canada (NRC) model.
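The three ingredients this abstract names (negation handling, tf.idf weighting, and Rocchio classification) can be combined in a small sketch. This is not the authors' WTFM: the negation rule, the `NEGATORS` list, the smoothed idf formula, and all function names are assumptions made for illustration. Rocchio classification is shown in its usual nearest-centroid form, where each class is represented by the mean of its tf.idf vectors.

```python
import math
from collections import Counter, defaultdict

NEGATORS = {"not", "no", "never"}       # hypothetical negation cue list
CLAUSE_END = {".", ",", "!", "?", ";"}  # punctuation that ends a negation scope


def mark_negation(tokens):
    """Prefix tokens that follow a negation cue with NOT_ until punctuation."""
    out, negated = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            negated = False
            out.append(tok)
        elif tok in NEGATORS:
            negated = True
            out.append(tok)
        else:
            out.append("NOT_" + tok if negated else tok)
    return out


def fit_idf(docs):
    """Inverse document frequency with +1 smoothing (an assumed variant)."""
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(len(docs) / df[t]) + 1.0 for t in df}


def tfidf(doc, idf):
    """Sparse tf.idf vector; terms unseen in training are dropped."""
    tf = Counter(doc)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_rocchio(docs, labels):
    """Return one centroid (mean tf.idf vector) per class, plus the idf table."""
    idf = fit_idf(docs)
    sums, counts = defaultdict(lambda: defaultdict(float)), Counter(labels)
    for doc, label in zip(docs, labels):
        for t, w in tfidf(doc, idf).items():
            sums[label][t] += w
    centroids = {y: {t: w / counts[y] for t, w in vec.items()} for y, vec in sums.items()}
    return centroids, idf


def predict(doc, centroids, idf):
    """Assign the class whose centroid is most cosine-similar to the document."""
    vec = tfidf(doc, idf)
    return max(centroids, key=lambda y: cosine(vec, centroids[y]))


# Toy training data: tokenized tweets with sentiment labels.
train_docs = [mark_negation(d) for d in (
    ["good", "great"], ["love", "great"], ["bad", "awful"], ["not", "good"],
)]
train_labels = ["pos", "pos", "neg", "neg"]
centroids, idf = train_rocchio(train_docs, train_labels)
```

With this toy data, `predict(mark_negation(["not", "good"]), centroids, idf)` lands in the negative class because the negated token `NOT_good` only occurs in negative training examples, which is the effect negation marking is meant to achieve.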
international conference on big data | 2016
Quanzhi Li; Sameena Shah; Mohammad M. Ghassemi; Rui Fang; Armineh Nourbakhsh; Xiaomo Liu
Two of the major problems in social media message classification are the data sparseness issue and the high degree of lexical variation. Paraphrases, or synonyms, are alternative ways of expressing the same meaning using different wording. In this study, we use paraphrases to improve tweet topic classification performance. We explored two approaches to generating paraphrases: WordNet, which is a lexical database grouping English words into sets of synonyms, and word embeddings, which are learned from millions of tweets and billions of words. Our experiments show that using paraphrases can improve topic classification, and that the word embedding approach outperforms the WordNet method. To our knowledge, this is the first study to exploit paraphrases for tweet classification.
conference on information and knowledge management | 2015
Xiaomo Liu; Armineh Nourbakhsh; Quanzhi Li; Rui Fang; Sameena Shah