Kathy Lee | Researchain

Publication

Featured research published by Kathy Lee.

international conference on data mining | 2011

Twitter Trending Topic Classification

Kathy Lee; Diana Palsetia; Ramanathan Narayanan; Md. Mostofa Ali Patwary; Ankit Agrawal; Alok N. Choudhary

With the increasing popularity of microblogging sites, we are in the era of information explosion. As of June 2011, about 200 million tweets are being generated every day. Although Twitter provides a real-time list of the most popular topics people tweet about, known as Trending Topics, it is often hard to understand what these trending topics are about. Therefore, it is important to classify these topics into general categories with high accuracy for better information retrieval. To address this problem, we classify Twitter Trending Topics into 18 general categories such as sports, politics, and technology. We experiment with two approaches for topic classification: (i) the well-known Bag-of-Words approach for text classification and (ii) network-based classification. In the text-based classification method, we construct word vectors from the trending topic definition and tweets, and the commonly used tf-idf weights are used to classify the topics with a Naive Bayes Multinomial classifier. In the network-based classification method, we identify the top 5 similar topics for a given topic based on the number of common influential users. The categories of the similar topics and the number of common influential users between the given topic and its similar topics are used to classify the given topic using a C5.0 decision tree learner. Experiments on a database of 768 randomly selected trending topics (over 18 classes) show that classification accuracy of up to 65% and 70% can be achieved using text-based and network-based classification modeling, respectively.
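The network-based step described in this abstract (ranking the topics that share the most influential users with a given topic) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the toy topics, the user IDs, and the labels are all hypothetical, and the sketch stops at building the features (similar-topic categories and shared-user counts) that the paper then passes to a C5.0 decision tree.

```python
from collections import Counter


def top_similar_topics(target, influencers, k=5):
    """Rank other topics by how many influential users they share with `target`."""
    target_users = influencers[target]
    overlaps = {
        topic: len(target_users & users)
        for topic, users in influencers.items()
        if topic != target
    }
    # Sort by overlap (descending), then by name so ties are deterministic.
    ranked = sorted(overlaps.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep at most k topics and drop those with no shared users.
    return [(topic, shared) for topic, shared in ranked[:k] if shared > 0]


def category_votes(similar, labels):
    """Sum shared-user counts per category of the similar topics."""
    votes = Counter()
    for topic, shared in similar:
        votes[labels[topic]] += shared
    return votes


# Hypothetical toy data: topic -> set of influential user IDs.
influencers = {
    "#worldcup": {"u1", "u2", "u3"},
    "#nbafinals": {"u1", "u2", "u9"},
    "#superbowl": {"u2", "u3"},
    "#election": {"u7", "u8"},
    "#iphone": {"u3", "u8"},
}
# Hypothetical known categories for the already-labeled topics.
labels = {
    "#nbafinals": "sports",
    "#superbowl": "sports",
    "#election": "politics",
    "#iphone": "technology",
}

similar = top_similar_topics("#worldcup", influencers)
votes = category_votes(similar, labels)
print(similar)
print(votes.most_common())
```

In this toy setup the per-category totals only summarize the features; any classifier trained on such features would stand in for the C5.0 learner named in the abstract.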

knowledge discovery and data mining | 2013

Real-time disease surveillance using Twitter data: demonstration on flu and cancer

Kathy Lee; Ankit Agrawal; Alok N. Choudhary

Social media is producing massive amounts of data on an unprecedented scale. Here people share their experiences and opinions on various topics, including personal health issues, symptoms, treatments, side-effects, and so on. This makes publicly available social media data an invaluable resource for mining interesting and actionable healthcare insights. In this paper, we describe a novel real-time flu and cancer surveillance system that uses spatial, temporal, and text mining on Twitter data. The real-time analysis results are reported visually in terms of US disease surveillance maps, distribution and timelines of disease types, symptoms, and treatments, in addition to overall disease activity timelines on our project website. Our surveillance system can be very useful not only for early prediction of seasonal disease outbreaks such as flu, but also for monitoring distribution of cancer patients with different cancer types and symptoms in each state and the popularity of treatments used. The resulting insights are expected to help facilitate faster response to and preparation for epidemics and also be very useful for both patients and doctors to make more informed decisions.

Communications of The ACM | 2012

Social media evolution of the Egyptian revolution

Alok N. Choudhary; William Hendrix; Kathy Lee; Diana Palsetia; Wei-keng Liao

Twitter sentiment was revealed, along with popularity of Egypt-related subjects and tweeter influence on the 2011 revolution.

international conference on data mining | 2011

SES: Sentiment Elicitation System for Social Media Data

Kunpeng Zhang; Yu Cheng; Yusheng Xie; Daniel Honbo; Ankit Agrawal; Diana Palsetia; Kathy Lee; Wei-keng Liao; Alok N. Choudhary

Social media is becoming a major and popular technological platform that allows users to discuss and share information. Information is generated and managed through either computers or mobile devices by one person and consumed by many others. Most of this user-generated content is textual, found on social networks (Facebook, LinkedIn), microblogging sites (Twitter), and blogs (Blogspot, WordPress). Finding valuable nuggets of knowledge, such as capturing and summarizing sentiments from this huge amount of data, could help users make informed decisions. In this paper, we develop a sentiment identification system called SES, which implements three different sentiment identification algorithms. We augment basic compositional semantic rules in the first algorithm. In the second algorithm, we argue that sentiment should not simply be classified as positive, negative, or objective, but should be a continuous score that reflects the degree of sentiment. All word scores are calculated based on a large volume of customer reviews. Due to the special characteristics of social media texts, we propose a third algorithm that takes emoticons, negation word position, and domain-specific words into account. Furthermore, a machine learning model is employed on features derived from the outputs of the three algorithms. We conduct our experiments on user comments from Facebook and tweets from Twitter. The results show that Random Forest achieves better accuracy than decision tree, neural network, and logistic regression models. We also propose a flexible way to represent document sentiment based on the sentiment of each sentence it contains. SES is available online.

international world wide web conferences | 2017

Adverse Drug Event Detection in Tweets with Semi-Supervised Convolutional Neural Networks

Kathy Lee; Ashequl Qadir; Sadid A. Hasan; Vivek V. Datla; Aaditya Prakash; Joey Liu; Oladimeji Farri

Current Adverse Drug Events (ADE) surveillance systems are often associated with a sizable time lag before such events are published. Online social media such as Twitter could describe adverse drug events in real-time, prior to official reporting. Deep learning has significantly improved text classification performance in recent years and can potentially enhance ADE classification in tweets. However, these models typically require large corpora with human expert-derived labels, and such resources are very expensive to generate and are hardly available. Semi-supervised deep learning models, which offer a plausible alternative to fully supervised models, involve the use of a small set of labeled data and a relatively larger collection of unlabeled data for training. Traditionally, these models are trained on labeled and unlabeled data from similar topics or domains. In reality, millions of tweets generated daily often focus on disparate topics, and this could present a challenge for building deep learning models for ADE classification with random Twitter stream as unlabeled training data. In this work, we build several semi-supervised convolutional neural network (CNN) models for ADE classification in tweets, specifically leveraging different types of unlabeled data in developing the models to address the problem. We demonstrate that, with the selective use of a variety of unlabeled data, our semi-supervised CNN models outperform a strong state-of-the-art supervised classification model by +9.9% F1-score. We evaluated our models on the Twitter data set used in the PSB 2016 Social Media Shared Task. Our results present the new state-of-the-art for this data set.

annual computer security applications conference | 2014

Spam ain't as diverse as it seems: throttling OSN spam with templates underneath

Hongyu Gao; Yi Yang; Kai Bu; Yan Chen; Doug Downey; Kathy Lee; Alok N. Choudhary

In online social networks (OSNs), spam originating from friends and acquaintances not only reduces the joy of Internet surfing but also causes damage to less security-savvy users. Prior countermeasures combat OSN spam from different angles. Due to the diversity of spam, there is hardly any existing method that can independently detect the majority of OSN spam. In this paper, we empirically analyze the textual pattern of a large collection of OSN spam. An inspiring finding is that the majority (63.0%) of the collected spam is generated with underlying templates. We therefore propose extracting templates of spam detected by existing methods and then matching messages against the templates for accurate and fast spam detection. We implement this insight through Tangram, an OSN spam filtering system that performs online inspection on the stream of user-generated messages. Tangram automatically divides OSN spam into segments and uses the segments to construct templates to filter future spam. Experimental results show that Tangram is highly accurate and can rapidly generate templates to throttle newly emerged campaigns. Specifically, Tangram detects the most prevalent template-based spam with a 95.7% true positive rate, whereas the existing template generation approach detects only 32.3%. The integration of Tangram and its auxiliary spam filter achieves an overall true positive rate of 85.4% and a false positive rate of 0.33%.
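The general idea of template-based filtering (keep the segments that known spam messages share, then flag new messages that contain those segments in order) can be illustrated with a small sketch. This is not Tangram's algorithm: the token alignment via Python's `difflib.SequenceMatcher` is a swapped-in stand-in, and the function names, the `min_len` threshold, and the example messages are all hypothetical.

```python
from difflib import SequenceMatcher


def extract_segments(msg_a, msg_b, min_len=2):
    """Return token runs shared by two known-spam messages, in order.

    Runs shorter than `min_len` tokens are dropped as too generic.
    """
    a, b = msg_a.lower().split(), msg_b.lower().split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        tuple(a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_len
    ]


def matches(template, message):
    """True if every template segment occurs in the message, in order."""
    tokens = message.lower().split()
    pos = 0
    for seg in template:
        n = len(seg)
        for i in range(pos, len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == seg:
                pos = i + n  # later segments must appear after this one
                break
        else:
            return False  # this segment was not found
    return bool(template)  # an empty template matches nothing


# Hypothetical spam already flagged by some existing detector.
spam_1 = "Hey Alice check out this free iPhone deal at http://a.example now"
spam_2 = "Hey Bob check out this free iPad deal at http://b.example now"
template = extract_segments(spam_1, spam_2)

new_spam = "Hey Carol check out this free laptop deal at http://c.example today"
legit = "Can you check out this free concert with me at noon"
print(template)
print(matches(template, new_spam), matches(template, legit))
```

A real system would build templates from many messages per campaign and would need to handle reordered or obfuscated segments, which this two-message sketch does not attempt.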

IEEE ACM Transactions on Networking | 2016

Beating the Artificial Chaos: Fighting OSN Spam Using Its Own Templates

Tiantian Zhu; Hongyu Gao; Yi Yang; Kai Bu; Yan Chen; Doug Downey; Kathy Lee; Alok N. Choudhary

Online social networks (OSNs) are extremely popular among Internet users. However, spam originating from friends and acquaintances not only reduces the joy of Internet surfing but also causes damage to less security-savvy users. Prior countermeasures combat OSN spam from different angles. Due to the diversity of spam, there is hardly any existing method that can independently detect the majority of OSN spam. In this paper, we empirically analyze the textual pattern of a large collection of OSN spam. An inspiring finding is that the majority (e.g., 76.4% in 2015) of the collected spam is generated with underlying templates. Based on the analysis, we propose Tangram, an OSN spam filtering system that performs online inspection on the stream of user-generated messages. Tangram extracts the templates of spam detected by existing methods and then matches messages against the templates for accurate and fast spam detection. It automatically divides the OSN spam into segments and uses the segments to construct templates to filter future spam. Experimental results on Twitter and Facebook data sets show that Tangram is highly accurate and can rapidly generate templates to throttle newly emerged campaigns. Furthermore, we analyze the behavior of detected OSN spammers. We find a series of spammer properties (for example, spamming accounts are created in bursts, and a single active organization orchestrates more spam than all other spammers combined) that promise more comprehensive spam countermeasures.

ieee international conference on healthcare informatics | 2017

Forecasting Influenza Levels Using Real-Time Social Media Streams

Kathy Lee; Ankit Agrawal; Alok N. Choudhary

Seasonal influenza is a contagious respiratory illness that can cause various complications, worsen chronic illnesses, and sometimes lead to death. During the 2009 H1N1 flu pandemic, up to 203,000 deaths occurred worldwide. Early detection and prediction of disease outbreaks is critical because it can provide more time to prepare a response and significantly reduce the impact caused by a pandemic. The traditional influenza surveillance system run by the Centers for Disease Control and Prevention (CDC) collects U.S. Influenza-Like Illness related physician visit data from sentinel practices and provides a retrospective analysis delayed by two weeks. Google Flu Trends proposed a method that uses online search query data to estimate current (real-time) influenza activity. Here we present a system that (1) predicts future influenza activity, (2) provides more accurate real-time assessment than before, and (3) combines real-time big social media data streams and CDC historical datasets in predictive models to achieve accurate predictions. Although retrospective analysis and observations are important, prediction of future flu levels can represent a big leap because such predictions provide actionable insights for public health that can be used for planning, resource allocation, treatment, and prevention. Thus, compared to previous work, our work represents an advancement in the accuracy of assessments, the accurate prediction of future flu activity, and the ability to combine big social data and observed CDC data to build predictive models.

ieee international conference on healthcare informatics | 2017

Medical Concept Normalization for Online User-Generated Texts

Kathy Lee; Sadid A. Hasan; Oladimeji Farri; Alok N. Choudhary; Ankit Agrawal

Social media has become an important tool for sharing content in the last decade. People often talk about their experiences and opinions on different health-related issues; for example, they write reviews of medications, describe symptoms, and ask informal questions about various health concerns. Due to the colloquial nature of the language used on social media, it is often difficult for an automated system to accurately interpret it for appropriate clinical understanding. To address this challenge, this paper proposes a novel approach for medical concept normalization of user-generated texts that maps a health condition described in colloquial language to a medical concept defined in standard clinical terminologies. We use multiple deep learning architectures such as convolutional neural networks (CNN) and recurrent neural networks (RNN) with input word embeddings trained on various clinical domain-specific knowledge sources. Extensive experiments on two benchmark datasets demonstrate that the proposed models can achieve up to 21.28% accuracy improvements over the existing models when we use the combination of all knowledge sources to learn neural embeddings.

network and distributed system security symposium | 2012