Fred Morstatter
Arizona State University
Publication
Featured research published by Fred Morstatter.
ACM Computing Surveys | 2017
Jundong Li; Kewei Cheng; Suhang Wang; Fred Morstatter; Robert P. Trevino; Jiliang Tang; Huan Liu
Feature selection, as a data preprocessing strategy, has been proven to be effective and efficient in preparing data (especially high-dimensional data) for various data-mining and machine-learning problems. The objectives of feature selection include building simpler and more comprehensible models, improving data-mining performance, and preparing clean, understandable data. The recent proliferation of big data has presented some substantial challenges and opportunities to feature selection. In this survey, we provide a comprehensive and structured overview of recent advances in feature selection research. Motivated by current challenges and opportunities in the era of big data, we revisit feature selection research from a data perspective and review representative feature selection algorithms for conventional data, structured data, heterogeneous data and streaming data. Methodologically, to emphasize the differences and similarities of most existing feature selection algorithms for conventional data, we categorize them into four main groups: similarity-based, information-theoretical-based, sparse-learning-based, and statistical-based methods. To facilitate and promote the research in this community, we also present an open source feature selection repository that consists of most of the popular feature selection algorithms (http://featureselection.asu.edu/). Also, we use it as an example to show how to evaluate feature selection algorithms. At the end of the survey, we present a discussion about some open problems and challenges that require more attention in future research.
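The filter-style methods the survey categorizes can be illustrated with a minimal sketch (not taken from the survey or its repository; the data and helper names are made up): a statistical-based filter that ranks features by the absolute Pearson correlation between each feature and the label, keeping the top k.

```python
# Minimal statistical-based filter: score each feature by |Pearson r|
# with the label and keep the k highest-scoring feature indices.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_top_k(X, y, k):
    """X: list of sample rows; returns indices of the top-k features."""
    n_feats = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_feats)]
    return sorted(range(n_feats), key=lambda j: -scores[j])[:k]

# Toy data: feature 0 tracks the label exactly, feature 1 is noise.
X = [[1, 5], [2, 3], [3, 9], [4, 1]]
y = [1, 2, 3, 4]
print(select_top_k(X, y, 1))  # → [0]
```

Wrapper, embedded, and the survey's sparse-learning-based methods differ in how the scoring interacts with a learner; this sketch shows only the simplest filter pattern.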
knowledge discovery and data mining | 2013
Fred Morstatter; Shamanth Kumar; Huan Liu; Ross Maciejewski
In the era of big data it is increasingly difficult for an analyst to extract meaningful knowledge from a sea of information. We present TweetXplorer, a system for analysts with little information about an event to gain knowledge through the use of effective visualization techniques. Using tweets collected during Hurricane Sandy as an example, we will lead the reader through a workflow that exhibits the functionality of the system.
advances in social networks analysis and mining | 2016
Fred Morstatter; Liang Wu; Tahora H. Nazer; Kathleen M. Carley; Huan Liu
The presence of bots has been felt in many aspects of social media. Twitter, one example of social media, has especially felt the impact, with bots accounting for a large portion of its users. These bots have been used for malicious tasks such as spreading false information about political candidates and inflating the perceived popularity of celebrities. Furthermore, these bots can change the results of common analyses performed on social media. It is important that researchers and practitioners have tools in their arsenal to remove them. Approaches exist to remove bots; however, they focus on precision when evaluating their models, at the cost of recall. This means that while these approaches are almost always correct in the bots they delete, they ultimately delete very few, thus many bots remain. We propose a model which increases the recall in detecting bots, allowing a researcher to delete more bots. We evaluate our model on two real-world social media datasets and show that our detection algorithm removes more bots from a dataset than current approaches.
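The precision/recall trade-off the abstract describes can be sketched in a few lines (a hedged illustration, not the paper's model; scores and labels are invented): lowering a bot classifier's decision threshold flags more accounts, raising recall at some cost in precision.

```python
# Compute precision and recall for a thresholded bot classifier.
def precision_recall(scores, labels, threshold):
    """scores: bot-likelihood per account; labels: 1 = bot, 0 = human."""
    pred = [s >= threshold for s in scores]
    tp = sum(1 for p, l in zip(pred, labels) if p and l)
    fp = sum(1 for p, l in zip(pred, labels) if p and not l)
    fn = sum(1 for p, l in zip(pred, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.9, 0.6, 0.55, 0.4, 0.2]
labels = [1,    1,   1,   0,    1,   0]
print(precision_recall(scores, labels, 0.9))  # strict: (1.0, 0.5)
print(precision_recall(scores, labels, 0.5))  # lenient: (0.75, 0.75)
```

The strict threshold is almost always right about the bots it flags but misses half of them, which is exactly the failure mode the paper argues against.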
computational social science | 2014
Fred Morstatter; Nichola Lubold; Heather Pon-Barry; Jürgen Pfeffer; Huan Liu
Disaster response agencies incorporate social media as a source of fast-breaking information to understand the needs of people affected by the many crises that occur around the world. These agencies look for tweets from within the region affected by the crisis to get the latest updates on the status of the affected region. However, only 1% of all tweets are “geotagged” with explicit location information. In this work we seek to identify non-geotagged tweets that originate from within the crisis region. Towards this, we address three questions: (1) is there a difference between the language of tweets originating within a crisis region, (2) what linguistic patterns differentiate within-region and outside-region tweets, and (3) can we automatically identify those originating within the crisis region in real-time?
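One way to picture the linguistic signal the paper looks for (a toy sketch with invented tweets; the paper's actual features and classifier are not reproduced here): words far more frequent in a small in-region sample than in an outside-region sample become indicators, and an unseen tweet is scored by how many indicators it contains.

```python
# Toy region-language indicator: keep words whose in-region count is
# at least `ratio` times their outside-region count.
from collections import Counter

def indicator_words(in_region, out_region, ratio=2.0):
    in_c = Counter(w for t in in_region for w in t.lower().split())
    out_c = Counter(w for t in out_region for w in t.lower().split())
    # A missing outside-region word is treated as count 0.5 so that
    # region-only words still pass the ratio test.
    return {w for w, c in in_c.items() if c >= ratio * (out_c[w] or 0.5)}

in_tweets = ["power out on 5th ave", "flooding near the boardwalk"]
out_tweets = ["praying for everyone on the coast", "stay safe out there"]
inds = indicator_words(in_tweets, out_tweets)
score = sum(w in inds for w in "power still out near the boardwalk".split())
print(score)  # → 3
```

Locally-grounded terms ("boardwalk", "5th ave") survive the ratio test while generic sympathy vocabulary shared with outside-region tweets does not, which mirrors question (2) in the abstract.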
international world wide web conferences | 2016
Fred Morstatter; Harsh Dani; Justin Sampson; Huan Liu
While social media mining continues to be an active area of research, obtaining data for research is a perennial problem. Even more, obtaining unbiased data is a challenge for researchers who wish to study human behavior, and not technical artifacts induced by the sampling algorithm of a social media site. In this work, we evaluate one social media data outlet that gives data to its users in the form of a stream: Twitter's Sample API. We show that in its current form, this API can be poisoned by bots or spammers who wish to promote their content, jeopardizing the credibility of the data collected through this API. We design a proof-of-concept algorithm that shows how malicious users could increase the probability of their content appearing in the Sample API, thus biasing the content towards spam and bot content and harming the representativity of this data outlet.
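The paper's proof-of-concept is not reproduced here, but the general shape of such an attack can be simulated under a purely hypothetical assumption: if the sample were admitted deterministically by a fixed timestamp window rather than at random, an attacker who times posts to land inside that window would be sampled far more often than organic traffic.

```python
# Hypothetical simulation: deterministic window-based sampling vs.
# an attacker who deliberately posts inside the window. The 10 ms
# window below is an assumption for illustration only.
import random

WINDOW = range(657, 667)  # assumed 10 ms admission window per second

def sampled(ms):
    return ms in WINDOW

random.seed(0)
N = 10_000
organic = sum(sampled(random.randrange(1000)) for _ in range(N))
timed = sum(sampled(660) for _ in range(N))  # attacker aims at the window
print(organic / N, timed / N)  # roughly 0.01 vs exactly 1.0
```

Organic tweets hit the window about 1% of the time, matching the nominal sample rate, while every timed tweet is admitted, which is the kind of bias amplification the abstract warns about.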
acm conference on hypertext | 2015
Justin Sampson; Fred Morstatter; Ross Maciejewski; Huan Liu
Social media services have become a prominent source of research data for both academia and corporate applications. Data from social media services is easy to obtain, highly structured, and comprises opinions from a large number of extremely diverse groups. The microblogging site, Twitter, has garnered a particularly large following from researchers by offering a high volume of data streamed in real time. Unfortunately, the methods by which Twitter selects data to disseminate through the stream are either vague or unpublished. Since Twitter maintains sole control of the sampling process, it leaves us with no knowledge of how the data that we collect for research is selected. Additionally, past research has shown that there are sources of bias present in Twitter's dissemination process. Such bias introduces noise into the data that can reduce the accuracy of learning models and lead to bad inferences. In this work, we take an initial look at the efficiency of Twitter's limit track messages as a sample-population estimator. After that, we provide methods to mitigate bias by improving sample population coverage using clustering techniques.
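The limit-track idea can be sketched concretely (a minimal illustration; the paper's estimator may be more involved): Twitter's streaming API emits rate-limit notices carrying a cumulative count of matching tweets that were withheld, so the full matching population is approximately the tweets delivered plus the last reported limit count.

```python
# Estimate what fraction of the matching tweet population the
# stream actually delivered, using the final limit-notice count.
def coverage(received, last_limit_count):
    """Fraction of the matching population actually delivered."""
    total = received + last_limit_count
    return received / total if total else 1.0

# e.g. 42,000 tweets delivered while the final limit notice
# reports 18,000 undelivered:
print(round(coverage(42_000, 18_000), 2))  # → 0.7
```

A coverage estimate like this is what makes it possible to quantify, rather than merely suspect, how much of the population a biased sample is missing.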
conference on information and knowledge management | 2016
Justin Sampson; Fred Morstatter; Liang Wu; Huan Liu
The automatic and early detection of rumors is of paramount importance as the spread of information with questionable veracity can have devastating consequences. This became starkly apparent when, in early 2013, a compromised Associated Press account issued a tweet claiming that there had been an explosion at the White House. This tweet resulted in a significant drop for the Dow Jones Industrial Average. Most existing work in rumor detection leverages conversation statistics and propagation patterns; however, such patterns tend to emerge slowly, requiring a conversation to have a significant number of interactions in order to become eligible for classification. In this work, we propose a method for classifying conversations within their formative stages as well as improving accuracy within mature conversations through the discovery of implicit linkages between conversation fragments. In our experiments, we show that current state-of-the-art rumor classification methods can leverage implicit links to significantly improve the ability to properly classify emergent conversations when very little conversation data is available. Adopting this technique allows rumor detection methods to continue to provide a high degree of classification accuracy on emergent conversations with as few as a single tweet. This improvement virtually eliminates the delay of conversation growth inherent in current rumor classification methods while significantly increasing the number of conversations considered viable for classification.
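What an "implicit link" between conversation fragments might look like can be sketched with an assumed linking criterion (the paper's actual construction may differ; the data is invented): two fragments are linked when they share a hashtag, letting an emergent single-tweet conversation borrow signal from a mature one.

```python
# Link conversation fragments that share at least one hashtag.
def implicit_links(fragments):
    """fragments: {id: set of hashtags}. Returns linked id pairs."""
    ids = sorted(fragments)
    return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
            if fragments[a] & fragments[b]]

frags = {
    "c1": {"#whitehouse", "#explosion"},  # mature conversation
    "c2": {"#explosion"},                 # single emergent tweet
    "c3": {"#weather"},                   # unrelated
}
print(implicit_links(frags))  # → [('c1', 'c2')]
```

Once linked, features computed on the mature fragment can inform the classification of the one-tweet fragment, which is the mechanism the abstract credits for eliminating the growth delay.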
PLOS ONE | 2014
Dror Y. Kenett; Fred Morstatter; H. Eugene Stanley; Huan Liu
Twitter is a major social media platform in which users send and read messages (“tweets”) of up to 140 characters. In recent years this communication medium has been used by those affected by crises to organize demonstrations or find relief. Because traffic on this media platform is extremely heavy, with hundreds of millions of tweets sent every day, it is difficult to differentiate between times of turmoil and times of typical discussion. In this work we present a new approach to addressing this problem. We first assess several possible “thermostats” of activity on social media for their effectiveness in finding important time periods. We compare methods commonly found in the literature with a method from economics. By combining methods from computational social science with methods from economics, we introduce an approach that can effectively locate crisis events in the mountains of data generated on Twitter. We demonstrate the strength of this method by using it to locate the social events relating to the Occupy Wall Street movement protests at the end of 2011.
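One of the "methods commonly found in the literature" that such a system would be compared against can be sketched as a z-score spike detector on tweet volume (this is a baseline illustration with invented counts, not the economics-derived method the paper introduces).

```python
# Flag time bins whose tweet volume exceeds the mean by more than
# z population standard deviations.
import statistics

def spikes(volumes, z=2.5):
    mu = statistics.mean(volumes)
    sd = statistics.pstdev(volumes)
    return [i for i, v in enumerate(volumes) if sd and (v - mu) / sd > z]

hourly = [100, 97, 103, 99, 101, 98, 400, 102]  # burst at index 6
print(spikes(hourly))  # → [6]
```

Simple thresholds like this fire on any volume burst, crisis or not; the abstract's point is that a better-calibrated "thermostat" is needed to separate turmoil from routinely heavy traffic.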
knowledge discovery and data mining | 2012
Shamanth Kumar; Fred Morstatter; Grant Marshall; Huan Liu; Ullas Nambiar
Recent years have seen an exponential increase in the number of users of social media sites. As the number of users of these sites continues to grow at an extraordinary rate, the amount of data produced follows in magnitude. With this deluge of social media data, the need for comprehensive tools to analyze user interactions is ever increasing. In this paper, we present a novel tool, Navigating Information Facets on Twitter (NIF-T), which helps users explore data generated on social media sites. Using three facets, time, location, and topic, as examples of the many possible facets, we enable users to explore large social media datasets. With the help of a large corpus of tweets collected from the Occupy Wall Street movement on the Twitter platform, we show how our system can be used to identify important aspects of the event along these facets.
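The faceted-exploration idea can be sketched with a simplified data model (an illustration, not NIF-T's implementation; the tweet records are invented): index tweets along the time, location, and topic facets so each facet can be browsed independently.

```python
# Build an independent index for each facet of a tweet collection.
from collections import defaultdict

def facet_index(tweets):
    index = {"time": defaultdict(list),
             "location": defaultdict(list),
             "topic": defaultdict(list)}
    for t in tweets:
        for facet in index:
            index[facet][t[facet]].append(t["id"])
    return index

tweets = [
    {"id": 1, "time": "2011-10", "location": "NYC", "topic": "#ows"},
    {"id": 2, "time": "2011-11", "location": "NYC", "topic": "#ows"},
    {"id": 3, "time": "2011-11", "location": "LA",  "topic": "#jobs"},
]
idx = facet_index(tweets)
print(idx["location"]["NYC"])  # → [1, 2]
```

Drilling into one facet value (all NYC tweets, all November tweets, all #ows tweets) while leaving the other facets free is what lets an analyst slice a large corpus along whichever dimension is most informative.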
advances in social networks analysis and mining | 2013
Kathleen M. Carley; Jürgen Pfeffer; Huan Liu; Fred Morstatter; Rebecca Goolsby
When a crisis occurs, there is often little time to evaluate the situation and determine how best to respond. We use rapid ethnographic methods centered on the construction of geo-temporally contextualized social and knowledge networks. By utilizing a combination of Twitter and news media, the consulate attack in Libya was examined in near real time. In this work we outline a procedure to extract key insights from the event as it unfolds, using a suite of tools developed by a team of researchers from two universities.