Sofus A. Macskassy | Researchain


Publication

Featured research published by Sofus A. Macskassy.

analytics for noisy unstructured text data | 2010

Discovering users' topics of interest on twitter: a first look

Matthew Michelson; Sofus A. Macskassy

Twitter, a micro-blogging service, provides users with a framework for writing brief, often-noisy postings about their lives. These posts are called Tweets. In this paper we present early results on discovering Twitter users' topics of interest by examining the entities they mention in their Tweets. Our approach leverages a knowledge base to disambiguate and categorize the entities in the Tweets. We then develop a topic profile, which characterizes users' topics of interest, by discerning which categories appear frequently and cover the entities. We demonstrate that even in this early work we are able to successfully discover the main topics of interest for the users in our study.
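The topic-profile idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the knowledge base `TOY_KB`, the function `topic_profile`, and the substring matching are all hypothetical stand-ins, and the disambiguation step the abstract mentions is skipped entirely.

```python
from collections import Counter

# Hypothetical toy knowledge base mapping entity -> categories (assumption for illustration).
TOY_KB = {
    "lakers": ["Basketball", "Sports"],
    "kobe bryant": ["Basketball", "Sports"],
    "python": ["Programming"],
    "nba": ["Basketball", "Sports"],
}

def topic_profile(tweets, kb, top_k=2):
    """Return the top_k categories covering the known entities mentioned in the tweets."""
    counts = Counter()
    for tweet in tweets:
        text = tweet.lower()
        for entity, categories in kb.items():
            # Naive substring match; a real system would disambiguate entities first.
            if entity in text:
                counts.update(categories)
    return [category for category, _ in counts.most_common(top_k)]

tweets = [
    "Lakers won again tonight!",
    "Kobe Bryant is unreal",
    "Debugging Python all day",
]
profile = topic_profile(tweets, TOY_KB)
print(profile)
```

Counting categories rather than raw entities is what lets a handful of distinct mentions collapse into one frequent topic.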

international conference on machine learning | 2005

ROC confidence bands: an empirical evaluation

Sofus A. Macskassy; Foster Provost; Saharon Rosset

This paper is about constructing confidence bands around ROC curves. We first introduce to the machine learning community three band-generating methods from the medical field, and evaluate how well they perform. Such confidence bands represent the region where the true ROC curve is expected to reside, with the designated confidence level. To assess the containment of the bands we begin with a synthetic world where we know the true ROC curve---specifically, where the class-conditional model scores are normally distributed. The only method that attains reasonable containment out-of-the-box produces non-parametric, fixed-width bands (FWBs). Next we move to a context more appropriate for machine learning evaluations: bands that with a certain confidence level will bound the performance of the model on future data. We introduce a correction to account for the larger uncertainty, and the widened FWBs continue to have reasonable containment. Finally, we assess the bands on 10 relatively large benchmark data sets. We conclude by recommending these FWBs, noting that being non-parametric they are especially attractive for machine learning studies, where the score distributions (1) clearly are not normal, and (2) even for the same data set vary substantially from learning method to learning method.

knowledge discovery and data mining | 2009

Using graph-based metrics with empirical risk minimization to speed up active learning on networked data

Sofus A. Macskassy

Active and semi-supervised learning are important techniques when labeled data are scarce. Recently a method was suggested for combining active learning with a semi-supervised learning algorithm that uses Gaussian fields and harmonic functions. This classifier is relational in nature: it relies on having the data presented as a partially labeled graph (also known as a within-network learning problem). This work showed yet again that empirical risk minimization (ERM) was the best method to find the next instance to label and provided an efficient way to compute ERM with the semi-supervised classifier. The computational problem with ERM is that it relies on computing the risk for all possible instances. If we could limit the candidates that should be investigated, then we could speed up active learning considerably. In the case where the data is graphical in nature, we can leverage the graph structure to rapidly identify instances that are likely to be good candidates for labeling. This paper describes a novel hybrid approach of using community finding and social network analytic centrality measures to identify good candidates for labeling and then using ERM to find the best instance in this candidate set. We show on real-world data that we can limit the ERM computations to a fraction of instances with comparable performance.

international acm sigir conference on research and development in information retrieval | 2001

Intelligent information triage

Sofus A. Macskassy; Foster Provost

In many applications, large volumes of time-sensitive textual information require triage: rapid, approximate prioritization for subsequent action. In this paper, we explore the use of prospective indications of the importance of a time-sensitive document, for the purpose of producing better document filtering or ranking. By prospective, we mean importance that could be assessed by actions that occur in the future. For example, a news story may be assessed (retrospectively) as being important, based on events that occurred after the story appeared, such as a stock price plummeting or the issuance of many follow-up stories. If a system could anticipate (prospectively) such occurrences, it could provide a timely indication of importance. Clearly, perfect prescience is impossible. However, sometimes there is sufficient correlation between the content of an information item and the events that occur subsequently. We describe a process for creating and evaluating approximate information-triage procedures that are based on prospective indications. Unlike many information-retrieval applications for which document labeling is a laborious, manual process, for many prospective criteria it is possible to build very large, labeled, training corpora automatically. Such corpora can be used to train text classification procedures that will predict the (prospective) importance of each document. This paper illustrates the process with two case studies, demonstrating the ability to predict whether a news story will be followed by many, very similar news stories, and also whether the stock price of one or more companies associated with a news story will move significantly following the appearance of that story. We conclude by discussing how the comprehensibility of the learned classifiers can be critical to success.

international conference on machine learning | 2006

A brief survey of machine learning methods for classification in networked data and an application to suspicion scoring

Sofus A. Macskassy; Foster Provost

This paper surveys work from the field of machine learning on the problem of within-network learning and inference. To give motivation and context to the rest of the survey, we start by presenting some (published) applications of within-network inference. After a brief formulation of this problem and a discussion of probabilistic inference in arbitrary networks, we survey machine learning work applied to networked data, along with some important predecessors--mostly from the statistics and pattern recognition literature. We then describe an application of within-network inference in the domain of suspicion scoring in social networks. We close the paper with pointers to toolkits and benchmark data sets used in machine learning research on classification in network data. We hope that such a survey will be a useful resource to workshop participants, and perhaps will be complemented by others.

Artificial Intelligence | 2003

Converting numerical classification into text classification

Sofus A. Macskassy; Haym Hirsh; Arunava Banerjee; Aynur A. Dayanik

Consider a supervised learning problem in which examples contain both numerical- and text-valued features. To use traditional feature-vector-based learning methods, one could treat the presence or absence of a word as a Boolean feature and use these binary-valued features together with the numerical features. However, the use of a text-classification system on this is a bit more problematic: in the most straightforward approach each number would be considered a distinct token and treated as a word. This paper presents an alternative approach for the use of text classification methods for supervised learning problems with numerical-valued features in which the numerical features are converted into bag-of-words features, thereby making them directly usable by text classification methods. We show that even on purely numerical-valued data the results of text classification on the derived text-like representation outperform the more naive numbers-as-tokens representation and, more importantly, are competitive with mature numerical classification methods such as C4.5, Ripper, and SVM. We further show that on mixed-mode data adding numerical features using our approach can improve performance over not adding those features.
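One way to picture converting numerical features into bag-of-words features is the sketch below. The threshold-based encoding, the function `numeric_to_tokens`, and the example cut points are assumptions for illustration; the abstract does not specify the paper's actual encoding or how split points are chosen.

```python
def numeric_to_tokens(example, thresholds):
    """Turn {feature: value} into word-like tokens, one token per threshold.

    Each token records which side of a cut point the value falls on, so nearby
    values share most of their tokens (unlike a numbers-as-tokens encoding).
    """
    tokens = []
    for name, value in example.items():
        for i, cut in enumerate(thresholds[name]):
            side = "le" if value <= cut else "gt"
            tokens.append(f"{name}_{side}_{i}")
    return tokens

# Hypothetical cut points; in practice these would be derived from training data.
thresholds = {"age": [30, 50], "income": [40000]}
tokens = numeric_to_tokens({"age": 42, "income": 35000}, thresholds)
print(tokens)
```

The resulting token list can be fed to any text classifier alongside the words of the text-valued features.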

Social Network Analysis and Mining | 2011

Contextual linking behavior of bloggers: leveraging text mining to enable topic-based analysis

Sofus A. Macskassy

The last decade has seen an explosion in blogging and the blogosphere is continuing to grow, having a large global reach and many vibrant communities. Researchers have been poring over blog data with the goal of finding communities, tracking what people are saying, finding influencers, and using many social network analytic tools to analyze the underlying social networks embedded within the blogosphere. One of the key technical problems with analyzing large social networks such as those embedded in the blogosphere is that there are many links between individuals and we often do not know the context or meaning of those links. This is problematic because it makes it difficult if not impossible to tease out the true communities, their behavior, how information flows, and who the central players are (if any). This paper seeks to further our understanding of how to analyze large blog networks and what they can tell us. We analyze 1.13M blogs posted by 185K bloggers over a period of 3 weeks. These bloggers span private blog sites through large blog sites such as LiveJournal and Blogger. We show that we can, in fact, tag links in meaningful ways by leveraging topic-detection over the blogs themselves. We use these topics to contextually tag links coming from a particular blog post. This enrichment enables us to create smaller topic-specific graphs which we can analyze in some depth. We show that these topic-specific graphs not only have a different topology from the general blog graph but also enable us to find central bloggers who were otherwise hard to find. We further show that a temporal analysis identifies behaviors in terms of how components form as well as how bloggers continue to link after components form. These behaviors come to light when doing an analysis on the topic-specific graphs but are hidden or not easily discernible when analyzing the general blog graph.

advances in social networks analysis and mining | 2010

Leveraging Contextual Information to Explore Posting and Linking Behaviors of Bloggers

Sofus A. Macskassy

The last decade has seen an explosion in blogging and the blogosphere is continuing to grow, having a large global reach and many vibrant communities. Researchers have been poring over blog data with the goal of finding communities, tracking what people are saying, finding influencers, and using many social network analytic tools to analyze the underlying social networks embedded within the blogosphere. One of the key technical problems with analyzing large social networks such as those embedded in the blogosphere is that there are many links between individuals and we often do not know the context or meaning of those links. This is problematic because it makes it difficult if not impossible to tease out the true communities, their behavior, how information flows, and who the central players are (if any). This paper seeks to further our understanding of how to analyze large blog networks and what they can tell us. We analyze 1.24M blogs posted by 298K bloggers over a period of three weeks. These bloggers span private blog sites through large blog sites such as LiveJournal and Blogspot. We first characterize the behavior of bloggers, validating some (but not all) common beliefs about how often bloggers post, how long their posts are, who they link to, and how much reciprocity there is in links. We then take a look at bloggers from the larger blog sites to understand whether and how they differ in terms of these metrics. Finally, we extend our analysis to focus on contextual links: what is the textual content of the blog post that contained a link? We identify topics from the textual content of all the blog posts and use these to tag links based on the topics that were discussed in the blog.

web search and data mining | 2011

What blogs tell us about websites: a demographics study

Matthew Michelson; Sofus A. Macskassy

One challenge for content providers on the Web is determining who consumes their content. For instance, online newspapers want to know who is reading their articles. Previous approaches have tried to determine such audience demographics by placing cookies on users' systems, or by directly asking consumers (e.g., through surveys). The first approach may make users uncomfortable, and the second is not scalable. In this paper we focus on determining the demographics of a Website's audience by analyzing the blogs that link to the Website. We analyze both the text of the blogs and the network connectivity of the blog network to determine demographics such as whether a person is married or has pets. Presumably bloggers linking to sites also consume the content of those sites. Therefore, the discovered demographics for the bloggers can be used to represent a proxy set of demographics for a subset of the Website's consumers. We demonstrate that in many cases we can infer sub-audiences for a site from these demographics. Further, this feasibility demonstrates that very specific demographics for sites can be generated as we improve the methods for determining them (e.g., finding people who play video games). In our study we analyze blogs from more than 590,000 bloggers, collected over a six-month period, that link to more than 488,000 distinct, external websites.

international conference on machine learning and applications | 2011

Relational Classifiers in a Non-relational World: Using Homophily to Create Relations

Sofus A. Macskassy

Research in the past decade on statistical relational learning (SRL) has shown the power of the underlying network of relations in relational data. Even models built using only relations often perform comparably to models built using sophisticated relational learning methods. However, many data sets -- such as those in the UCI machine learning repository -- contain no relations. In fact, many data sets either do not contain relations or have relations which are not helpful to a specific classification task. The question we investigate in this paper is whether it is possible to construct relations such that relational inference results in better classification performance than non-relational inference. Using simple similarity-based rules to create relations and weighting the strength of these relations using homophily on instance labels, we test whether relational inference techniques are applicable -- in other words, do they perform comparably to standard machine learning algorithms. We show, in an experimental study on 31 UCI benchmark data sets, that relational inference wins more than any of the 6 classifiers we compare against, including a transductive SVM, and that it wins the majority of the time when compared against any one of them.
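The general idea of creating relations in non-relational data and then applying relational inference can be sketched as follows. This is only an illustration under stated assumptions: the k-nearest-neighbour rule, the `1 / (1 + distance)` edge weight, and the names `build_edges` and `weighted_vote` are hypothetical, and the sketch does not reproduce the homophily-based edge weighting described in the abstract.

```python
import math
from collections import defaultdict

def build_edges(points, k=2):
    """Link each instance to its k nearest neighbours; weight = 1 / (1 + distance)."""
    edges = defaultdict(dict)
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            w = 1.0 / (1.0 + d)
            edges[i][j] = w  # store the edge in both directions (undirected graph)
            edges[j][i] = w
    return edges

def weighted_vote(node, edges, labels):
    """Predict a label as the weighted majority among the node's labeled neighbours."""
    scores = defaultdict(float)
    for nbr, w in edges[node].items():
        if nbr in labels:
            scores[labels[nbr]] += w
    return max(scores, key=scores.get) if scores else None

# Two well-separated toy clusters; instances 2 and 5 are unlabeled.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 5.2)]
labels = {0: "a", 1: "a", 3: "b", 4: "b"}
edges = build_edges(points, k=2)
pred_2 = weighted_vote(2, edges, labels)
pred_5 = weighted_vote(5, edges, labels)
print(pred_2, pred_5)
```

Once such a graph exists, any within-network classifier can be run on it in place of the simple vote used here.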
