Shivakumar Vaithyanathan

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Shivakumar Vaithyanathan is active.

Explore More

Publication

Featured researches published by Shivakumar Vaithyanathan.

empirical methods in natural language processing | 2002

Thumbs up? Sentiment Classification using Machine Learning Techniques

Bo Pang; Lillian Lee; Shivakumar Vaithyanathan

We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. However, the three machine learning methods we employed (Naive Bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. We conclude by examining factors that make the sentiment classification problem more challenging.

international conference on data engineering | 2011

SystemML: Declarative machine learning on MapReduce

Amol Ghoting; Rajasekar Krishnamurthy; Edwin P. D. Pednault; Berthold Reinwald; Vikas Sindhwani; Shirish Tatikonda; Yuanyuan Tian; Shivakumar Vaithyanathan

MapReduce is emerging as a generic parallel programming paradigm for large clusters of machines. This trend combined with the growing need to run machine learning (ML) algorithms on massive datasets has led to an increased interest in implementing ML algorithms on MapReduce. However, the cost of implementing a large class of ML algorithms as low-level MapReduce jobs on varying data and machine cluster sizes can be prohibitive. In this paper, we propose SystemML in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment. This higher-level language exposes several constructs including linear algebra primitives that constitute key building blocks for a broad class of supervised and unsupervised ML algorithms. The algorithms expressed in SystemML are compiled and optimized into a set of MapReduce jobs that can run on a cluster of machines. We describe and empirically evaluate a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation. We report an extensive performance evaluation on three ML algorithms on varying data and cluster sizes.

very large data bases | 2007

OLAP over uncertain and imprecise data

Douglas Burdick; Prasad M. Deshpande; T. S. Jayram; Raghu Ramakrishnan; Shivakumar Vaithyanathan

We extend the OLAP data model to represent data ambiguity, specifically imprecision and uncertainty, and introduce an allocation-based approach to the semantics of aggregation queries over such data. We identify three natural query properties and use them to shed light on alternative query semantics. While there is much work on representing and querying ambiguous data, to our knowledge this is the first paper to handle both imprecision and uncertainty in an OLAP setting.

international conference on management of data | 2009

SystemT: a system for declarative information extraction

Rajasekar Krishnamurthy; Yunyao Li; Sriram Raghavan; Frederick R. Reiss; Shivakumar Vaithyanathan; Huaiyu Zhu

As applications within and outside the enterprise encounter increasing volumes of unstructured data, there has been renewed interest in the area of information extraction (IE) -- the discipline concerned with extracting structured information from unstructured text. Classical IE techniques developed by the NLP community were based on cascading grammars and regular expressions. However, due to the inherent limitations of grammarbased extraction, these techniques are unable to: (i) scale to large data sets, and (ii) support the expressivity requirements of complex information tasks. At the IBM Almaden Research Center, we are developing SystemT, an IE system that addresses these limitations by adopting an algebraic approach. By leveraging well-understood database concepts such as declarative queries and costbased optimization, SystemT enables scalable execution of complex information extraction tasks. In this paper, we motivate the SystemT approach to information extraction. We describe our extraction algebra and demonstrate the effectiveness of our optimization techniques in providing orders of magnitude reduction in the running time of complex extraction tasks.

international acm sigir conference on research and development in information retrieval | 1997

Exploiting clustering and phrases for context-based information retrieval

Peter Anick; Shivakumar Vaithyanathan

This paper explores exploiting the synergy between document clustering and phrasal analysis for the purpose of automatically constructing a corrrex~-busedretrieval system. A contex~ consists of two components a cluster of logically related articles (its exrension) and a small set of salient concepts, represented by words and phrases and organized by the cluster’s key terms (its irr~ertsion). At inn-time, the system presents contexts that best match the result list of a user’s natural language query. The user can then choose a context and manipulate the intensionsd component to both browse the context’s extension and launch new searches over the entire database. We argue that the focused relevance feedback provided by contexts, at a level of abstraction higher than individual documents and lower than the database as a whole, provides a natural way for users to refine vague information needs and helps to blur the distinction between searching and browsing. The I%zraphrase interface, running over a database of business-related news articles, is used to illustrate the advantages of such a context-based retrieval paradigm.

international conference on data engineering | 2008

An Algebraic Approach to Rule-Based Information Extraction

Frederick R. Reiss; Sriram Raghavan; Rajasekar Krishnamurthy; Huaiyu Zhu; Shivakumar Vaithyanathan

Traditional approaches to rule-based information extraction (IE) have primarily been based on regular expression grammars. However, these grammar-based systems have difficulty scaling to large data sets and large numbers of rules. Inspired by traditional database research, we propose an algebraic approach to rule-based IE that addresses these scalability issues through query optimization. The operators of our algebra are motivated by our experience in building several rule-based extraction programs over diverse data sets. We present the operators of our algebra and propose several optimization strategies motivated by the text-specific characteristics of our operators. Finally we validate the potential benefits of our approach by extensive experiments over real-world blog data.

meeting of the association for computational linguistics | 2004

The Sentimental Factor: Improving Review Classification Via Human-Provided Information

Philip Beineke; Trevor Hastie; Shivakumar Vaithyanathan

Sentiment classification is the task of labeling a review document according to the polarity of its prevailing opinion (favorable or unfavorable). In approaching this problem, a model builder often has three sources of information available: a small collection of labeled documents, a large collection of unlabeled documents, and human understanding of language. Ideally, a learning method will utilize all three sources. To accomplish this goal, we generalize an existing procedure that uses the latter two.We extend this procedure by re-interpreting it as a Naive Bayes model for document sentiment. Viewed as such, it can also be seen to extract a pair of derived features that are linearly combined to predict sentiment. This perspective allows us to improve upon previous methods, primarily through two strategies: incorporating additional derived features into the model and, where possible, using labeled data to estimate their relative influence.

empirical methods in natural language processing | 2008

Regular Expression Learning for Information Extraction

Yunyao Li; Rajasekar Krishnamurthy; Sriram Raghavan; Shivakumar Vaithyanathan; H. V. Jagadish

Regular expressions have served as the dominant workhorse of practical information extraction for several years. However, there has been little work on reducing the manual effort involved in building high-quality, complex regular expressions for information extraction tasks. In this paper, we propose ReLIE, a novel transformation-based algorithm for learning such complex regular expressions. We evaluate the performance of our algorithm on multiple datasets and compare it against the CRF algorithm. We show that ReLIE, in addition to being an order of magnitude faster, outperforms CRF under conditions of limited training data and cross-domain data. Finally, we show how the accuracy of CRF can be improved by using features extracted by ReLIE.

international conference on management of data | 2006

Avatar semantic search: a database approach to information retrieval

Eser Kandogan; Rajasekar Krishnamurthy; Sriram Raghavan; Shivakumar Vaithyanathan; Huaiyu Zhu

We present Avatar Semantic Search, a prototype search engine that exploits annotations in the context of classical keyword search. The process of annotations is accomplished offline by using high-precision information extraction techniques to extract facts, con-cepts, and relationships from text. These facts and concepts are represented and indexed in a structured data store. At runtime, keyword queries are interpreted in the context of these extracted facts and converted into one or more precise queries over the structured store. In this demonstration we describe the overall architecture of the Avatar Semantic Search engine. We also demonstrate the superiority of the AVATAR approach over traditional keyword search engines using Enron email data set and a blog corpus.

international conference on management of data | 2006

Managing information extraction: state of the art and research directions

AnHai Doan; Raghu Ramakrishnan; Shivakumar Vaithyanathan

This tutorial makes the case for developing a unified framework that manages information extraction from unstructured data (focusing in particular on text). We first survey research on information extraction in the database, AI, NLP, IR, and Web communities in recent years. Then we discuss why this is the right time for the database community to actively participate and address the problem of managing information extraction (including in particular the challenges of maintaining and querying the extracted information, and accounting for the imprecision and uncertainty inherent in the extraction process). Finally, we show how interested researchers can take the next step, by pointing to open problems, available datasets, applicable standards, and software tools. We do not assume prior knowledge of text management, NLP, extraction techniques, or machine learning.

Explore More