Sandeep Tata
IBM
Publication
Featured research published by Sandeep Tata.
Very Large Data Bases | 2011
Jun Rao; Eugene J. Shekita; Sandeep Tata
Spinnaker is an experimental datastore that is designed to run on a large cluster of commodity servers in a single datacenter. It features key-based range partitioning, 3-way replication, and a transactional get-put API with the option to choose either strong or timeline consistency on reads. This paper describes Spinnaker's Paxos-based replication protocol. The use of Paxos ensures that a data partition in Spinnaker will be available for reads and writes as long as a majority of its replicas are alive. Unlike traditional master-slave replication, this is true regardless of the failure sequence that occurs. We show that Paxos replication can be competitive with alternatives that provide weaker consistency guarantees. Compared to an eventually consistent datastore, we show that Spinnaker can be as fast or even faster on reads and only 5% to 10% slower on writes.
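A minimal sketch of what a Spinnaker-style get/put API with a per-read consistency choice might look like, assuming a single three-way-replicated partition. The class and method names are invented, writes are applied to local maps rather than through a real Paxos-ordered log, and the leader is fixed at replica 0; this is only an illustration of the API shape described above.

```java
import java.util.*;

// Illustrative sketch, not Spinnaker's actual code: one partition replicated
// on three in-process "replicas", with strong or timeline reads.
public class QuorumStoreSketch {
    enum ReadConsistency { STRONG, TIMELINE }

    // Replica 0 plays the leader role in this toy setup.
    private final List<Map<String, String>> replicas =
            Arrays.asList(new HashMap<>(), new HashMap<>(), new HashMap<>());

    // A write succeeds only if a majority of replicas (2 of 3) applied it;
    // here every replica is a local map, so in practice all three do.
    public boolean put(String key, String value) {
        int acks = 0;
        for (Map<String, String> replica : replicas) {
            replica.put(key, value);   // in the real system: a Paxos-ordered log append
            acks++;
        }
        return acks >= replicas.size() / 2 + 1;
    }

    // STRONG reads go to the leader; TIMELINE reads may be served by any
    // (possibly slightly stale) replica.
    public String get(String key, ReadConsistency level) {
        if (level == ReadConsistency.STRONG) {
            return replicas.get(0).get(key);
        }
        int any = new Random().nextInt(replicas.size());
        return replicas.get(any).get(key);
    }

    public static void main(String[] args) {
        QuorumStoreSketch store = new QuorumStoreSketch();
        store.put("user:42", "alice");
        System.out.println(store.get("user:42", ReadConsistency.STRONG));
    }
}
```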
Very Large Data Bases | 2011
Avrilia Floratou; Jignesh M. Patel; Eugene J. Shekita; Sandeep Tata
Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a MapReduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming APIs. We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can incur significant CPU overhead. Finally, we introduce a novel skip list column format and lazy record construction strategy that avoids deserializing unwanted records to provide an additional 1.5x performance boost. Experiments on a real intranet crawl are used to show that our column-oriented storage techniques can improve the performance of the map phase in Hadoop by as much as two orders of magnitude.
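The lazy record construction idea can be illustrated with a small sketch. This is not the paper's actual file format or reader code; the class, column names, and single-record layout are invented. The point is only that each column is held as raw bytes and deserialized the first time the map function touches it, so unread columns cost nothing.

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

// Illustrative sketch of lazy record construction over column-oriented storage.
public class LazyColumnRecord {
    private final Map<String, byte[]> rawColumns;      // column name -> serialized bytes
    private final Map<String, String> decoded = new HashMap<>();

    public LazyColumnRecord(Map<String, byte[]> rawColumns) {
        this.rawColumns = rawColumns;
    }

    // Deserialization cost is paid only for columns the map function reads.
    public String getColumn(String name) {
        return decoded.computeIfAbsent(name,
                n -> new String(rawColumns.get(n), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Map<String, byte[]> row = new HashMap<>();
        row.put("url", "http://example.com".getBytes(StandardCharsets.UTF_8));
        row.put("body", "a very large payload ...".getBytes(StandardCharsets.UTF_8));

        LazyColumnRecord record = new LazyColumnRecord(row);
        // A job that filters only on "url" never pays to decode "body".
        System.out.println(record.getColumn("url"));
    }
}
```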
International Conference on Management of Data | 2008
Sandeep Tata; Guy M. Lohman
Today's enterprise databases are large and complex, often relating hundreds of entities. Enabling ordinary users to query such databases and derive value from them has been of great interest in database research. Today, keyword search over relational databases allows users to find pieces of information without having to write complicated SQL queries. However, in order to compute even simple aggregates, a user is required to write a SQL statement and can no longer use simple keywords. This not only requires the ordinary user to learn SQL, but also to learn the schema of the complex database in detail in order to correctly construct the required query. This greatly limits the options of the user who wishes to examine a database in more depth. As a solution to this problem, we propose a framework called SQAK (SQL Aggregates using Keywords) that enables users to pose aggregate queries using simple keywords with little or no knowledge of the schema. SQAK provides a novel and exciting way to trade off some of the expressive power of SQL in exchange for the ability to express a large class of aggregate queries using simple keywords. SQAK accomplishes this by taking advantage of the data in the database and the schema (tables, attributes, keys, and referential constraints). SQAK does not require any changes to the database engine and can be used with any existing database. We demonstrate using several experiments that SQAK is effective and can be an enormously powerful tool for ordinary users.
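The kind of keyword-to-aggregate translation SQAK performs can be illustrated with a toy example. The schema, keyword query, and matching rule below are invented for illustration, and the join is hard-coded; SQAK's actual algorithm matches keywords against the real schema and scores join paths through keys and referential constraints.

```java
import java.util.*;

// Toy illustration of turning simple keywords into an aggregate SQL query,
// in the spirit of (but much simpler than) SQAK.
public class KeywordAggregateSketch {

    public static void main(String[] args) {
        // Hypothetical schema: employee(name, salary, dept_id), department(id, name)
        String keywords = "department total salary";
        System.out.println(translate(keywords));
    }

    static String translate(String keywords) {
        List<String> tokens = Arrays.asList(keywords.toLowerCase().split("\\s+"));
        String aggregate = tokens.contains("total") ? "SUM" : "COUNT";
        // A real system infers the join path; here it is fixed for the toy schema.
        return "SELECT d.name, " + aggregate + "(e.salary)\n" +
               "FROM department d JOIN employee e ON e.dept_id = d.id\n" +
               "GROUP BY d.name";
    }
}
```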
IEEE Transactions on Knowledge and Data Engineering | 2011
Avrilia Floratou; Sandeep Tata; Jignesh M. Patel
Existing sequence mining algorithms mostly focus on mining for subsequences. However, a large class of applications, such as biological DNA and protein motif mining, require efficient mining of “approximate” patterns that are contiguous. The few existing algorithms that can be applied to such contiguous approximate pattern mining have drawbacks like poor scalability, lack of guarantees in finding the pattern, and difficulty in adapting to other applications. In this paper, we present a new algorithm called FLexible and Accurate Motif DEtector (FLAME). FLAME is a flexible suffix-tree-based algorithm that can be used to find frequent patterns with a variety of definitions of motif (pattern) models. It is also accurate, as it always finds the pattern if it exists. Using both real and synthetic data sets, we demonstrate that FLAME is fast, scalable, and outperforms existing algorithms on a variety of performance metrics. In addition, based on FLAME, we also address a more general problem, named extended structured motif extraction, which allows mining frequent combinations of motifs under relaxed constraints.
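As a point of reference for the problem being solved, the sketch below is a brute-force baseline that counts contiguous occurrences of a motif within a given number of mismatches (Hamming distance). FLAME itself uses a suffix tree to avoid this kind of exhaustive scan; the function here only illustrates the problem definition, and the input strings are made up.

```java
// Naive baseline for contiguous approximate motif counting.
public class ApproximateMotifCount {

    static int countApproximateOccurrences(String text, String motif, int maxMismatches) {
        int count = 0;
        for (int start = 0; start + motif.length() <= text.length(); start++) {
            int mismatches = 0;
            for (int i = 0; i < motif.length() && mismatches <= maxMismatches; i++) {
                if (text.charAt(start + i) != motif.charAt(i)) {
                    mismatches++;
                }
            }
            if (mismatches <= maxMismatches) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "ACGA" is within one mismatch of "ACGT", so it counts when maxMismatches = 1.
        System.out.println(countApproximateOccurrences("ACGTACGAACGT", "ACGT", 1));
    }
}
```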
Extending Database Technology | 2012
Tim Kaldewey; Eugene J. Shekita; Sandeep Tata
MapReduce has emerged as a promising architecture for large scale data analytics on commodity clusters. The rapid adoption of Hive, a SQL-like data processing language on Hadoop (an open source implementation of MapReduce), shows the increasing importance of processing structured data on MapReduce platforms. MapReduce offers several attractive properties such as the use of low-cost hardware, fault-tolerance, scalability, and elasticity. However, these advantages have required a substantial performance sacrifice. In this paper we introduce Clydesdale, a novel system for structured data processing on Hadoop -- a popular implementation of MapReduce. We show that Clydesdale provides more than an order of magnitude in performance improvements compared to existing approaches without requiring any changes to the underlying platform. Clydesdale is aimed at workloads where the data fits a star schema. It draws on column-oriented storage, tailored join plans, and multi-core execution strategies and carefully fits them into the constraints of a typical MapReduce platform. Using the star schema benchmark, we show that Clydesdale is on average 38x faster than Hive. This demonstrates that MapReduce in general, and Hadoop in particular, is a far more compelling platform for structured data processing than previous results suggest.
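The core join idea for star schemas can be sketched as a map-side hash join: small dimension tables are replicated into in-memory hash tables on each worker, and the large fact table is streamed through them without a shuffle. The table contents and names below are invented and the whole thing runs in one process; Clydesdale's real plans additionally rely on columnar storage and multi-core execution as described above.

```java
import java.util.*;

// Sketch of a map-side star join over invented data.
public class StarJoinSketch {

    public static void main(String[] args) {
        // Dimension tables, assumed small enough to replicate to every map task.
        Map<Integer, String> dateDim = new HashMap<>();
        dateDim.put(20110101, "2011-01-01");
        Map<Integer, String> customerDim = new HashMap<>();
        customerDim.put(7, "ACME Corp");

        // Fact rows: {dateKey, customerKey, revenue}, streamed record by record.
        int[][] factRows = { {20110101, 7, 500}, {20110101, 9, 120} };

        long totalRevenue = 0;
        for (int[] fact : factRows) {
            String date = dateDim.get(fact[0]);
            String customer = customerDim.get(fact[1]);
            // Rows that fail a dimension lookup are filtered out in the map, no shuffle needed.
            if (date != null && customer != null) {
                totalRevenue += fact[2];
            }
        }
        System.out.println("Revenue for matching rows: " + totalRevenue);
    }
}
```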
Cloud Data Management | 2009
Ning Li; Jun Rao; Eugene J. Shekita; Sandeep Tata
Many content-oriented applications require a scalable text index. Building such an index is challenging. In addition to the logic of inserting and searching documents, developers have to worry about issues in a typical distributed environment, such as fault tolerance, incrementally growing the index cluster, and load balancing. We developed a distributed text index called HIndex, by judiciously exploiting the control layer of HBase, which is an open source implementation of Google's Bigtable. Such leverage enables us to inherit the support for availability, elasticity, and load balancing in HBase. We present the design, implementation, and a performance evaluation of HIndex in this paper.
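The underlying structure of a text index of this kind is an inverted index: each term maps to a posting list of document identifiers. The sketch below keeps that structure in plain in-memory maps in a single process; HIndex itself partitions and persists this state through HBase's control layer, and all names here are illustrative rather than taken from the system.

```java
import java.util.*;

// Minimal in-memory sketch of the term -> posting-list structure a distributed
// text index maintains; not HIndex's actual implementation.
public class InvertedIndexSketch {
    private final Map<String, SortedSet<String>> postings = new HashMap<>();

    public void insertDocument(String docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    public Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.insertDocument("doc1", "scalable text index on HBase");
        index.insertDocument("doc2", "distributed text search");
        System.out.println(index.search("text"));   // prints [doc1, doc2]
    }
}
```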
International Conference on Data Engineering | 2010
Avrilia Floratou; Sandeep Tata; Jignesh M. Patel
Existing sequence mining algorithms mostly focus on mining for subsequences. However, a large class of applications, such as biological DNA and protein motif mining, require efficient mining of “approximate” patterns that are contiguous. The few existing algorithms that can be applied to such contiguous approximate pattern mining have drawbacks like poor scalability, lack of guarantees in finding the pattern, and difficulty in adapting to other applications. In this paper, we present a new algorithm called FLAME (FLexible and Accurate Motif DEtector). FLAME is a flexible suffix-tree-based algorithm that can be used to find frequent patterns with a variety of definitions of motif (pattern) models. It is also accurate, as it always finds the pattern if it exists. Using both real and synthetic datasets, we demonstrate that FLAME is fast, scalable, and outperforms existing algorithms on a variety of performance metrics. Using FLAME, it is now possible to mine datasets that would have been prohibitively difficult with existing tools.
International Conference on Management of Data | 2012
Andrey Balmin; Tim Kaldewey; Sandeep Tata
There have been several recent proposals modifying Hadoop, radically changing the storage organization or query processing techniques to obtain good performance for structured data processing. We will showcase Clydesdale, a research prototype for structured data processing on Hadoop that can achieve dramatic performance improvements over existing solutions, without any changes to the underlying MapReduce implementation. Clydesdale achieves this through a novel synthesis of several techniques from the database literature, carefully adapted to the Hadoop environment. On the star schema benchmark, we show that Clydesdale is on average 38x faster than Hive, the dominant approach for structured data processing on Hadoop today. To the best of our knowledge, Clydesdale is the fastest solution for processing workloads on structured data sets that fit a star schema on Hadoop. Attendees will be able to run queries on the data from the star schema benchmark on a remote Hadoop cluster with Clydesdale and Hive installed, and get a breakdown of the time taken to execute the query. Attendees will also be able to pose their own queries using ClyQL -- a novel embedded DSL in Scala that can be used to rapidly prototype star join queries. With this demonstration, we hope to convince the attendees that, contrary to what was previously thought, Hadoop can indeed efficiently support structured data processing.
IBM Journal of Research and Development | 2013
Liana L. Fong; Yuqing Gao; Xavier R. Guerin; Yonggang Liu; T. Salo; Seetharami R. Seelam; Wei Tan; Sandeep Tata
Emerging transactional workloads from Internet and mobile commerce require low-latency, massive-scale, and integrated data analytics to enhance user experience and to improve up-selling opportunities. These analytics require new application platforms that must be able to absorb large volumes of data, provide low-latency access to the data, and cache data objects to improve access times in distributed environments. This paper reports on recent technologies built at IBM Research to address challenges in data access latency, data ingestion, and caching in the exemplary context of an online product recommendation application. We describe three technologies related to the issues and optimizations of key-value data object storage and access. First, we describe the architecture of a global secondary index to greatly improve data access latency of Hadoop™ Database (HBase™), an open-source key-value distributed data store. Second, we present an in-memory write-ahead log feature on HBase that significantly improves write operations for high-volume data ingestion. Third, we detail an innovative distributed caching system that exploits low-latency interconnects to use hash maps of data keys on each server for local lookup, while the data resides on, and is accessed across, clustered systems. The distributed cache can achieve a 100- to 1,000-fold performance gain over many caching methods. These technologies together form some necessary building blocks for a next-generation data-centric middleware for integrated transaction and analytic workloads.
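The first of these ideas, the global secondary index, can be sketched in miniature: alongside the primary table, an index table maps a secondary attribute back to the primary keys, so lookups by that attribute avoid scanning every row. The sketch below uses plain in-memory maps and invented attribute names; the system described in the paper keeps both tables in HBase and has to keep them consistent under concurrent writes, which this toy version ignores.

```java
import java.util.*;

// Illustrative sketch of a secondary index kept next to a primary key-value table.
public class SecondaryIndexSketch {
    private final Map<String, Map<String, String>> primary = new HashMap<>(); // userId -> row
    private final Map<String, Set<String>> emailIndex = new HashMap<>();      // email  -> userIds

    public void put(String userId, String email, String name) {
        Map<String, String> row = new HashMap<>();
        row.put("email", email);
        row.put("name", name);
        primary.put(userId, row);
        emailIndex.computeIfAbsent(email, e -> new HashSet<>()).add(userId);
    }

    // Point lookup by a non-key attribute goes through the index instead of a scan.
    public List<Map<String, String>> getByEmail(String email) {
        List<Map<String, String>> result = new ArrayList<>();
        for (String userId : emailIndex.getOrDefault(email, Collections.emptySet())) {
            result.add(primary.get(userId));
        }
        return result;
    }

    public static void main(String[] args) {
        SecondaryIndexSketch store = new SecondaryIndexSketch();
        store.put("u1", "a@example.com", "Alice");
        System.out.println(store.getByEmail("a@example.com"));
    }
}
```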
International Conference on Data Engineering | 2008
Sandeep Tata; Lin Qiao; Guy M. Lohman
Modern enterprises often deploy multiple databases from different vendors. Managing a heterogeneous mix of databases is a very challenging exercise. To help the DBA tackle this complex administrative task, major database vendors have provided many autonomic tools. These tools help automate common management tasks and even help in performance tuning. However, the DBA now has to face the complexity of dealing with a variety of different tools for different tasks, each with different interfaces and capabilities. The application developer also suffers from a similar problem when trying to use different tools that help them develop applications for different databases. Clearly, there is a need for a common interface to manage heterogeneous databases and develop enterprise-class applications for them. To address this need, we argue for a client-based approach to developing database tools. In particular, we take the case of an index advisor (a tool provided by all three major database vendors) and show that a client-based approach can a) provide a uniform interface for the DBA or the application developer to use across all the database deployments, b) provide flexibility in the number and kinds of scenarios in which it can be used, and finally c) reduce the total cost of ownership for the enterprise.
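The uniform interface argued for here might look roughly like the sketch below: a single advisor API that the DBA or developer calls from the client, with vendor-specific logic hidden behind each implementation. The interface, class names, and the deliberately trivial recommendation rule are all hypothetical and are not the paper's actual design; a real advisor would drive each database's what-if facilities and optimizer cost estimates.

```java
import java.util.*;

// Hypothetical client-side advisor interface; one implementation per database vendor.
interface IndexAdvisor {
    // Given a workload of SQL statements, return recommended CREATE INDEX statements.
    List<String> recommendIndexes(List<String> workload);
}

// Toy implementation: "recommend" an index on any column used in an equality predicate.
class ToyAdvisor implements IndexAdvisor {
    @Override
    public List<String> recommendIndexes(List<String> workload) {
        List<String> recommendations = new ArrayList<>();
        for (String sql : workload) {
            int where = sql.toUpperCase().indexOf("WHERE ");
            if (where >= 0 && sql.indexOf('=', where) > 0) {
                String column = sql.substring(where + 6, sql.indexOf('=', where)).trim();
                recommendations.add("CREATE INDEX idx_" + column + " ON orders(" + column + ")");
            }
        }
        return recommendations;
    }
}

public class ClientSideAdvisorDemo {
    public static void main(String[] args) {
        IndexAdvisor advisor = new ToyAdvisor();
        System.out.println(advisor.recommendIndexes(
                Arrays.asList("SELECT * FROM orders WHERE customer_id = 42")));
    }
}
```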