
Publication


Featured research published by Raghu Ramakrishnan.


Communications of the ACM | 2014

Big data and its technical challenges

H. V. Jagadish; Johannes Gehrke; Alexandros Labrinidis; Yannis Papakonstantinou; Jignesh M. Patel; Raghu Ramakrishnan; Cyrus Shahabi

Exploring the inherent technical challenges in realizing the potential of Big Data.


Symposium on Cloud Computing | 2012

Sailfish: a framework for large scale data processing

Sriram Rao; Raghu Ramakrishnan; Adam Silberstein; Michael Ovsiannikov; Damian Reeves

In this paper, we present Sailfish, a new Map-Reduce framework for large scale data processing. The Sailfish design is centered around aggregating intermediate data, specifically data produced by map tasks and consumed later by reduce tasks, to improve performance by batching disk I/O. We introduce an abstraction called I-files for supporting data aggregation, and describe how we implemented it as an extension of the distributed filesystem, to efficiently batch data written by multiple writers and read by multiple readers. Sailfish adapts the Map-Reduce layer in Hadoop to use I-files for transporting data from map tasks to reduce tasks. We present experimental results demonstrating that Sailfish improves the performance of standard Hadoop; in particular, we show 20% to 5 times faster performance on a representative mix of real jobs and datasets at Yahoo!. We also demonstrate that the Sailfish design enables auto-tuning functionality that handles changes in data volume and skewed distributions effectively, thereby addressing an important practical drawback of Hadoop, which in contrast relies on programmers to configure system parameters appropriately for each job, for each input dataset. Our Sailfish implementation and the other software components developed as part of this paper have been released as open source.
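The core idea of the abstract, batching many map writers' intermediate records into per-partition aggregate files that a reducer can scan sequentially, can be sketched in a few lines. This is a toy in-memory illustration under assumed names (`IFile`, `run_map_phase` are hypothetical, not the paper's API); the real I-files are a distributed-filesystem extension that batches disk I/O.

```python
from collections import defaultdict

class IFile:
    """Toy stand-in for an I-file: an append-only, multi-writer log of
    intermediate records for one reduce partition."""
    def __init__(self, partition):
        self.partition = partition
        self.chunks = []                # each chunk is one batched append

    def append(self, records):          # a map task appends a whole batch
        self.chunks.append(list(records))

    def scan(self):                     # the reduce task reads sequentially
        for chunk in self.chunks:
            yield from chunk

def run_map_phase(inputs, mapper, num_partitions):
    ifiles = [IFile(p) for p in range(num_partitions)]
    for split in inputs:
        batches = defaultdict(list)     # buffer records per target partition
        for key, value in mapper(split):
            batches[hash(key) % num_partitions].append((key, value))
        for p, batch in batches.items():  # one batched append per partition
            ifiles[p].append(batch)
    return ifiles

# Word count: the mapper emits (word, 1); the "reduce side" sums per key.
def mapper(line):
    for word in line.split():
        yield word, 1

ifiles = run_map_phase(["a b a", "b c"], mapper, num_partitions=2)
counts = defaultdict(int)
for f in ifiles:
    for k, v in f.scan():
        counts[k] += v
```

The point of the batching is that each map task issues one large append per partition instead of many small writes, which is where the disk-I/O savings come from.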


Symposium on Cloud Computing | 2014

Reservation-based Scheduling: If You're Late Don't Blame Us!

Carlo Curino; Djellel Eddine Difallah; Chris Douglas; Subru Krishnan; Raghu Ramakrishnan; Sriram Rao

The continuous shift towards data-driven approaches to business, and growing attention to improving the return on investment (ROI) for cluster infrastructures, is generating new challenges for big-data frameworks. Systems originally designed for big batch jobs now handle an increasingly complex mix of computations. Moreover, they are expected to guarantee stringent SLAs for production jobs and minimize latency for best-effort jobs. In this paper, we introduce reservation-based scheduling, a new approach to this problem. We develop our solution around four key contributions: 1) we propose a reservation definition language (RDL) that allows users to declaratively reserve access to cluster resources, 2) we formalize planning of current and future cluster resources as a Mixed-Integer Linear Programming (MILP) problem, and propose scalable heuristics, 3) we adaptively distribute resources between production jobs and best-effort jobs, and 4) we integrate all of this in a scalable system named Rayon, that builds upon Hadoop/YARN. We evaluate Rayon on a 256-node cluster against workloads derived from Microsoft, Yahoo!, Facebook, and Cloudera clusters. To enable practical use of Rayon, we open-sourced our implementation as part of Apache Hadoop 2.6.
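To make the reservation-planning idea concrete, here is a minimal sketch, assuming a simplified reservation shape (total container-slots needed, a parallelism cap, and a deadline) and a greedy latest-first placement heuristic. The `Reservation` fields and `plan` function are illustrative inventions, not RDL or Rayon's MILP formulation, which handles far richer constraints.

```python
from dataclasses import dataclass

@dataclass
class Reservation:          # hypothetical mini-reservation, not real RDL
    name: str
    start: int              # earliest usable time slot
    deadline: int           # slot by which all work must finish (exclusive)
    slots_needed: int       # total containers x time-slots required
    max_parallel: int       # cap on containers in any one slot

def plan(reservations, capacity, horizon):
    """Greedy stand-in for the MILP planner: place each reservation's
    demand as late as possible before its deadline, respecting capacity."""
    timeline = [0] * horizon                 # containers committed per slot
    allocation = {}
    for r in sorted(reservations, key=lambda r: r.deadline):
        remaining, alloc = r.slots_needed, [0] * horizon
        for t in range(r.deadline - 1, r.start - 1, -1):  # latest-first
            take = min(remaining, r.max_parallel, capacity - timeline[t])
            alloc[t], remaining = take, remaining - take
            timeline[t] += take
            if remaining == 0:
                break
        if remaining:
            return None                      # infeasible under this heuristic
        allocation[r.name] = alloc
    return allocation

plan_out = plan([Reservation("prod", 0, 4, 6, 2),
                 Reservation("batch", 0, 4, 4, 4)], capacity=4, horizon=4)
```

Placing work as late as deadlines allow is one way to keep near-term capacity free for best-effort jobs, the trade-off the abstract's contribution (3) addresses adaptively.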


Symposium on Cloud Computing | 2012

True elasticity in multi-tenant data-intensive compute clusters

Ganesh Ananthanarayanan; Christopher William Douglas; Raghu Ramakrishnan; Sriram Rao; Ion Stoica

Data-intensive computing (DISC) frameworks scale by partitioning a job across a set of fault-tolerant tasks, then diffusing those tasks across large clusters. Multi-tenanted clusters must accommodate service-level objectives (SLO) in their resource model, often expressed as a maximum latency for allocating the desired set of resources to every job. When jobs are partitioned into tasks statically, a cluster cannot meet its SLOs while maintaining both high utilization and efficiency. Ideally, we want to give resources to jobs when they are free but would expect to reclaim them instantaneously when new jobs arrive, without losing work. DISC frameworks do not support such elasticity because interrupting running tasks incurs high overheads. Amoeba enables lightweight elasticity in DISC frameworks by identifying points at which running tasks of over-provisioned jobs can be safely exited, committing their outputs, and spawning new tasks for the remaining work. Effectively, tasks of DISC jobs are now sized dynamically in response to global resource scarcity or abundance. Simulation and deployment of our prototype shows that Amoeba speeds up jobs by 32% without compromising utilization or efficiency.
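The mechanism described above, exiting a running task at a safe point, committing what it has produced, and spawning a new task for the remainder, can be illustrated with a toy split function. All names here (`split_task`, `commit`, the task dict shape) are hypothetical stand-ins for exposition, not Amoeba's actual interfaces.

```python
committed = {}

def commit(task_id, records):
    """Durably record a finished task's output (toy in-memory version)."""
    committed[task_id] = records

def split_task(task, progress_records):
    """Toy version of lightweight elasticity: at a safe exit point,
    commit the records already processed and return a new, smaller
    task covering only the remaining input. No work is lost."""
    done = task["input"][:progress_records]
    commit(task["id"], done)                 # completed work survives the exit
    remainder = task["input"][progress_records:]
    if not remainder:
        return None                          # the task actually finished
    return {"id": task["id"] + "-rest", "input": remainder}

t = {"id": "map-7", "input": list(range(10))}
rest = split_task(t, progress_records=6)     # preempted after 6 records
```

The released container can be handed to an arriving job immediately, while `rest` is rescheduled whenever resources free up, which is how task sizes become dynamic rather than fixed at submission time.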


IEEE Transactions on Knowledge and Data Engineering | 2012

Data Cube Materialization and Mining over MapReduce

Arnab Nandi; Cong Yu; Philip Bohannon; Raghu Ramakrishnan

Computing interesting measures for data cubes and subsequent mining of interesting cube groups over massive data sets are critical for many important analyses done in the real world. Previous studies have focused on algebraic measures such as SUM that are amenable to parallel computation and can easily benefit from the recent advancement of parallel computing infrastructure such as MapReduce. Dealing with holistic measures such as TOP-K, however, is nontrivial. In this paper, we detail real-world challenges in cube materialization and mining tasks on web-scale data sets. Specifically, we identify an important subset of holistic measures and introduce MR-Cube, a MapReduce-based framework for efficient cube computation and identification of interesting cube groups on holistic measures. We provide extensive experimental analyses over both real and synthetic data. We demonstrate that, unlike existing techniques which cannot scale to the 100 million tuple mark for our data sets, MR-Cube successfully and efficiently computes cubes with holistic measures over billion-tuple data sets.
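To ground the distinction the abstract draws, here is a tiny single-machine illustration of cube materialization with a holistic TOP-K measure: every row contributes to every group in the cube lattice, and TOP-K (unlike SUM) needs the group's values together rather than partial aggregates. This is not MR-Cube itself, just a sketch of the computation it distributes; the function and field names are invented for the example.

```python
from collections import defaultdict
from itertools import combinations
import heapq

def cube_topk(rows, dims, measure, k):
    """Materialize every cube group over `dims` and compute the holistic
    TOP-K of `measure` per group (naive in-memory version)."""
    groups = defaultdict(list)
    for row in rows:
        # emit the row to every ancestor group in the cube lattice,
        # including the all-aggregated group () at the apex
        for r in range(len(dims) + 1):
            for subset in combinations(dims, r):
                key = tuple((d, row[d]) for d in subset)
                groups[key].append(row[measure])
    return {key: heapq.nlargest(k, vals) for key, vals in groups.items()}

rows = [{"state": "WA", "topic": "db", "score": 9},
        {"state": "WA", "topic": "ml", "score": 7},
        {"state": "CA", "topic": "db", "score": 8}]
cube = cube_topk(rows, dims=("state", "topic"), measure="score", k=2)
```

The blow-up is visible even here: each row lands in 2^|dims| groups, which is why naive approaches stop scaling and why MR-Cube's careful partitioning of holistic measures matters at web scale.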


Communications of the ACM | 2013

Content recommendation on web portals

Deepak Agarwal; Bee-Chung Chen; Pradheep Elango; Raghu Ramakrishnan

How to offer recommendations to users when they have not specified what they want.


International Conference on Management of Data | 2015

REEF: Retainable Evaluator Execution Framework

Markus Weimer; Yingda Chen; Byung-Gon Chun; Tyson Condie; Carlo Curino; Chris Douglas; Yunseong Lee; Tony Majestro; Dahlia Malkhi; Sergiy Matusevych; Brandon Myers; Shravan M. Narayanamurthy; Raghu Ramakrishnan; Sriram Rao; Russell Sears; Beysim Sezgin; Julia Wang

Resource Managers like Apache YARN have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. This flexibility comes at a high cost in terms of developer effort, as each application must repeatedly tackle the same challenges (e.g., fault-tolerance, task scheduling and coordination) and re-implement common mechanisms (e.g., caching, bulk-data transfers). This paper presents REEF, a development framework that provides a control-plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager. REEF provides mechanisms that facilitate resource re-use for data caching, and state management abstractions that greatly ease the development of elastic data processing workflows on cloud platforms that support a Resource Manager service. REEF is being used to develop several commercial offerings such as the Azure Stream Analytics service. Furthermore, we demonstrate REEF development of a distributed shell application, a machine learning algorithm, and a port of the CORFU [4] system. REEF is also currently an Apache Incubator project that has attracted contributors from several institutions (http://reef.incubator.apache.org).
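The control-plane pattern the abstract describes, a driver that leases containers and retains them across tasks so state such as caches survives, can be sketched as follows. This is a loose Python analogy of the Driver/Evaluator/Task roles, not REEF's actual (Java/C#) API; the class and method names are assumptions for illustration.

```python
class Evaluator:
    """Stand-in for a retained container: runs successive tasks and
    keeps state (e.g., a data cache) between them."""
    def __init__(self, eid):
        self.eid = eid
        self.cache = {}                 # state retained across tasks

    def run(self, task):
        return task(self.cache)

class Driver:
    """Stand-in for the job's control plane: leases evaluators from a
    resource manager and schedules task-level work onto them."""
    def __init__(self, num_evaluators):
        self.evaluators = [Evaluator(i) for i in range(num_evaluators)]

    def submit(self, task):
        ev = self.evaluators[0]         # trivial scheduling policy
        return ev.run(task)

driver = Driver(num_evaluators=1)
# First task loads data into the evaluator's cache ...
driver.submit(lambda cache: cache.setdefault("data", [1, 2, 3]))
# ... a later task on the SAME evaluator reuses it without reloading.
total = driver.submit(lambda cache: sum(cache["data"]))
```

Retaining the evaluator between tasks is the point: without it, every task would start from a cold container and repay the data-loading cost.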


International Conference on Management of Data | 2015

Machine Learning and Databases: The Sound of Things to Come or a Cacophony of Hype?

Christopher Ré; Divy Agrawal; Magdalena Balazinska; Michael J. Cafarella; Michael I. Jordan; Tim Kraska; Raghu Ramakrishnan

Machine learning seems to be eating the world with a new breed of high-value data-driven applications in image analysis, search, voice recognition, mobile, and office productivity products. To paraphrase Mike Stonebraker, machine learning is no longer a zero-billion-dollar business. As the home of high-value, data-driven applications for over four decades, a natural question for database researchers to ask is: what role should the database community play in these new data-driven machine-learning-based applications?


International Conference on Management of Data | 2017

Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics

Raghu Ramakrishnan; Baskar Sridharan; John R. Douceur; Pavan Kasturi; Balaji Krishnamachari-Sampath; Karthick Krishnamoorthy; Peng Li; Mitica Manu; Spiro Michaylov; Rogerio Ramos; Neil Sharman; Zee Xu; Youssef Barakat; Chris Douglas; Richard P. Draves; Shrikant S. Naidu; Shankar Shastry; Atul Sikaria; Simon Sun; Ramarathnam Venkatesan

Azure Data Lake Store (ADLS) is a fully-managed, elastic, scalable, and secure file system that supports Hadoop distributed file system (HDFS) and Cosmos semantics. It is specifically designed and optimized for a broad spectrum of Big Data analytics that depend on a very high degree of parallel reads and writes, as well as collocation of compute and data for high bandwidth and low-latency access. It brings together key components and features of Microsoft's Cosmos file system (long used by internal customers at Microsoft) and HDFS, and is a unified file storage solution for analytics on Azure. Internal and external workloads run on this unified platform. Distinguishing aspects of ADLS include its design for handling multiple storage tiers, exabyte scale, and comprehensive security and data-sharing features. We present an overview of ADLS architecture, design points, and performance.


International Conference on Management of Data | 2014

Should we all be teaching "intro to data science" instead of "intro to databases"?

Bill Howe; Michael J. Franklin; Juliana Freire; James Frew; Tim Kraska; Raghu Ramakrishnan

The Database Community has a unique perspective on the challenges and solutions of long-term management of data and the value of data as a resource. In current computer science curricula, however, these insights are typically locked up in the context of the traditional Intro to Databases class that was developed years (or in some cases, decades) before the modern concept of Data Science arose and embedded in the discussion of legacy data management systems. We consider how to bring these concepts front and center into the emerging wave of Data Science courses, degree programs and even departments.

Collaboration


Top co-authors of Raghu Ramakrishnan include:

Jonathan Goldstein (University of Wisconsin-Madison)
Byung-Gon Chun (Seoul National University)
Brandon Myers (University of Washington)