Tathagata Das

University of California, Berkeley

Publication

Featured research published by Tathagata Das.

acm special interest group on data communication | 2012

DeTail: reducing the flow completion time tail in datacenter networks

David Zats; Tathagata Das; Prashanth Mohan; Dhruba Borthakur; Randy H. Katz

Web applications have now become so sophisticated that rendering a typical page may require hundreds of intra-datacenter flows. At the same time, web sites must meet strict page creation deadlines of 200-300ms to satisfy user demands for interactivity. Long-tailed flow completion times make it challenging for web sites to meet these constraints. They are forced to choose between rendering a subset of the complex page and delaying its rendering, thus sacrificing either quality or responsiveness. Either option leads to potential financial loss. In this paper, we present a new cross-layer network stack aimed at reducing the long tail of flow completion times. The approach exploits cross-layer information to reduce packet drops, prioritize latency-sensitive flows, and evenly distribute network load, effectively reducing the long tail of flow completion times. We evaluate our approach through NS-3-based simulation and a Click-based implementation, demonstrating our ability to consistently reduce the tail across a wide range of workloads. We often achieve reductions of over 50% in 99.9th percentile flow completion times.
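
To make the tail metric in the abstract above concrete, here is a minimal Python sketch of how a 99.9th percentile flow completion time can be computed and why it can differ sharply from the median. The `percentile` helper uses the nearest-rank definition, which is an assumption (the paper may use a different estimator), and the flow times are synthetic values made up for illustration, not measurements from the paper.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Synthetic flow completion times in milliseconds (hypothetical data):
# 9,980 fast flows plus 20 slow flows that form a long tail.
flow_times_ms = [2.0] * 9980 + [150.0] * 20

median = percentile(flow_times_ms, 50)
tail = percentile(flow_times_ms, 99.9)

print(f"median = {median} ms, 99.9th percentile = {tail} ms")
```

In this synthetic sample only 0.2% of flows are slow, yet they fully determine the 99.9th percentile. A page that waits on hundreds of flows is likely to hit at least one of them, which is why the abstract targets the tail rather than the average.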

Communications of The ACM | 2016

Apache Spark: a unified engine for big data processing

Matei Zaharia; Reynold S. Xin; Patrick Wendell; Tathagata Das; Michael Armbrust; Ankur Dave; Xiangrui Meng; Josh Rosen; Shivaram Venkataraman; Michael J. Franklin; Ali Ghodsi; Joseph E. Gonzalez; Scott Shenker; Ion Stoica

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

very large data bases | 2015

Scaling Spark in the real world: performance and usability

Michael Armbrust; Tathagata Das; Aaron Davidson; Ali Ghodsi; Andrew Or; Josh Rosen; Ion Stoica; Patrick Wendell; Reynold S. Xin; Matei Zaharia

Apache Spark is one of the most widely used open source processing engines for big data, with rich language-integrated APIs and a wide range of libraries. Over the past two years, our group has worked to deploy Spark to a wide range of organizations through consulting relationships as well as our hosted service, Databricks. We describe the main challenges and requirements that appeared in taking Spark to a wide set of users, and usability and performance improvements we have made to the engine in response.

Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems | 2016

Time-evolving graph processing at scale

Anand Padmanabha Iyer; Li Erran Li; Tathagata Das; Ion Stoica

Time-evolving graph-structured big data arises naturally in many application domains such as social networks and communication networks. However, existing graph processing systems lack support for efficient computations on dynamic graphs. In this paper, we represent most computations on time-evolving graphs as (1) a stream of consistent and resilient graph snapshots, and (2) a small set of operators that manipulate such streams of snapshots. We then introduce GraphTau, a time-evolving graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphTau quickly builds fault-tolerant graph snapshots as each small batch of new data arrives. GraphTau achieves high-performance, fault-tolerant graph stream processing via a number of optimizations. GraphTau also unifies data streaming and graph streaming processing. Our preliminary evaluations on two representative datasets show promising results. Besides the performance benefit, the GraphTau API relieves programmers from handling graph snapshot generation, windowing operators, and sophisticated differential computation mechanisms.
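
The two ideas named in the abstract above, a stream of graph snapshots and operators over that stream, can be sketched in a few lines of plain Python. This is a conceptual illustration only: it does not use the GraphTau API or Spark, and the names `snapshots`, `degree_over_time`, and the `("add" | "remove", u, v)` update format are all hypothetical.

```python
def snapshots(edge_batches):
    """Yield one immutable graph snapshot after each small batch of edge updates."""
    edges = set()
    for batch in edge_batches:
        for op, u, v in batch:
            if op == "add":
                edges.add((u, v))
            elif op == "remove":
                edges.discard((u, v))
            else:
                raise ValueError(f"unknown op: {op}")
        # frozenset makes each snapshot a consistent, read-only view
        yield frozenset(edges)


def degree_over_time(snapshot_stream, vertex):
    """Operator over the snapshot stream: out-degree of one vertex in each snapshot."""
    return [sum(1 for u, _ in snap if u == vertex) for snap in snapshot_stream]


# Hypothetical update stream, grouped into three small batches.
edge_batches = [
    [("add", "a", "b"), ("add", "a", "c")],
    [("add", "b", "c"), ("remove", "a", "b")],
    [("add", "a", "d")],
]

degrees = degree_over_time(snapshots(edge_batches), "a")
print(degrees)
```

The caller only writes the per-snapshot computation; producing the snapshots from raw updates is handled separately, which mirrors the division of labor the abstract describes.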

international conference on management of data | 2018

Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark

Michael Armbrust; Tathagata Das; Joseph Torres; Burak Yavuz; Shixiong Zhu; Reynold S. Xin; Ali Ghodsi; Ion Stoica; Matei Zaharia

With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2x and Apache Kafka Streams by 90x. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system's design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.
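
The notion of incrementalizing a static query, mentioned in the abstract above, can be illustrated with a small pure-Python sketch. It is not Spark code and does not reflect how the engine is implemented; `batch_count` and `IncrementalCount` are hypothetical names. The point is only that a query written once over a full table (count events per key) can be maintained by updating state as each micro-batch arrives, and that both routes give the same answer.

```python
from collections import Counter


def batch_count(events):
    """Static query: count events per key over the complete input."""
    return Counter(key for key, _ in events)


class IncrementalCount:
    """Maintains the same per-key counts, updated one micro-batch at a time."""

    def __init__(self):
        self.state = Counter()

    def process(self, micro_batch):
        # Only the new rows are scanned; earlier input is summarized in self.state.
        self.state.update(key for key, _ in micro_batch)
        return dict(self.state)


# Hypothetical (event_type, event_id) records arriving in three micro-batches.
micro_batches = [
    [("click", 1), ("view", 2)],
    [("click", 3)],
    [("view", 4), ("view", 5), ("purchase", 6)],
]

query = IncrementalCount()
for batch in micro_batches:
    result = query.process(batch)

# The incremental result matches the static query run over all data at once.
all_events = [event for batch in micro_batches for event in batch]
print(result == dict(batch_count(all_events)))
```

In a declarative API of the kind the abstract describes, the user would write only the static form and leave the incremental maintenance to the system.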

networked systems design and implementation | 2012