Tathagata Das
University of California, Berkeley
Publication
Featured research published by Tathagata Das.
acm special interest group on data communication | 2012
David Zats; Tathagata Das; Prashanth Mohan; Dhruba Borthakur; Randy H. Katz
Web applications have now become so sophisticated that rendering a typical page may require hundreds of intra-datacenter flows. At the same time, web sites must meet strict page creation deadlines of 200-300ms to satisfy user demands for interactivity. Long-tailed flow completion times make it challenging for web sites to meet these constraints. They are forced to choose between rendering a subset of the complex page and delaying its rendering, thus missing deadlines and sacrificing either quality or responsiveness. Either option leads to potential financial loss. In this paper, we present a new cross-layer network stack aimed at reducing the long tail of flow completion times. The approach exploits cross-layer information to reduce packet drops, prioritize latency-sensitive flows, and evenly distribute network load, effectively reducing the long tail of flow completion times. We evaluate our approach through NS-3-based simulation and a Click-based implementation, demonstrating our ability to consistently reduce the tail across a wide range of workloads. We often achieve reductions of over 50% in 99.9th percentile flow completion times.
Communications of The ACM | 2016
Matei Zaharia; Reynold S. Xin; Patrick Wendell; Tathagata Das; Michael Armbrust; Ankur Dave; Xiangrui Meng; Josh Rosen; Shivaram Venkataraman; Michael J. Franklin; Ali Ghodsi; Joseph E. Gonzalez; Scott Shenker; Ion Stoica
This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
very large data bases | 2015
Michael Armbrust; Tathagata Das; Aaron Davidson; Ali Ghodsi; Andrew Or; Josh Rosen; Ion Stoica; Patrick Wendell; Reynold S. Xin; Matei Zaharia
Apache Spark is one of the most widely used open source processing engines for big data, with rich language-integrated APIs and a wide range of libraries. Over the past two years, our group has worked to deploy Spark to a wide range of organizations through consulting relationships as well as our hosted service, Databricks. We describe the main challenges and requirements that appeared in taking Spark to a wide set of users, and usability and performance improvements we have made to the engine in response.
Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems | 2016
Anand Padmanabha Iyer; Li Erran Li; Tathagata Das; Ion Stoica
Time-evolving graph-structured big data arises naturally in many application domains such as social networks and communication networks. However, existing graph processing systems lack support for efficient computations on dynamic graphs. In this paper, we express most computations on time-evolving graphs as (1) a stream of consistent and resilient graph snapshots, and (2) a small set of operators that manipulate such streams of snapshots. We then introduce GraphTau, a time-evolving graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphTau quickly builds fault-tolerant graph snapshots as each small batch of new data arrives. GraphTau achieves high-performance, fault-tolerant graph stream processing via a number of optimizations. GraphTau also unifies data stream and graph stream processing. Our preliminary evaluations on two representative datasets show promising results. Besides the performance benefit, the GraphTau API relieves programmers from handling graph snapshot generation, windowing operators, and sophisticated differential computation mechanisms.
international conference on management of data | 2018
Michael Armbrust; Tathagata Das; Joseph Torres; Burak Yavuz; Shixiong Zhu; Reynold S. Xin; Ali Ghodsi; Ion Stoica; Matei Zaharia
With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2x and Apache Kafka Streams by 90x. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system's design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.
networked systems design and implementation | 2012
Matei Zaharia; Mosharaf Chowdhury; Tathagata Das; Ankur Dave; Justin Ma; Murphy McCauley; Michael J. Franklin; Scott Shenker; Ion Stoica
symposium on operating systems principles | 2013
Matei Zaharia; Tathagata Das; Haoyuan Li; Timothy Hunter; Scott Shenker; Ion Stoica
usenix conference on hot topics in cloud computing | 2012
Matei Zaharia; Tathagata Das; Haoyuan Li; Scott Shenker; Ion Stoica
symposium on cloud computing | 2014
Tathagata Das; Yuan Zhong; Ion Stoica; Scott Shenker
IEEE Transactions on Automation Science and Engineering | 2013
Timothy Hunter; Tathagata Das; Matei Zaharia; Pieter Abbeel; Alexandre M. Bayen