Subramanya R. Dulloor
Intel
Publications
Featured research published by Subramanya R. Dulloor.
International Conference on Management of Data | 2015
Joy Arulraj; Andrew Pavlo; Subramanya R. Dulloor
The advent of non-volatile memory (NVM) will fundamentally change the dichotomy between memory and durable storage in database management systems (DBMSs). These new NVM devices are almost as fast as DRAM, but writes to them are potentially persistent even after power loss. Existing DBMSs are unable to take full advantage of this technology because their internal architectures are predicated on the assumption that memory is volatile. With NVM, many of the components of legacy DBMSs are unnecessary and will degrade the performance of data-intensive applications. To better understand these issues, we implemented three engines in a modular DBMS testbed that are based on different storage management architectures: (1) in-place updates, (2) copy-on-write updates, and (3) log-structured updates. We then present NVM-aware variants of these architectures that leverage the persistence and byte-addressability properties of NVM in their storage and recovery methods. Our experimental evaluation on an NVM hardware emulator shows that these engines achieve up to 5.5X higher throughput than their traditional counterparts while reducing the amount of wear due to write operations by up to 2X. We also demonstrate that our NVM-aware recovery protocols allow these engines to recover almost instantaneously after the DBMS restarts.
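As a rough illustration of the kind of NVM-aware update the abstract describes, the sketch below makes an in-place update durable on byte-addressable NVM with a cache-line flush followed by a store fence. The Tuple layout and the persist() helper are assumptions made for illustration, not code from the paper's engines.

```cpp
// Minimal sketch (not the paper's engine code): durable in-place update on
// byte-addressable NVM. Assumes the tuple lives in a memory-mapped NVM region
// and that clflush + sfence suffices to order the stores for persistence.
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>   // _mm_clflush, _mm_sfence

struct Tuple {
    uint64_t txn_id;     // commit marker, written last
    uint64_t value;
};

// Flush every cache line covering [addr, addr + len) and fence.
static void persist(const void* addr, size_t len) {
    const char* p = static_cast<const char*>(addr);
    for (size_t off = 0; off < len; off += 64)
        _mm_clflush(p + off);
    _mm_sfence();
}

// In-place update: write the payload, persist it, then persist the commit
// marker so recovery can tell whether the update completed.
void nvm_update(Tuple* t, uint64_t new_value, uint64_t txn_id) {
    t->value = new_value;
    persist(&t->value, sizeof(t->value));
    t->txn_id = txn_id;          // becomes visible to recovery only now
    persist(&t->txn_id, sizeof(t->txn_id));
}
```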
Very Large Data Bases | 2015
Narayanan Sundaram; Nadathur Satish; Md. Mostofa Ali Patwary; Subramanya R. Dulloor; Michael J. Anderson; Satya Gautam Vadlamudi; Dipankar Das; Pradeep Dubey
Given the growing importance of large-scale graph analytics, there is a need to improve the performance of graph analysis frameworks without compromising on productivity. GraphMat is our solution to bridge this gap between a user-friendly graph analytics framework and native, hand-optimized code. GraphMat works by taking vertex programs and mapping them to high-performance sparse matrix operations in the backend. We thus get the productivity benefits of a vertex programming framework without sacrificing performance. GraphMat is a single-node multicore graph framework written in C++ that has enabled us to write a diverse set of graph algorithms with the same effort as in other vertex programming frameworks. GraphMat performs 1.1-7X faster than high performance frameworks such as GraphLab, CombBLAS, and Galois. GraphMat also matches the performance of MapGraph, a GPU-based graph framework, despite running on a CPU platform with significantly lower compute and bandwidth resources. It achieves better multicore scalability (13-15X on 24 cores) than other frameworks and is within 1.2X of native, hand-optimized code on a variety of graph algorithms. Since GraphMat performance depends mainly on a few scalable and well-understood sparse matrix operations, it can naturally benefit from the trend of increasing parallelism in future hardware.
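The mapping the abstract describes, vertex programs executed as sparse matrix operations, can be illustrated with a toy single-threaded PageRank iteration over a graph in CSR form. The CSRGraph layout and damping factor are illustrative assumptions, not GraphMat's actual backend.

```cpp
// Toy sketch of the vertex-program-to-sparse-matrix mapping: one PageRank
// iteration expressed as a sparse matrix-vector product over a CSR graph.
// Single-threaded and unoptimized; for illustration only.
#include <cstddef>
#include <vector>

struct CSRGraph {
    std::vector<size_t> row_ptr;     // row_ptr[v]..row_ptr[v+1): in-edges of v
    std::vector<size_t> col_idx;     // source vertex of each in-edge
    std::vector<size_t> out_degree;  // out-degree of each vertex
};

std::vector<double> pagerank_step(const CSRGraph& g,
                                  const std::vector<double>& rank,
                                  double damping = 0.85) {
    const size_t n = rank.size();
    std::vector<double> next(n, (1.0 - damping) / n);
    for (size_t v = 0; v < n; ++v)                           // "gather" over in-edges
        for (size_t e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
            size_t u = g.col_idx[e];
            next[v] += damping * rank[u] / g.out_degree[u];  // SpMV accumulate
        }
    return next;
}
```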
Symposium on Operating Systems Principles | 2015
Jasmina Malicevic; Subramanya R. Dulloor; Narayanan Sundaram; Nadathur Satish; Jeff Jackson; Willy Zwaenepoel
Data center applications like graph analytics require servers with ever larger memory capacities. DRAM scaling, however, is not able to match the increasing demands for capacity. Emerging byte-addressable, non-volatile memory technologies (NVM) offer a more scalable alternative, with memory that is directly addressable by software, but at a higher latency and lower bandwidth. Using an NVM hardware emulator, we study the suitability of NVM in meeting the memory demands of four state-of-the-art graph analytics frameworks, namely GraphLab, Galois, X-Stream, and GraphMat. We evaluate their performance with popular algorithms (PageRank, BFS, triangle counting, and collaborative filtering) by allocating memory exclusively from DRAM (DRAM-only) or from emulated NVM (NVM-only). While all of these applications are sensitive to the higher latency and lower bandwidth of NVM, resulting in performance degradation of up to 4x with NVM-only (compared to DRAM-only), we show that the performance impact is somewhat mitigated in the frameworks that exploit CPU memory-level parallelism and hardware prefetchers. Further, we demonstrate that, in a hybrid memory system with NVM and DRAM, intelligent placement of application data based on their relative importance may help offset the overheads of the NVM-only solution in a cost-effective manner (i.e., using only a small amount of DRAM). Specifically, we show that, depending on the algorithm, GraphMat can achieve close to DRAM-only performance (within 1.2x) by placing only 6.7% to 31.5% of its total memory footprint in DRAM.
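A minimal sketch of the placement idea, assuming two separate allocation pools: the small, frequently accessed vertex state goes to DRAM while the large, mostly streamed edge arrays go to NVM. The dram_alloc/nvm_alloc wrappers are hypothetical placeholders, not an API from the paper.

```cpp
// Illustrative sketch of importance-based placement in a hybrid DRAM/NVM
// system. dram_alloc/nvm_alloc stand in for allocators backed by the two
// memory pools; here they are placeholders so the sketch compiles.
#include <cstddef>
#include <cstdlib>

void* dram_alloc(size_t bytes) { return std::malloc(bytes); }  // placeholder
void* nvm_alloc(size_t bytes)  { return std::malloc(bytes); }  // placeholder

struct GraphBuffers {
    double* vertex_values;   // hot: random access every iteration -> DRAM
    size_t* edge_targets;    // large, mostly sequential scans -> NVM
};

GraphBuffers place(size_t num_vertices, size_t num_edges) {
    GraphBuffers b;
    b.vertex_values = static_cast<double*>(dram_alloc(num_vertices * sizeof(double)));
    b.edge_targets  = static_cast<size_t*>(nvm_alloc(num_edges * sizeof(size_t)));
    return b;
}
```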
Very Large Data Bases | 2017
Michael R. Anderson; Shaden Smith; Narayanan Sundaram; Mihai Capotă; Zheguang Zhao; Subramanya R. Dulloor; Nadathur Satish; Theodore L. Willke
Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower than native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1-17.7× speedups on the four sparse applications, including all of the overheads. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.
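The core MPI pattern behind such an offload can be sketched as a local reduction followed by MPI_Allreduce. This is an illustrative stand-alone MPI program, not the paper's Spark integration code; the partition contents are made up.

```cpp
// Sketch of the MPI side of offloading an aggregation step from a
// data-parallel framework: each rank reduces its local partition, then
// MPI_Allreduce combines the partial sums across ranks.
#include <mpi.h>
#include <vector>

double distributed_sum(const std::vector<double>& local_partition) {
    double local = 0.0, global = 0.0;
    for (double x : local_partition) local += x;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;   // every rank receives the full sum
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::vector<double> part(1000, rank + 1.0);   // stand-in for one partition
    double total = distributed_sum(part);
    (void)total;
    MPI_Finalize();
    return 0;
}
```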
Data Management on New Hardware | 2016
Lin Ma; Joy Arulraj; Sam Zhao; Andrew Pavlo; Subramanya R. Dulloor; Michael Giardino; Jeff Parkhurst; Jason L. Gardner; Stanley B. Zdonik
In-memory database management systems (DBMSs) outperform disk-oriented systems for on-line transaction processing (OLTP) workloads. But this improved performance is only achievable when the database is smaller than the amount of physical memory available in the system. To overcome this limitation, some in-memory DBMSs can move cold data out of volatile DRAM to secondary storage. Such data appears as if it resides in memory with the rest of the database even though it does not. Although there have been several implementations proposed for this type of cold data storage, there has not been a thorough evaluation of the design decisions in implementing this technique, such as policies for when to evict tuples and how to bring them back when they are needed. These choices are further complicated by the varying performance characteristics of different storage devices, including future non-volatile memory technologies. We explore these issues in this paper and discuss several approaches to solve them. We implemented all of these approaches in an in-memory DBMS and evaluated them using five different storage technologies. Our results show that choosing the best strategy based on the hardware improves throughput by 92-340% over a generic configuration.
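One point in the design space the abstract surveys can be sketched as a simple LRU eviction policy: once the number of in-memory tuples exceeds a budget, the coldest tuples are moved to a backing store. The HotTable and ColdStore structures below are illustrative assumptions, not the evaluated system's implementation.

```cpp
// Minimal sketch of LRU-based cold-tuple eviction from an in-memory table.
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

struct ColdStore {                       // stand-in for disk/NVM backing storage
    std::unordered_map<uint64_t, std::vector<char>> tuples;
    void write(uint64_t id, std::vector<char> data) { tuples[id] = std::move(data); }
};

class HotTable {
public:
    // budget = maximum number of tuples kept in memory
    HotTable(size_t budget, ColdStore& cold) : budget_(budget), cold_(cold) {}

    void access(uint64_t id, std::vector<char> data) {
        lru_.remove(id);                 // O(n); acceptable for a sketch
        lru_.push_front(id);             // mark as most recently used
        hot_[id] = std::move(data);
        while (hot_.size() > budget_) {  // evict coldest tuples past the budget
            uint64_t victim = lru_.back();
            lru_.pop_back();
            cold_.write(victim, std::move(hot_[victim]));
            hot_.erase(victim);          // a real engine would leave a stub behind
        }
    }

private:
    size_t budget_;
    ColdStore& cold_;
    std::list<uint64_t> lru_;
    std::unordered_map<uint64_t, std::vector<char>> hot_;
};
```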
Proceedings of the 16th Workshop on Hot Topics in Operating Systems | 2017
Ehsan Totoni; Subramanya R. Dulloor; Amitabha Roy
Big data systems such as Spark are built around the idea of splitting an iterative parallel program into tiny tasks, with other aspects of system design built around this basic principle. Unfortunately, in spite of immense engineering effort, tiny tasks have unavoidably large overheads. We use the example of logistic regression -- a common machine learning primitive -- to compare the performance of Spark to different designs that converge to a hand-coded parallel MPI-based implementation. We conclude that Spark leaves orders of magnitude of performance on the table, due to its insistence on setting the granularity of a task to a single iteration. We counter a common argument for the tiny task approach -- namely, better resilience to faults -- by demonstrating that optimum job checkpoint intervals are far longer than the duration of the tiny tasks favored in Spark's design. We propose an alternative approach that relies on an auto-parallelizing compiler tightly integrated with the MPI runtime, illustrating the opposite end of the spectrum where task granularities are as large as possible.
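The checkpointing argument can be made concrete with Young's first-order approximation for the optimal checkpoint interval, sqrt(2 × checkpoint cost × MTBF). The inputs below are illustrative assumptions, not measurements from the paper.

```cpp
// Back-of-the-envelope check of the checkpointing argument using Young's
// first-order approximation: optimal interval ~ sqrt(2 * checkpoint_cost * MTBF).
#include <cmath>
#include <cstdio>

int main() {
    double checkpoint_cost_s = 30.0;          // assumed time to write one checkpoint
    double mtbf_s            = 24.0 * 3600.0; // assumed mean time between failures
    double interval_s = std::sqrt(2.0 * checkpoint_cost_s * mtbf_s);
    // ~2280 s, i.e. tens of minutes -- far longer than the millisecond-to-second
    // tasks that tiny-task scheduling is sized around.
    std::printf("optimal checkpoint interval ~ %.0f seconds\n", interval_s);
    return 0;
}
```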
Very Large Data Bases | 2014
Justin DeBrabant; Joy Arulraj; Andrew Pavlo; Michael Stonebraker; Stanley B. Zdonik; Subramanya R. Dulloor
USENIX Annual Technical Conference | 2014
Philip R. Lantz; Subramanya R. Dulloor; Sanjay Kumar; Rajesh Sankaran; Jeff Jackson
Archive | 2013
Subramanya R. Dulloor; Sanjay Kumar; Rajesh M. Sankaran; Gilbert Neiger; Richard Uhlig; Robert S. Chappell; Joseph Nuzman; Kai Cheng; Sailesh Kottapalli; Yen-Cheng Liu; Mohan J. Kumar; Raj K. Ramanujan; Glenn J. Hinton
Archive | 2017
Subramanya R. Dulloor; Rajesh M. Sankaran; Sanjay Kumar