Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Vijay Gadepally is active.

Publication


Featured research published by Vijay Gadepally.


IEEE High Performance Extreme Computing Conference | 2014

Computing on masked data: a high performance method for improving big data veracity

Jeremy Kepner; Vijay Gadepally; Peter Michaleas; Nabil Schear; Mayank Varia; Arkady Yerukhimovich; Robert K. Cunningham

The growing gap between data and users calls for innovative tools that address the challenges faced by big data volume, velocity and variety. Along with these standard three Vs of big data, an emerging fourth “V” is veracity, which addresses the confidentiality, integrity, and availability of the data. Traditional cryptographic techniques that ensure the veracity of data can have overheads that are too large to apply to big data. This work introduces a new technique called Computing on Masked Data (CMD), which improves data veracity by allowing computations to be performed directly on masked data and ensuring that only authorized recipients can unmask the data. Using the sparse linear algebra of associative arrays, CMD can be performed with significantly less overhead than other approaches while still supporting a wide range of linear algebraic operations on the masked data. Databases with strong support of sparse operations, such as SciDB or Apache Accumulo, are ideally suited to this technique. Examples are shown for the application of CMD to a complex DNA matching algorithm and to database operations over social media data.
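The core masking idea can be sketched in a few lines of Python. This is an illustrative toy that uses an HMAC as the deterministic mask; the paper's actual cryptographic constructions differ, and the key, labels, and data here are invented:

```python
import hmac
import hashlib

KEY = b"shared-secret"  # held by the data owner and authorized recipients

def mask(label):
    # Deterministic masking: equal labels produce equal masks, so
    # equality joins and sparse products still work on masked data.
    return hmac.new(KEY, label.encode(), hashlib.sha256).hexdigest()[:16]

# Associative array stored as {(row, col): value} triples
A = {("alice", "gene1"): 1, ("bob", "gene2"): 1}

# Mask the labels before shipping the array to an untrusted store
A_masked = {(mask(r), mask(c)): v for (r, c), v in A.items()}

# A query for "gene1" is masked the same way, so it matches server-side
# without the server ever seeing the plaintext labels
hits = [r for (r, c), v in A_masked.items() if c == mask("gene1")]
```

Only holders of the key can map the returned masked row labels back to plaintext, which is the "only authorized recipients can unmask" property in miniature.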


IEEE High Performance Extreme Computing Conference | 2014

Achieving 100,000,000 database inserts per second using Accumulo and D4M

Jeremy Kepner; David Bestor; Bill Bergeron; Chansup Byun; Vijay Gadepally; Matthew Hubbell; Peter Michaleas; Julie Mullen; Andrew Prout; Albert Reuther; Antonio Rosa; Charles Yee

The Apache Accumulo database is an open source relaxed consistency database that is widely used for government applications. Accumulo is designed to deliver high performance on unstructured data such as graphs of network data. This paper tests the performance of Accumulo using data from the Graph500 benchmark. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a 216-node cluster running the MIT SuperCloud software stack. A peak performance of over 100,000,000 database inserts per second was achieved which is 100× larger than the highest previously published value for any other database. The performance scales linearly with the number of ingest clients, number of database servers, and data size. The performance was achieved by adapting several supercomputing techniques to this application: distributed arrays, domain decomposition, adaptive load balancing, and single-program-multiple-data programming.
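The domain-decomposition idea behind the parallel ingest can be sketched as follows. This is a simplified illustration, not D4M's API; the function name and batch logic are hypothetical:

```python
def partition(edges, n_clients):
    # Domain decomposition: give each ingest client a contiguous slice
    # of the edge list, so clients write independent batches in parallel
    # without contending for the same data.
    size = (len(edges) + n_clients - 1) // n_clients
    return [edges[i * size:(i + 1) * size] for i in range(n_clients)]

edges = [(i, i + 1) for i in range(10)]   # toy Graph500-style edge list
batches = partition(edges, 3)             # one batch per ingest client
```

In the paper, linear scaling comes from running one such independent ingest stream per client in single-program-multiple-data fashion, with adaptive load balancing across the database servers.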


IEEE Transactions on Intelligent Transportation Systems | 2014

A Framework for Estimating Driver Decisions Near Intersections

Vijay Gadepally; Ashok K. Krishnamurthy; Umit Ozguner

We present a framework for the estimation of driver behavior at intersections, with applications to autonomous driving and vehicle safety. The framework is based on modeling the driver behavior and vehicle dynamics as a hybrid-state system (HSS), with driver decisions being modeled as a discrete-state system and the vehicle dynamics modeled as a continuous-state system. The proposed estimation method uses observable parameters to track the instantaneous continuous state and estimates the most likely behavior of a driver given these observations. This paper describes a framework that encompasses the hybrid structure of vehicle-driver coupling and uses hidden Markov models (HMMs) to estimate driver behavior from filtered continuous observations. Such a method is suitable for scenarios that involve unknown decisions of other vehicles, such as lane changes or intersection access. Such a framework requires extensive data collection, and the authors describe the procedure used in collecting and analyzing vehicle driving data. For illustration, the proposed hybrid architecture and driver behavior estimation techniques are trained and tested near intersections with exemplary results provided. Comparison is made between the proposed framework, simple classifiers, and naturalistic driver estimation. Obtained results show promise for using the HSS-HMM framework.
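A minimal HMM decoding sketch in the spirit of the framework is shown below. The two-decision model, its probabilities, and the observation symbols are invented for illustration; the paper learns its models from real collected driving data:

```python
# Hypothetical two-decision driver model: "stop" vs. "go", observed
# through discretized speed readings ("slow" / "fast").
states = ["stop", "go"]
start = {"stop": 0.5, "go": 0.5}
trans = {"stop": {"stop": 0.8, "go": 0.2}, "go": {"stop": 0.2, "go": 0.8}}
emit = {"stop": {"slow": 0.9, "fast": 0.1}, "go": {"slow": 0.2, "fast": 0.8}}

def viterbi(obs):
    # Most likely discrete decision sequence given the filtered
    # continuous observations (standard Viterbi decoding).
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            col[s] = V[-1][prev] * trans[prev][s] * emit[s][o]
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

decisions = viterbi(["slow", "slow", "fast", "fast"])
```

The hybrid-state coupling in the paper sits on top of this: a continuous filter tracks the vehicle state, and its discretized outputs feed the HMM as observations.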


International Parallel and Distributed Processing Symposium | 2015

Graphulo: Linear Algebra Graph Kernels for NoSQL Databases

Vijay Gadepally; Jake Bolewski; Dan Hook; Dylan Hutchison; Benjamin A. Miller; Jeremy Kepner

Big data and the Internet of Things era continue to challenge computational systems. Several technology solutions such as NoSQL databases have been developed to deal with this challenge. In order to generate meaningful results from large datasets, analysts often use a graph representation which provides an intuitive way to work with the data. Graph vertices can represent users and events, and edges can represent the relationship between vertices. Graph algorithms are used to extract meaningful information from these very large graphs. At MIT, the Graphulo initiative is an effort to perform graph algorithms directly in NoSQL databases such as Apache Accumulo or SciDB, which have an inherently sparse data storage scheme. Sparse matrix operations have a history of efficient implementations and the Graph Basic Linear Algebra Subprogram (Graph BLAS) community has developed a set of key kernels that can be used to develop efficient linear algebra operations. However, in order to use the Graph BLAS kernels, it is important that common graph algorithms be recast using the linear algebra building blocks. In this article, we look at common classes of graph algorithms and recast them into linear algebra operations using the Graph BLAS building blocks.
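The recasting of a graph algorithm into linear algebra can be illustrated with breadth-first search, where each step is a sparse vector-times-matrix product. This is a toy dictionary-based sketch, not the Graphulo implementation:

```python
# Adjacency "matrix" as a sparse associative array {(src, dst): weight}
A = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1, ("c", "d"): 1}

def spmv(frontier, A):
    # One BFS step as a sparse vector-matrix product:
    #   next[dst] += frontier[src] * A[src, dst]
    nxt = {}
    for (src, dst), w in A.items():
        if src in frontier:
            nxt[dst] = nxt.get(dst, 0) + frontier[src] * w
    return nxt

level1 = spmv({"a": 1}, A)   # vertices one hop from "a"
level2 = spmv(level1, A)     # vertices reachable in two hops
```

Because each step is just a sparse product, the same kernel can be pushed down into a sparse-storage database rather than executed client-side, which is the Graphulo premise.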


IEEE High Performance Extreme Computing Conference | 2015

D4M: Bringing associative arrays to database engines

Vijay Gadepally; Jeremy Kepner; David Bestor; Bill Bergeron; Chansup Byun; Lauren Edwards; Matthew Hubbell; Peter Michaleas; Julie Mullen; Andrew Prout; Antonio Rosa; Charles Yee; Albert Reuther

The ability to collect and analyze large amounts of data is a growing problem within the scientific community. The growing gap between data and users calls for innovative tools that address the challenges faced by big data volume, velocity and variety. Numerous tools exist that allow users to store, query and index these massive quantities of data. Each storage or database engine comes with the promise of dealing with complex data. Scientists and engineers who wish to use these systems often quickly find that there is no single technology that offers a panacea to the complexity of information. When using multiple technologies, however, there is significant trouble in designing the movement of information between storage and database engines to support an end-to-end application, along with a steep learning curve associated with learning the nuances of each underlying technology. In this article, we present the Dynamic Distributed Dimensional Data Model (D4M) as a potential tool to unify database and storage engine operations. Previous articles on D4M have showcased the ability of D4M to interact with the popular NoSQL Accumulo database. Recently, however, D4M has been extended to operate on a variety of backend storage or database engines while providing a federated look to the end user through the use of associative arrays. In order to showcase how new databases may be supported by D4M, we describe the process of building the D4M-SciDB connector and present the performance of this connector.
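A minimal sketch of the associative-array abstraction that provides the federated view is shown below. This is illustrative only; D4M's actual classes and connectors are far richer:

```python
class AssociativeArray:
    # D4M-style associative array sketch: string row and column keys
    # mapping to values, independent of which engine stores them.
    def __init__(self, triples):
        self.data = dict(triples)  # {(row, col): value}

    def row(self, r):
        # Row slice, the same operation regardless of backend
        return {c: v for (rr, c), v in self.data.items() if rr == r}

    def to_triples(self):
        # (row, column, value) triples, the natural layout for a
        # key-value store such as Accumulo
        return [(r, c, v) for (r, c), v in self.data.items()]

A = AssociativeArray({("doc1", "word:big"): 2,
                      ("doc1", "word:data"): 1,
                      ("doc2", "word:data"): 3})
```

The point of the abstraction is that the same `row`-style queries can be translated to Accumulo scans, SciDB array slices, or SQL, so the user code does not change when the backend does.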


IEEE High Performance Extreme Computing Conference | 2014

A survey of cryptographic approaches to securing big-data analytics in the cloud

Sophia Yakoubov; Vijay Gadepally; Nabil Schear; Emily Shen; Arkady Yerukhimovich

The growing demand for cloud computing motivates the need to study the security of data received, stored, processed, and transmitted by a cloud. In this paper, we present a framework for such a study. We introduce a cloud computing model that captures a rich class of big-data use-cases and allows reasoning about relevant threats and security goals. We then survey three cryptographic techniques - homomorphic encryption, verifiable computation, and multi-party computation - that can be used to achieve these goals. We describe the cryptographic techniques in the context of our cloud model and highlight the differences in performance cost associated with each.


IEEE High Performance Extreme Computing Conference | 2016

The BigDAWG polystore system and architecture

Vijay Gadepally; Peinan Chen; Jennie Duggan; Aaron J. Elmore; Brandon Haynes; Jeremy Kepner; Samuel Madden; Tim Mattson; Michael Stonebraker

Organizations are often faced with the challenge of providing data management solutions for large, heterogeneous datasets that may have different underlying data and programming models. For example, a medical dataset may have unstructured text, relational data, time series waveforms and imagery. Trying to fit such datasets in a single data management system can have adverse performance and efficiency effects. As a part of the Intel Science and Technology Center on Big Data, we are developing a polystore system designed for such problems. BigDAWG (short for the Big Data Analytics Working Group) is a polystore system designed to work on complex problems that naturally span across different processing or storage engines. BigDAWG provides an architecture that supports diverse database systems working with different data models, support for the competing notions of location transparency and semantic completeness via islands, and a middleware that provides a uniform multi-island interface. Initial results from a prototype of the BigDAWG system applied to a medical dataset validate polystore concepts. In this article, we will describe polystore databases, the current BigDAWG architecture and its application on the MIMIC II medical dataset, initial performance results and our future development plans.


IEEE High Performance Extreme Computing Conference | 2017

Static graph challenge: Subgraph isomorphism

Siddharth Samsi; Vijay Gadepally; Michael B. Hurley; Michael Jones; Edward K. Kao; Sanjeev Mohindra; Paul Monticciolo; Albert Reuther; Steven Smith; William S. Song; Diane Staheli; Jeremy Kepner

The rise of graph analytic systems has created a need for ways to measure and compare the capabilities of these systems. Graph analytics present unique scalability difficulties. The machine learning, high performance computing, and visual analytics communities have wrestled with these difficulties for decades and developed methodologies for creating challenges to move these communities forward. The proposed Subgraph Isomorphism Graph Challenge draws upon prior challenges from machine learning, high performance computing, and visual analytics to create a graph challenge that is reflective of many real-world graph analytics processing systems. The Subgraph Isomorphism Graph Challenge is a holistic specification with multiple integrated kernels that can be run together or independently. Each kernel is well defined mathematically and can be implemented in any programming environment. Subgraph isomorphism is amenable to both vertex-centric implementations and array-based implementations (e.g., using the Graph-BLAS.org standard). The computations are simple enough that performance predictions can be made based on simple computing hardware models. The surrounding kernels provide the context for each kernel that allows rigorous definition of both the input and the output for each kernel. Furthermore, since the proposed graph challenge is scalable in both problem size and hardware, it can be used to measure and quantitatively compare a wide range of present-day and future systems. Serial implementations in C++, Python, Python with Pandas, Matlab, Octave, and Julia have been implemented and their single-threaded performance has been measured. Specifications, data, and software are publicly available at GraphChallenge.org.
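One such kernel, counting triangles (a small subgraph isomorphism instance), can be expressed in the array-based style the challenge encourages as sum(A² ∘ A)/6, i.e., length-2 paths masked by the adjacency matrix. A toy pure-Python sketch, not one of the challenge's reference implementations:

```python
def triangles(edges):
    # Build an undirected adjacency structure (no self loops assumed)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # (A^2)[u][v] counts length-2 paths u->*->v; intersecting the
    # neighbor sets of an existing edge (u, v) keeps only closed paths
    closed = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u])
    return closed // 6  # each triangle is counted 6 times

t = triangles([(0, 1), (1, 2), (0, 2), (2, 3)])  # one triangle: 0-1-2
```

Because the kernel is pure sparse matrix arithmetic, its runtime is predictable from a hardware model, which is exactly the property the challenge exploits for cross-system comparison.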


IEEE High Performance Extreme Computing Conference | 2015

Graphulo implementation of server-side sparse matrix multiply in the Accumulo database

Dylan Hutchison; Jeremy Kepner; Vijay Gadepally; Adam Fuchs

The Apache Accumulo database excels at distributed storage and indexing and is ideally suited for storing graph data. Many big data analytics compute on graph data and persist their results back to the database. These graph calculations are often best performed inside the database server. The GraphBLAS standard provides a compact and efficient basis for a wide range of graph applications through a small number of sparse matrix operations. In this article, we discuss a server-side implementation of GraphBLAS sparse matrix multiplication that leverages Accumulo's native, high-performance iterators. We compare the mathematics and performance of inner and outer product implementations, and show how an outer product implementation achieves optimal performance near Accumulo's peak write rate. We offer our work as a core component to the Graphulo library that will deliver matrix math primitives for graph analytics within Accumulo.
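The outer-product formulation can be sketched as follows: each entry A(i, k) is joined with row k of B and the partial products are emitted immediately and summed on write, a write-heavy pattern that matches a high-ingest store like Accumulo. Illustrative Python over dictionary-backed sparse matrices, not the Graphulo iterators:

```python
from collections import defaultdict

def outer_product_multiply(A, B):
    # A and B are sparse matrices stored as {(row, col): value}
    rows_of_B = defaultdict(list)
    for (k, j), b in B.items():
        rows_of_B[k].append((j, b))
    C = defaultdict(int)
    for (i, k), a in A.items():        # join A(i, k) with row k of B
        for j, b in rows_of_B[k]:
            C[(i, j)] += a * b         # partial products summed on write
    return dict(C)

A = {(0, 0): 1, (0, 1): 2}
B = {(0, 0): 3, (1, 0): 4}
C = outer_product_multiply(A, B)       # C[0, 0] = 1*3 + 2*4 = 11
```

An inner-product formulation would instead gather a full row of A and column of B before emitting each finished C entry, trading writes for reads; the paper's result is that the write-heavy variant wins inside Accumulo.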


IEEE High Performance Extreme Computing Conference | 2015

Using a Power Law distribution to describe big data

Vijay Gadepally; Jeremy Kepner

The gap between data production and user ability to access, compute and produce meaningful results calls for tools that address the challenges associated with big data volume, velocity and variety. One of the key hurdles is the inability to methodically remove expected or uninteresting elements from large data sets. This difficulty often wastes valuable researcher and computational time by expending resources on uninteresting parts of data. Social sensors, or sensors which produce data based on human activity, such as Wikipedia, Twitter, and Facebook have an underlying structure which can be thought of as having a Power Law distribution. Such a distribution implies that few nodes generate large amounts of data. In this article, we propose a technique to take an arbitrary dataset and compute a power law distributed background model that bases its parameters on observed statistics. This model can be used to determine the suitability of using a power law or automatically identify high degree nodes for filtering and can be scaled to work with big data.
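The filtering use-case can be sketched in a few lines. The data and the degree threshold below are invented for illustration; the paper fits its background model's parameters from observed statistics rather than using a fixed cutoff:

```python
from collections import Counter

# Toy edge list with one "hub" node, mimicking the heavy tail a
# power-law-distributed social sensor produces
edges = [("hub", f"n{i}") for i in range(50)] + [("a", "b"), ("b", "c")]

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Flag nodes whose degree far exceeds the typical (median) degree --
# candidates for the uninteresting "background" to filter out
median = sorted(degree.values())[len(degree) // 2]
high_degree = [n for n, d in degree.items() if d > 10 * median]
```

Removing such background nodes before analysis is the time-saving step the abstract describes: the few high-degree nodes account for most of the data but often carry the least analytic interest.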

Collaboration


Dive into Vijay Gadepally's collaborations.

Top Co-Authors

Jeremy Kepner, Massachusetts Institute of Technology
Albert Reuther, Massachusetts Institute of Technology
Peter Michaleas, Massachusetts Institute of Technology
Siddharth Samsi, Massachusetts Institute of Technology
Matthew Hubbell, Massachusetts Institute of Technology
Chansup Byun, Massachusetts Institute of Technology
David Bestor, Massachusetts Institute of Technology
Andrew Prout, Massachusetts Institute of Technology
Antonio Rosa, Massachusetts Institute of Technology
Bill Bergeron, Massachusetts Institute of Technology