Rekha Singhal
Tata Consultancy Services
Publications
Featured research published by Rekha Singhal.
international parallel and distributed processing symposium | 2016
Rekha Singhal; Abhishek Verma
MapReduce is a popular paradigm for processing big data due to the wide availability of open source implementations such as Apache Hadoop. The framework is primarily designed for optimal execution on a commodity cluster of homogeneous nodes, where all machines have identical hardware configurations. However, most organizations have surplus unused heterogeneous infrastructure that accumulates over time, with machines that vary in number of CPU cores, RAM and disk speed. The question for an organization is whether it is efficient to set up a heterogeneous Hadoop cluster on the available mix of differently configured nodes to execute its analytic workload; doing so may help organizations reduce both e-waste and the cost of data analytics. In this paper, we propose a simulator-based what-if engine to predict the job execution time of a MapReduce application for varying cluster sizes, varying types of heterogeneity in the cluster, and growing data sizes. The simulator has been validated on three open source MapReduce benchmarks and two industrial Hive-based analytic workloads, on three different heterogeneous clusters with data sizes up to 100 GB. The largest cluster considered has 484 cores with three types of hardware nodes. We have observed the average prediction error to be within 10% of the actual job execution time.
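The paper does not publish its simulator, but the core what-if idea can be illustrated with a minimal sketch: given per-node-type task throughputs (assumed inputs, not figures from the paper), replay map tasks against heterogeneous slots to estimate the phase makespan, mimicking Hadoop's pull-based task scheduling.

```python
import heapq

def simulate_map_phase(num_tasks, nodes):
    """Estimate map-phase makespan on a heterogeneous cluster.

    nodes: list of (num_slots, task_time_seconds) tuples, one per
    hardware type. Tasks are assigned greedily to whichever slot
    frees up first, as in Hadoop's pull-based scheduling.
    """
    # One heap entry per slot: (time the slot becomes free, per-task time)
    slots = []
    for num_slots, task_time in nodes:
        for _ in range(num_slots):
            heapq.heappush(slots, (0.0, task_time))
    finish = 0.0
    for _ in range(num_tasks):
        free_at, task_time = heapq.heappop(slots)
        done = free_at + task_time
        finish = max(finish, done)
        heapq.heappush(slots, (done, task_time))
    return finish

# Faster nodes naturally absorb more tasks, so adding a slow node
# still shortens (or at worst preserves) the estimated makespan.
```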
ieee india conference | 2012
Rekha Singhal; Manoj K. Nambiar
In typical database application development, the requirement is to optimize SQL queries to meet service level agreements (SLAs); the optimized queries are tested on the application development database, which is some fraction of the production database. As the database grows over time, the previously optimized queries may no longer meet the SLA, and once the application is launched and deployed it becomes difficult and expensive to modify the SQL queries. In this paper, we discuss a model for predicting, at application development time, the SQL query cost and hence the SQL query elapsed response time (ERT) as the database grows. We identify and discuss the database statistics that can affect SQL query cost as the database size increases, and show how these can be used to predict ERT. We have tested the model on Oracle 10g and present the results in the paper.
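The abstract's central idea, extrapolating optimizer cost with table growth, can be sketched with a toy model (not the paper's actual model): a full scan scales roughly linearly with row count, while a B-tree index lookup scales roughly with tree height, i.e. logarithmically.

```python
import math

def predict_query_cost(base_cost, base_rows, new_rows, access="full_scan"):
    """Hypothetical extrapolation of optimizer cost under table growth.

    base_cost: cost measured on the small development database.
    base_rows / new_rows: table cardinality at development vs. target size.
    access: dominant access path of the query plan.
    """
    if access == "full_scan":
        # I/O roughly proportional to number of blocks, hence rows.
        return base_cost * (new_rows / base_rows)
    if access == "index_lookup":
        # Cost grows with B-tree height, i.e. log of the row count.
        return base_cost * (math.log(new_rows) / math.log(base_rows))
    raise ValueError("unknown access path: " + access)
```

A real model, as the paper notes, would also track optimizer statistics such as selectivity and clustering factor, which can flip the chosen plan itself as data grows.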
measurement and modeling of computer systems | 2016
Manoj K. Nambiar; Ajay Kattepur; Gopal Bhaskaran; Rekha Singhal; Subhasri Duttagupta
Performance model solvers and simulation engines have been around for more than two decades. Yet performance modeling has not received wide acceptance in the software industry, unlike the pervasive use of modeling and simulation tools in other industries. This paper explores the underlying causes and looks at the challenges that must be overcome to increase the utility of performance modeling for making critical decisions on software-based products and services. Multiple real-world case studies and examples are included to highlight our viewpoints on performance engineering. Finally, we conclude with some possible directions the performance modeling community could take toward the better predictive capabilities required for industrial use.
Technology Conference on Performance Evaluation and Benchmarking | 2017
Rekha Singhal; Praveen Kumar Singh
The wide availability of open source big data processing frameworks, such as Spark, has increased the migration of existing applications and the deployment of new applications to these cost-effective platforms. One challenge is assuring the performance of an application as data size grows in the production system. We address this problem for the Spark platform using a performance prediction model in the development environment. We propose a grey-box approach to estimate an application's execution time on a Spark cluster at larger data sizes, using measurements on low-volume data in a small cluster. The proposed model may also be used iteratively to estimate the cluster size needed for the desired application performance in the production environment. We discuss both machine learning and analytic techniques to build the model. The model is also flexible with respect to different Spark cluster configurations; this flexibility enables its use with optimization techniques to tune Spark parameters for optimal performance of a deployed application. Our key innovations in building the Spark performance prediction model are support for different configurations of the Spark platform, and a simulator to estimate Spark stage execution time that accounts for task execution variability due to HDFS, data skew and cluster node heterogeneity. We show that our proposed approaches predict within a 20% error bound for Wordcount, Terasort, K-means and several TPC-H SQL workloads.
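The machine-learning side of such a grey-box approach can be illustrated in miniature (this is a sketch under simplifying assumptions, not the paper's model): measure runtimes at a few small data sizes, fit a least-squares line, and extrapolate to the production data size.

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = a*x + b from small-scale measurements."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

def predict_runtime(sizes_gb, runtimes_s, target_gb):
    """Extrapolate application runtime to a larger data size, assuming
    stage runtime scales roughly linearly with input size. Real Spark
    stages deviate from this (shuffle costs, skew, stragglers), which is
    why the paper pairs regression with a stage-level simulator.
    """
    a, b = fit_linear(sizes_gb, runtimes_s)
    return a * target_gb + b
```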
international conference on data technologies and applications | 2016
Chetan Phalak; Rekha Singhal; Tanmay Jhunjhunwala
The use of electronic media is increasing day by day, and with it the use of applications. This has resulted in rapid growth of application data, which may lead to violation of the service level agreement (SLA) given to users. To keep an application SLA-compliant, it is necessary to predict query response time before deployment. Query response time comprises two components: computation time and IO access time. The latter includes time spent fetching data from the disk subsystem and from the database/operating system (OS) cache. Correct prediction of query performance requires modeling cache behavior for growing data sizes. The complex nature of data storage and of query data access patterns makes a purely mathematical model of cache behavior difficult. In this paper, a Database Buffer Cache Simulator is proposed, which mimics the behavior of the database buffer cache and can be used to predict cache misses for different types of data access by a query. The simulator has been validated using Oracle 11g and the TPC-H benchmark, and is able to predict cache misses with an average error of 2%.
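A minimal version of such a buffer cache simulator can be sketched as an LRU replay over a trace of block accesses (a simplification: Oracle's actual buffer cache uses touch counts and multiple lists, not plain LRU):

```python
from collections import OrderedDict

class BufferCacheSimulator:
    """Toy LRU model of a database buffer cache: replay a trace of
    block accesses and count hits and misses."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.cache = OrderedDict()  # block_id -> True, LRU order
        self.hits = 0
        self.misses = 0

    def access(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)    # mark most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.cache[block_id] = True
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict the LRU block
```

Feeding the simulator a synthetic trace that matches a query's access pattern (sequential scan, index range scan, etc.) at the target data size yields the predicted miss count, and hence the IO component of response time.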
international conference on performance engineering | 2018
Rekha Singhal; Chetan Phalak; Praveen Kumar Singh
Spark is one of the most widely deployed in-memory big data technologies for parallel data processing across a cluster of machines. The availability of these big data platforms on commodity machines has raised the challenge of assuring application performance as data size increases. We have built a tool to assist application developers and testers in estimating an application's execution time for larger data sizes before deployment. Conversely, the tool may also be used to estimate the cluster size needed for the desired application performance in the production environment. The tool can also be used for detailed post-execution profiling of a Spark job to understand performance bottlenecks. It incorporates different Spark cluster configurations when estimating application performance, so it can also be used with optimization techniques to tune Spark parameters for optimal performance. The tool's key innovations are support for different configurations of the Spark platform for performance prediction, and a simulator to estimate Spark stage execution time that accounts for task execution variability due to HDFS, data skew and cluster node heterogeneity. The tool, using the model in [3], has been shown to predict within a 20% error bound for Wordcount, Terasort, K-means and several SQL workloads.
international conference on performance engineering | 2018
Todor Ivanov; Rekha Singhal
Distributed big data processing and analytics applications demand a comprehensive end-to-end architecture stack consisting of big data technologies. However, there are many possible architecture patterns (e.g. Lambda, Kappa or Pipeline architectures) to choose from when implementing the application requirements. A big data technology in isolation may perform best for a particular application, but its performance in combination with other technologies depends on the connectors and the environment. Similarly, existing big data benchmarks evaluate the performance of different technologies in isolation, but no work has been done on benchmarking big data architecture stacks as a whole. For example, BigBench (TPCx-BB) may be used to evaluate the performance of Spark, but is it applicable to PySpark, or to a Spark-with-Kafka stack as well? What is the impact of different programming environments and/or other technologies used alongside Spark? This vision paper proposes a new category of benchmark, called ABench, to fill this gap, and discusses key aspects necessary for the performance evaluation of different big data architecture stacks.
international conference on performance engineering | 2017
Rekha Singhal; Chetan Phalak
Typically, applications are tested on small data sizes for both functional and non-functional requirements. However, in the production environment, applications containing SQL queries may experience performance violations as data volume increases. There is a need for a tool that can test SQL query performance at large data sizes without elongating the application testing phase. In this paper, we present a tool for estimating SQL query execution time at large data sizes without actually generating and loading the large volume of data. The model behind the tool has been validated on TPC-H benchmarks and industry applications, predicting within a 10% average error. The tool is built on the popular open source project CoDD, with improved project management and user interfaces.
international conference on performance engineering | 2017
Rekha Singhal; Shruti Kunde
Application and/or data migration is a result of limitations in existing system architecture to handle new requirements and the availability of newer, more efficient technology. In any big data architecture, technology migration is staggered across multiple levels and poses functional (related to components of the architecture and underlying infrastructure) and non-functional (QoS) challenges such as availability, reliability and performance guarantees in the target architecture. In this paper, (1) we outline a big data architecture stack and identify research problems arising out of the technology migration in this scenario (2) we propose a smart rule engine system which facilitates the decision making process for the technology to be used at different layers in the architecture during migration.
international conference on performance engineering | 2016
Rekha Singhal
Digitization of user services and cheap access to the internet have led to two critical problems: quick response to end-user queries, and faster analysis of large accumulated data to serve users better. This has also led to the advent of various big data processing technologies, each of which has architecture-specific parameters to tune for optimal execution of an application. There are also challenges in the optimal scheduling of analytic queries for faster analysis, which leads to the problem of estimating analytic query execution time at large data sizes on the production system. A production system may be an enterprise database system or a cluster of machines running Hadoop or similar, where each machine may have a different hardware configuration (known as a heterogeneous environment). In the first part of this tutorial, we present the need and challenges for tuning big data applications on various platforms, followed by a discussion of various existing solutions for application tuning. The second part of the tutorial presents the challenges and state of the art in estimating application execution time.