Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Navendu Jain is active.

Publication


Featured research published by Navendu Jain.


ACM Special Interest Group on Data Communication (SIGCOMM) | 2011

Understanding network failures in data centers: measurement, analysis, and implications

Phillipa Gill; Navendu Jain; Nachiappan Nagappan

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software-related faults, (4) failures have the potential to cause the loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.
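
The redundancy-effectiveness result is computed by comparing traffic delivered during a failure against traffic before it, at the level of individual links and of their redundancy groups. A minimal sketch of that style of computation on hypothetical event records (the field names and numbers are invented, not the paper's data):

```python
# Sketch: estimating how much redundancy reduces failure impact, in the
# spirit of the paper's metric. Event records and field names are
# hypothetical; the real study uses operator-collected traffic/failure logs.
from statistics import median

# Each failure event: traffic on the failed link and on its whole
# redundancy group, measured before and during the failure (bytes/s).
events = [
    {"link_before": 100.0, "link_during": 10.0,
     "group_before": 200.0, "group_during": 170.0},
    {"link_before": 80.0,  "link_during": 0.0,
     "group_before": 160.0, "group_during": 120.0},
    {"link_before": 50.0,  "link_during": 25.0,
     "group_before": 100.0, "group_during": 95.0},
]

def ratio(during, before):
    return during / before if before > 0 else 1.0

# Fraction of pre-failure traffic still delivered, per event.
link_ratios = [ratio(e["link_during"], e["link_before"]) for e in events]
group_ratios = [ratio(e["group_during"], e["group_before"]) for e in events]

print("median traffic carried on failed link:", median(link_ratios))
print("median traffic carried across group  :", median(group_ratios))
# Redundancy is effective to the extent the group-level ratio exceeds
# the link-level ratio at the median.
```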


International Conference on Management of Data | 2006

Design, implementation, and evaluation of the linear road benchmark on the stream processing core

Navendu Jain; Lisa Amini; Henrique Andrade; Richard P. King; Yoonho Park; Philippe Selo; Chitra Venkatramani

Stream processing applications have recently gained significant attention in the networking and database community. At the core of these applications is a stream processing engine that performs resource allocation and management to support continuous tracking of queries over collections of physically-distributed and rapidly-updating data streams. While numerous stream processing systems exist, there has been little work on understanding the performance characteristics of these applications in a distributed setup. In this paper, we examine the performance bottlenecks of streaming data applications, in particular the Linear Road stream data management benchmark, in achieving good performance in large-scale distributed environments, using the Stream Processing Core (SPC), a stream processing middleware we have developed. First, we present the design and implementation of the Linear Road benchmark on the SPC middleware. SPC has been designed to scale to tens of thousands of processing nodes, while supporting concurrent applications and multiple simultaneous queries. Second, we identify the main performance bottlenecks in the Linear Road application in achieving scalability and low query response latency. Our results show that data locality, buffer capacity, physical allocation of processing elements to infrastructure nodes, and packaging for transporting streamed data are important factors in achieving good application performance. Though we evaluate our system primarily for the Linear Road application, we believe it also provides useful insights into the overall system behavior for supporting other distributed and large-scale continuous streaming data applications. Finally, we examine how SPC can be used and tuned to enable a very efficient implementation of the Linear Road application in a distributed environment.
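
One of the bottleneck factors named above, buffer capacity between producer and consumer processing elements, can be illustrated with a toy simulation. The element rates and capacities below are invented and unrelated to SPC's actual implementation:

```python
# Toy model of a two-element pipeline with a bounded buffer between a bursty
# producer PE and a steady consumer PE. All rates and capacities are invented;
# this is not SPC code, only an illustration of why buffer sizing matters.
from collections import deque

def simulate(buffer_capacity, steps=1000):
    buf = deque()
    consumed = dropped = 0
    for t in range(steps):
        arrivals = 10 if t % 5 == 0 else 0      # burst of 10 tuples every 5 steps
        for _ in range(arrivals):
            if len(buf) < buffer_capacity:
                buf.append(1)
            else:
                dropped += 1                    # buffer full: tuple is shed
        for _ in range(min(3, len(buf))):       # consumer drains 3 tuples/step
            buf.popleft()
            consumed += 1
    return consumed, dropped

for cap in (4, 16, 64):
    consumed, dropped = simulate(cap)
    print(f"capacity={cap:3d}  consumed={consumed}  dropped={dropped}")
```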


International Conference on Computer Communications | 2011

Managing cost, performance, and reliability tradeoffs for energy-aware server provisioning

Brian K. Guenter; Navendu Jain; Charles J. Williams

We present ACES, an automated server provisioning system that aims to meet workload demand while minimizing energy consumption in data centers. To perform energy-aware server provisioning, ACES faces three key tradeoffs between cost, performance, and reliability: (1) maximizing energy savings vs. minimizing unmet load demand, (2) managing low power draw vs. high transition latencies for multiple power management schemes, and (3) balancing energy savings vs. reliability costs of server components due to on-off cycles. To address these challenges, ACES (1) predicts demand in the near future to turn on servers gradually before they are needed and avoids turning on unnecessary servers to cope with transient load spikes, (2) formulates an optimization problem that minimizes a linear combination of unmet demand and total energy and reliability costs, and uses the program structure to solve the problem efficiently in practice, and (3) constructs an execution plan based on the optimization decisions to transition servers between different power states and actuates them using system and load management interfaces. Our evaluation on three data center workloads shows that ACES's energy savings are close to the optimal and it delivers power proportionality while balancing the tradeoff between energy savings and reliability costs.
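
The provisioning objective described above, a linear combination of unmet demand, energy, and reliability (on-off cycle) costs, can be sketched with a small dynamic program over server counts. The weights, demand trace, and solution method below are illustrative only and are not ACES's formulation or solver:

```python
# Minimal sketch of the kind of provisioning objective the paper describes:
# choose the number of "on" servers per time slot to minimize a weighted sum
# of unmet demand, energy, and power-on (reliability/wear) cost. Weights,
# demand trace, and this plain dynamic program are invented for illustration.
from functools import lru_cache

demand = [3, 5, 9, 12, 7, 4, 2]     # predicted servers needed per slot (made up)
MAX_SERVERS = 15
W_UNMET, W_ENERGY, W_CYCLE = 10.0, 1.0, 2.0   # hypothetical weights

@lru_cache(maxsize=None)
def best(t, prev_on):
    """Minimum cost for slots t..end, given prev_on servers were on at t-1."""
    if t == len(demand):
        return 0.0, ()
    best_cost, best_plan = float("inf"), ()
    for on in range(MAX_SERVERS + 1):
        cost = (W_UNMET * max(0, demand[t] - on)      # unmet demand
                + W_ENERGY * on                       # energy of active servers
                + W_CYCLE * max(0, on - prev_on))     # wear cost of power-ons
        rest_cost, rest_plan = best(t + 1, on)
        if cost + rest_cost < best_cost:
            best_cost, best_plan = cost + rest_cost, (on,) + rest_plan
    return best_cost, best_plan

cost, plan = best(0, 0)
print("total cost:", cost)
print("servers on per slot:", plan)
```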


International Conference on Distributed Computing Systems | 2006

Adaptive Control of Extreme-scale Stream Processing Systems

Lisa Amini; Navendu Jain; Anshul Sehgal; Jeremy I. Silber; Olivier Verscheure

Distributed stream processing systems offer a highly scalable and dynamically configurable platform for time-critical applications ranging from real-time, exploratory data mining to high performance transaction processing. Resource management for distributed stream processing systems is complicated by a number of factors: processing elements are constrained by their producer-consumer relationships, data and processing rates can be highly bursty, and traditional measures of effectiveness, such as utilization, can be misleading. In this paper, we propose a novel distributed, adaptive control algorithm that maximizes weighted throughput while ensuring stable operation in the face of highly bursty workloads. Our algorithm is designed to meet the challenges of extreme-scale stream processing systems, where overprovisioning is not an option, by making the best use of resources even when the offered load is greater than available resources. We have implemented our algorithm in a real-world distributed stream processing system and a simulation environment. Our results show that our algorithm is not only self-stabilizing and robust to errors, but also outperforms traditional approaches over a broad range of buffer sizes, processing graphs, and burstiness types and levels.
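
As a rough illustration of distributed rate adaptation under bursty load (not the paper's control algorithm), here is a generic additive-increase/multiplicative-decrease heuristic driven by downstream buffer occupancy; all rates and thresholds are invented:

```python
# A generic back-pressure heuristic: the producer adapts its rate with
# additive increase / multiplicative decrease based on downstream buffer
# occupancy. This is NOT the paper's algorithm, just a sketch of the kind of
# distributed rate control applied between producer and consumer elements.
from collections import deque
import random

random.seed(1)
buf, capacity = deque(), 100
send_rate, service_rate = 10.0, 12
for step in range(50):
    # Producer emits at the current adapted rate, with bursty noise added.
    arrivals = max(0, int(send_rate + random.gauss(0, 3)))
    for _ in range(arrivals):
        if len(buf) < capacity:
            buf.append(1)
    # Consumer drains at its fixed service rate.
    for _ in range(min(service_rate, len(buf))):
        buf.popleft()
    # Control step: back off hard when the buffer is filling, probe up otherwise.
    occupancy = len(buf) / capacity
    if occupancy > 0.8:
        send_rate *= 0.5          # multiplicative decrease
    else:
        send_rate += 1.0          # additive increase
    if step % 10 == 0:
        print(f"step={step:2d}  rate={send_rate:5.1f}  occupancy={occupancy:.2f}")
```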


NETWORKING'11 Proceedings of the 10th International IFIP TC 6 Conference on Networking - Volume Part I | 2011

Online job-migration for reducing the electricity bill in the cloud

Niv Buchbinder; Navendu Jain; Ishai Menache

Energy costs are becoming the fastest-growing element in datacenter operation costs. One basic approach to reduce these costs is to exploit the spatiotemporal variation in electricity prices by moving computation to datacenters in which energy is available at a cheaper price. However, injudicious job migration between datacenters might increase the overall operation cost due to the bandwidth costs of transferring application state and data over the wide-area network. To address this challenge, we propose novel online algorithms for migrating batch jobs between datacenters, which handle the fundamental tradeoff between energy and bandwidth costs. A distinctive feature of our algorithms is that they consider not only the current availability and cost of (possibly multiple) energy sources, but also the future variability and uncertainty thereof. Using the framework of competitive analysis, we establish worst-case performance bounds for our basic online algorithm. We then propose a practical, easy-to-implement version of the basic algorithm, and evaluate it through simulations on real electricity pricing and job workload data. The simulation results indicate that our algorithm outperforms plausible greedy algorithms that ignore future outcomes. Notably, the actual performance of our approach is significantly better than the theoretical guarantees, within 6% of the optimal offline solution.
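
To make the energy-versus-bandwidth tradeoff concrete, here is a rent-or-buy style migration rule in the same spirit: stay put until the accumulated price gap exceeds the one-time bandwidth cost of moving. This is an illustrative rule, not the paper's online algorithm, and the prices and migration cost below are made up:

```python
# Hedged sketch of the energy-vs-bandwidth tradeoff (NOT the paper's algorithm):
# keep a job where it is; migrate to the other datacenter only once the
# accumulated price gap exceeds the one-time bandwidth cost of moving it.

def migrate_schedule(prices_a, prices_b, bandwidth_cost):
    """prices_*: per-slot electricity cost of running the job in each DC."""
    location, gap, plan = "A", 0.0, []
    for pa, pb in zip(prices_a, prices_b):
        here, there = (pa, pb) if location == "A" else (pb, pa)
        gap = max(0.0, gap + (here - there))   # savings forgone by staying put
        if gap >= bandwidth_cost:              # regret now covers the move's cost
            location = "B" if location == "A" else "A"
            gap = 0.0
        plan.append(location)
    return plan

prices_a = [5, 5, 9, 9, 9, 9, 4, 4]    # made-up per-slot prices
prices_b = [6, 6, 5, 5, 5, 5, 6, 6]
print(migrate_schedule(prices_a, prices_b, bandwidth_cost=6.0))
```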


Lecture Notes in Computer Science | 2002

Verification of Timed Automata via Satisfiability Checking

Peter Niebert; Moez Mahfoudh; Eugene Asarin; Marius Bozga; Oded Maler; Navendu Jain

In this paper we show how to translate bounded-length verification problems for timed automata into formulae in difference logic, a propositional logic enriched with timing constraints. We describe the principles of a satisfiability checker specialized for this logic that we have implemented and report some preliminary experimental results.
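
Difference logic atoms have the form x - y <= c. A conjunction of such atoms is satisfiable exactly when the corresponding constraint graph has no negative cycle, which the standard Bellman-Ford check below decides; this is a textbook illustration of the logic, not the specialized solver the paper describes:

```python
# Satisfiability of a conjunction of difference-logic atoms x - y <= c:
# build an edge y -> x with weight c and look for a negative cycle
# (Bellman-Ford with an implicit source). Textbook check, not the paper's solver.

def diff_logic_sat(constraints):
    """constraints: list of (x, y, c) meaning x - y <= c."""
    if not constraints:
        return True                              # empty conjunction is satisfiable
    nodes = {v for x, y, _ in constraints for v in (x, y)}
    dist = {v: 0 for v in nodes}                 # implicit source at distance 0
    edges = [(y, x, c) for x, y, c in constraints]   # edge y -> x, weight c
    for _ in range(len(nodes)):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return True                          # distances converged: satisfiable
    return False                                 # still relaxing: negative cycle, unsatisfiable

# x - y <= 2, y - z <= 3, z - x <= -6 is unsatisfiable (summing gives 0 <= -1).
print(diff_logic_sat([("x", "y", 2), ("y", "z", 3), ("z", "x", -6)]))   # False
print(diff_logic_sat([("x", "y", 2), ("y", "z", 3), ("z", "x", -5)]))   # True
```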


Internet Measurement Conference | 2013

Demystifying the dark side of the middle: a field study of middlebox failures in datacenters

Rahul Potharaju; Navendu Jain

Network appliances or middleboxes such as firewalls, intrusion detection and prevention systems (IDPS), load balancers, and VPNs form an integral part of datacenters and enterprise networks. Realizing their importance and shortcomings, the research community has proposed software implementations, policy-aware switching, consolidation appliances, moving middlebox processing to VMs, end hosts, and even offloading it to the cloud. While such efforts can use middlebox failure characteristics to improve their reliability, management, and cost-effectiveness, little has been reported on these failures in the field. In this paper, we make one of the first attempts to perform a large-scale empirical study of middlebox failures over two years in a service provider network comprising thousands of middleboxes across tens of datacenters. We find that middlebox failures are prevalent and they can significantly impact hosted services. Several of our findings differ in key aspects from commonly held views: (1) Most failures are grey, dominated by connectivity errors and link flaps that exhibit intermittent connectivity, (2) Hardware faults and overload problems are present but they are not in the majority, (3) Middleboxes experience a variety of misconfigurations such as incorrect rules, VLAN misallocation and mismatched keys, and (4) Middlebox failover is ineffective in about 33% of the cases for load balancers and firewalls due to configuration bugs, faulty failovers and software version mismatch. Finally, we analyze current middlebox proposals based on our study and discuss directions for future research.


Symposium on Cloud Computing | 2013

When the network crumbles: an empirical study of cloud network failures and their impact on services

Rahul Potharaju; Navendu Jain

The growing demand for always-on and low-latency cloud services is driving the creation of globally distributed datacenters. A major factor affecting service availability is the reliability of the network, both inside the datacenters and across the wide-area links connecting them. While several research efforts focus on building scale-out datacenter networks, little has been reported on real network failures and how they impact geo-distributed services. This paper makes one of the first attempts to characterize intra-datacenter and inter-datacenter network failures from a service perspective. We describe a large-scale study analyzing and correlating failure events over three years across multiple datacenters and thousands of network elements such as Access routers, Aggregation switches, Top-of-Rack switches, and long-haul links. Our study reveals several important findings on (a) the availability of network domains, (b) root causes, (c) service impact, (d) effectiveness of repairs, and (e) modeling failures. Finally, we outline steps based on existing network mechanisms to improve service availability.


Algorithmic Game Theory | 2011

A truthful mechanism for value-based scheduling in cloud computing

Navendu Jain; Ishai Menache; Joseph Naor; Jonathan Yaniv

We introduce a novel pricing and resource allocation approach for batch jobs on cloud systems. In our economic model, users submit jobs with a value function that specifies willingness to pay as a function of job due dates. The cloud provider in response allocates a subset of these jobs, taking advantage of the flexibility of allocating resources to jobs in the cloud environment. Focusing on social welfare as the system objective (especially relevant for private or in-house clouds), we construct a resource allocation algorithm which provides a small approximation factor that approaches 2 as the number of servers increases. An appealing property of our scheme is that jobs are allocated nonpreemptively, i.e., jobs run in one shot without interruption. This property has practical significance, as it avoids spending significant network and storage resources on checkpointing. Based on this algorithm, we then design an efficient truthful-in-expectation mechanism, which significantly improves the running complexity of black-box reduction mechanisms that can be applied to the problem, thereby facilitating its implementation in real systems.
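
As a toy illustration of the setting (not the paper's approximation algorithm or its truthful mechanism), the sketch below greedily packs jobs non-preemptively onto identical servers in decreasing value density, collecting a job's value only if it finishes by its deadline; all job data is invented:

```python
# Illustrative greedy allocation (NOT the paper's algorithm): consider jobs in
# decreasing value density and place each non-preemptively on the earliest-free
# server, collecting its value only if it completes by its deadline.
# A job: (name, length_slots, deadline_slot, value_if_completed)
jobs = [("A", 3, 5, 9.0), ("B", 2, 4, 8.0), ("C", 4, 6, 6.0), ("D", 1, 2, 3.0)]
NUM_SERVERS, HORIZON = 2, 6
busy_until = [0] * NUM_SERVERS          # next free slot on each server
schedule, welfare = [], 0.0

for name, length, deadline, value in sorted(jobs, key=lambda j: j[3] / j[1],
                                            reverse=True):
    server = min(range(NUM_SERVERS), key=lambda s: busy_until[s])
    start = busy_until[server]
    if start + length <= min(deadline, HORIZON):   # runs in one shot, on time
        busy_until[server] = start + length
        schedule.append((name, server, start, start + length))
        welfare += value
    # else: job is rejected and its value is not collected

print("schedule:", schedule)
print("social welfare:", welfare)
```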


ACM/IFIP/USENIX International Conference on Middleware | 2011

Scalable load balancing in cluster storage systems

Gae-won You; Seung-won Hwang; Navendu Jain

Enterprise and cloud data centers comprise tens of thousands of servers providing petabytes of storage to a large number of users and applications. At such a scale, these storage systems face two key challenges: (a) hot-spots due to the dynamic popularity of stored objects and (b) high reconfiguration costs of data migration due to bandwidth oversubscription in the data center network. Existing storage solutions, however, are unsuitable to address these challenges because of the large number of servers and data objects. This paper describes the design, implementation, and evaluation of Ursa, which scales to a large number of storage nodes and objects and aims to minimize latency and bandwidth costs during system reconfiguration. Toward this goal, Ursa formulates an optimization problem that selects a subset of objects from hot-spot servers and performs topology-aware migration to minimize reconfiguration costs. As exact optimization is computationally expensive, we devise scalable approximation techniques for node selection and efficient divide-and-conquer computation. Our evaluation shows Ursa achieves cost-effective load balancing while scaling to large systems and is time-responsive in computing placement decisions, e.g., about two minutes for 10K nodes and 10M objects.
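
A rough greedy heuristic in the spirit of the problem Ursa addresses (not its actual optimization) is sketched below: shed the hottest objects from overloaded servers, preferring same-rack targets so migration traffic avoids the oversubscribed core. The loads, racks, and threshold are invented:

```python
# Greedy, topology-aware offloading sketch (NOT Ursa's algorithm): move the
# hottest objects off hot-spot servers, preferring destinations in the same
# rack to keep migration traffic below the oversubscribed core links.
servers = {                      # server -> (rack, {object: load})
    "s1": ("rackA", {"o1": 40, "o2": 35, "o3": 10}),
    "s2": ("rackA", {"o4": 15}),
    "s3": ("rackB", {"o5": 20}),
}
THRESHOLD = 50                   # a server above this load is a hot-spot

def load(objs):
    return sum(objs.values())

moves = []
for src, (src_rack, objs) in servers.items():
    # Shed the hottest objects first until the server drops to the threshold.
    for obj in sorted(objs, key=objs.get, reverse=True):
        if load(objs) <= THRESHOLD:
            break
        candidates = [(dst_rack != src_rack,            # prefer same rack
                       load(dst_objs), dst)
                      for dst, (dst_rack, dst_objs) in servers.items()
                      if dst != src and load(dst_objs) + objs[obj] <= THRESHOLD]
        if not candidates:
            continue                                    # no target can absorb it
        _, _, dst = min(candidates)
        servers[dst][1][obj] = objs.pop(obj)
        moves.append((obj, src, dst))

print("migrations:", moves)
print({s: load(o) for s, (_, o) in servers.items()})
```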

Collaboration


Dive into Navendu Jain's collaborations.

Top Co-Authors

Michael Dahlin
University of Texas at Austin

Joseph Naor
Technion – Israel Institute of Technology

Yin Zhang
University of Texas at Austin

Jonathan Yaniv
Technion – Israel Institute of Technology