Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Harshad Kasture is active.

Publication


Featured research published by Harshad Kasture.


High-Performance Computer Architecture (HPCA) | 2010

Graphite: A Distributed Parallel Simulator for Multicores

Jason E. Miller; Harshad Kasture; George Kurian; Charles Gruenwald; Nathan Beckmann; Christopher Celio; Jonathan Eastep; Anant Agarwal

This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multi-core processors containing dozens, hundreds, or even thousands of cores. It provides high performance for fast design space exploration and software development. Several techniques are used to achieve this including: direct execution, seamless multicore and multi-machine distribution, and lax synchronization. Graphite is capable of accelerating simulations by distributing them across multiple commodity Linux machines. When using multiple machines, it provides the illusion of a single process with a single, shared address space, allowing it to run off-the-shelf pthread applications with no source code modification. Our results demonstrate that Graphite can simulate target architectures containing over 1000 cores on ten 8-core servers. Performance scales well as more machines are added with near linear speedup in many cases. Simulation slowdown is as low as 41× versus native execution.
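
A minimal sketch of the lax-synchronization idea from the abstract, not Graphite's actual implementation: each simulated core advances a private clock and only stalls when it drifts too far ahead of the slowest core. The core count, event costs, and slack bound below are illustrative assumptions.

```python
"""Toy model of lax synchronization; all parameters are made up."""
import random

NUM_CORES = 8
SLACK = 1000          # max allowed skew ahead of the slowest core, in cycles
SIM_END = 100_000     # stop once every core's clock passes this time

clocks = [0] * NUM_CORES

def runnable(cid):
    # A core may advance only while it is within SLACK cycles of the
    # slowest core; otherwise it waits for stragglers to catch up.
    return clocks[cid] <= min(clocks) + SLACK

while min(clocks) < SIM_END:
    for cid in range(NUM_CORES):
        if runnable(cid):
            clocks[cid] += random.randint(1, 20)  # cost of next event

# Skew stays bounded by SLACK plus one event's worth of cycles.
print("final skew:", max(clocks) - min(clocks), "cycles")
```

Bounding clock skew instead of enforcing cycle-by-cycle lockstep is what lets a simulation of this kind scale across threads and machines.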


Architectural Support for Programming Languages and Operating Systems (ASPLOS) | 2014

Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads

Harshad Kasture; Daniel Sanchez

Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, and have inherently low utilization. On the other hand, compute-intensive batch applications (e.g., MapReduce) only need high long-term average performance. In current CMPs, latency-critical and batch applications cannot run concurrently due to interference on shared resources. Unfortunately, prior work on quality of service (QoS) in CMPs has focused on guaranteeing average performance, not tail latency. In this work, we analyze several latency-critical workloads, and show that guaranteeing average performance is insufficient to maintain low tail latency, because microarchitectural resources with state, such as caches or cores, exert inertia on instantaneous workload performance. Last-level caches impart the highest inertia, as workloads take tens of milliseconds to warm them up. When left unmanaged, or when managed with conventional QoS frameworks, shared last-level caches degrade tail latency significantly. Instead, we propose Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications. Using extensive simulations, we show that, while conventional QoS frameworks degrade tail latency by up to 2.3x, Ubik simultaneously maintains the tail latency of latency-critical workloads and significantly improves the performance of batch applications.
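
A toy sketch of the transient-awareness idea behind Ubik, not the paper's algorithm: cache capacity may be lent to batch applications during a latency-critical application's quiet phase only if the reclaimed partition can be re-warmed before the next burst. The warm-up model and all constants are assumptions.

```python
"""Illustrative lending rule under an assumed linear warm-up model."""

def safe_to_lend(quiet_ms, warmup_ms_per_way, ways_lent, margin_ms=2.0):
    # Re-warming reclaimed ways takes time proportional to how many were
    # lent; lending is safe only if the quiet phase comfortably covers it.
    rewarm_ms = warmup_ms_per_way * ways_lent
    return quiet_ms > rewarm_ms + margin_ms

# Example: a 50 ms lull, roughly 3 ms to re-warm each way.
for ways in range(1, 17):
    if not safe_to_lend(quiet_ms=50.0, warmup_ms_per_way=3.0, ways_lent=ways):
        print(f"lend at most {ways - 1} ways during this lull")
        break
```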


International Symposium on Microarchitecture (MICRO) | 2015

Rubik: Fast Analytical Power Management for Latency-Critical Systems

Harshad Kasture; Davide Basilio Bartolini; Nathan Beckmann; Daniel Sanchez

Latency-critical workloads (e.g., web search), common in datacenters, require stable tail (e.g., 95th percentile) latencies of a few milliseconds. Servers running these workloads are kept lightly loaded to meet these stringent latency targets. This low utilization wastes billions of dollars in energy and equipment annually. Applying dynamic power management to latency-critical workloads is challenging. The fundamental issue is coping with their inherent short-term variability: requests arrive at unpredictable times and have variable lengths. Without knowledge of the future, prior techniques either adapt slowly and conservatively or rely on application-specific heuristics to maintain tail latency. We propose Rubik, a fine-grain DVFS scheme for latency-critical workloads. Rubik copes with variability through a novel, general, and efficient statistical performance model. This model allows Rubik to adjust frequencies at sub-millisecond granularity to save power while meeting the target tail latency. Rubik saves up to 66% of core power, widely outperforms prior techniques, and requires no application-specific tuning. Beyond saving core power, Rubik robustly adapts to sudden changes in load and system performance. We use this capability to design RubikColoc, a co-location scheme that uses Rubik to allow batch and latency-critical work to share hardware resources more aggressively than prior techniques. RubikColoc reduces data-center power by up to 31% while using 41% fewer servers than a datacenter that segregates latency-critical and batch work, and achieves 100% core utilization.
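
A simplified sketch of the fine-grain DVFS idea, not Rubik's statistical performance model: on each adjustment interval, pick the lowest frequency that can drain the current request queue before the tail-latency deadline. The frequency table, per-request cycle count, and deadline are illustrative assumptions.

```python
"""Deadline-driven frequency selection under assumed parameters."""

FREQS_GHZ = [1.2, 1.6, 2.0, 2.4, 2.8]   # available DVFS states, ascending

def pick_frequency(queued_reqs, cycles_per_req, deadline_ms):
    # Total work that must finish before the oldest request's deadline.
    cycles_needed = queued_reqs * cycles_per_req
    for f in FREQS_GHZ:  # prefer the lowest (cheapest) frequency
        time_ms = cycles_needed / (f * 1e9) * 1e3
        if time_ms <= deadline_ms:
            return f
    return FREQS_GHZ[-1]  # even max frequency may miss; run flat out

# Example: 5 queued requests of ~2M cycles each, 5 ms tail-latency target.
print(pick_frequency(queued_reqs=5, cycles_per_req=2_000_000, deadline_ms=5.0))
```

Rubik itself models request arrival and length distributions statistically rather than assuming fixed per-request work; the sketch only shows why sub-millisecond frequency decisions can save power without missing the target.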


IEEE International Symposium on Workload Characterization (IISWC) | 2016

TailBench: A Benchmark Suite and Evaluation Methodology for Latency-Critical Applications

Harshad Kasture; Daniel Sanchez

Latency-critical applications, common in datacenters, must achieve small and predictable tail (e.g., 95th or 99th percentile) latencies. Their strict performance requirements limit utilization and efficiency in current datacenters. These problems have sparked research in hardware and software techniques that target tail latency. However, research in this area is hampered by the lack of a comprehensive suite of latency-critical benchmarks. We present TailBench, a benchmark suite and evaluation methodology that makes latency-critical workloads as easy to run and characterize as conventional, throughput-oriented ones. TailBench includes eight applications that span a wide range of latency requirements and domains, and a harness that implements a robust and statistically sound load-testing methodology. The modular design of the TailBench harness facilitates multiple load-testing scenarios, ranging from multi-node configurations that capture network overheads, to simplified single-node configurations that allow measuring tail latency in simulation. Validation results show that the simplified configurations are accurate for most applications. This flexibility enables rapid prototyping of hardware and software techniques for latency-critical workloads.
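
A toy open-loop load generator in the spirit of the methodology described above, not the TailBench harness itself: requests arrive at exponentially distributed intervals regardless of whether earlier requests have finished, so queueing delay shows up in the measured tail latency. The service function and load parameters are stand-ins.

```python
"""Open-loop load testing with tail-latency reporting; parameters assumed."""
import random
import time

def serve(req_id):
    time.sleep(random.uniform(0.001, 0.005))  # stand-in for real work

def run(qps=200, n_reqs=500):
    latencies = []
    next_arrival = time.monotonic()
    for i in range(n_reqs):
        # Open-loop: the arrival process is independent of service times,
        # so a slow server accumulates queueing delay instead of hiding it.
        next_arrival += random.expovariate(qps)
        now = time.monotonic()
        if now < next_arrival:
            time.sleep(next_arrival - now)
        serve(i)
        # Latency is measured from the intended arrival time, including
        # any time the request spent waiting behind earlier ones.
        latencies.append((time.monotonic() - next_arrival) * 1e3)
    latencies.sort()
    print(f"p95 = {latencies[int(0.95 * n_reqs)]:.2f} ms, "
          f"p99 = {latencies[int(0.99 * n_reqs)]:.2f} ms")

run()
```

A closed-loop client, by contrast, waits for each response before issuing the next request, which understates tail latency under load; the open-loop discipline is the statistically sound choice for latency-critical workloads.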


Design Automation Conference (DAC) | 2012

The Case for Elastic Operating System Services in fos

Lamia Youseff; Charles Gruenwald; Nathan Beckmann; David Wentzlaff; Harshad Kasture; Anant Agarwal

Given exponential scaling, it will not be long before chips with hundreds of cores are standard. For OS designers, this trend provides a new opportunity to explore different research directions in scaling operating systems. The primary question facing OS designers over the next ten years will be: what is the correct design for OS services that will scale up to hundreds or thousands of cores and adapt to unprecedented variability in demand for system resources? A fundamental research challenge addressed in this paper is to identify the characteristics of such a scalable OS service for future multicore and cloud computing chips. We argue that OS services must deploy elastic techniques to adapt to this variability at runtime. In this paper, we advocate for elastic OS services and illustrate their feasibility and effectiveness in meeting variable demands by building elastic implementations of OS services in the fos operating system. We furthermore showcase a prototype elastic file system service in fos and illustrate its effectiveness in meeting variable demands.
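
A toy sketch of an elastic OS service in the spirit of this paper, not fos code: a service fleet spawns another server when its per-server queue depth grows and retires one when the fleet sits idle. The thresholds and queue model are assumptions.

```python
"""Elastic service-fleet scaling; thresholds are illustrative."""

class ElasticService:
    def __init__(self, min_servers=1, max_servers=16):
        self.servers = min_servers
        self.min_servers, self.max_servers = min_servers, max_servers

    def rescale(self, queue_depth):
        # Grow when each server has too many pending requests; shrink when
        # the fleet is mostly idle. The gap between thresholds provides
        # hysteresis so the fleet does not oscillate.
        per_server = queue_depth / self.servers
        if per_server > 8 and self.servers < self.max_servers:
            self.servers += 1   # spawn another server on a spare core
        elif per_server < 2 and self.servers > self.min_servers:
            self.servers -= 1   # retire a server, freeing its core
        return self.servers

svc = ElasticService()
for depth in [4, 40, 80, 80, 20, 4, 1]:
    print(f"queue={depth:3d} -> servers={svc.rescale(depth)}")
```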


International Conference on Parallel Architectures and Compilation Techniques (PACT) | 2017

POSTER: Improving Datacenter Efficiency Through Partitioning-Aware Scheduling

Harshad Kasture; Xu Ji; Nosayba El-Sayed; Nathan Beckmann; Xiaosong Ma; Daniel Sanchez

Datacenter servers often colocate multiple applications to improve utilization and efficiency. However, colocated applications interfere in shared resources, e.g., the last-level cache (LLC) and DRAM bandwidth, causing performance inefficiencies. Prior work has proposed two disjoint approaches to address interference. First, techniques that partition shared resources like the LLC can provide isolation and trade performance among colocated applications within a single node. But partitioning techniques are limited by the fixed resource demands of the applications running on the node. Second, interference-aware schedulers try to find resource-compatible applications and schedule them across nodes to improve performance. But prior schedulers are hampered by the lack of partitioning hardware in conventional multicores, and are forced to take conservative colocation decisions, leaving significant performance on the table. We show that memory-system partitioning and scheduling are complementary, and performing them in a coordinated fashion yields significant benefits. We present Shepherd, a joint scheduler and resource partitioner that seeks to maximize cluster-wide throughput. Shepherd uses detailed application profiling data to partition the shared LLC and to estimate the impact of DRAM bandwidth contention among colocated applications. Shepherd's scheduler leverages this information to colocate applications with complementary resource requirements, improving resource utilization and cluster throughput. We evaluate Shepherd in simulation and on a real cluster with hardware support for cache partitioning. When managing mixes of server and scientific applications, Shepherd improves cluster throughput over an unpartitioned system by 38% on average.
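
A toy sketch of the coordination idea, not Shepherd's algorithm: given per-application miss curves (misses as a function of LLC ways), compute the way split that minimizes combined misses for each candidate pair, then greedily colocate the most complementary pairs. The miss curves below are synthetic.

```python
"""Partitioning-aware pairing over synthetic miss curves."""
from itertools import combinations

WAYS = 16

def best_split(curve_a, curve_b):
    # Exhaustively split WAYS between two apps (each gets at least one
    # way); return (combined misses, ways assigned to the first app).
    return min((curve_a[w] + curve_b[WAYS - w], w) for w in range(1, WAYS))

# Synthetic miss curves: "cache-hungry" improves steadily with more ways,
# "streaming" barely benefits beyond a couple of ways.
hungry    = [1000 - 55 * w for w in range(WAYS + 1)]
streaming = [800 - (200 if w >= 2 else 100 * w) for w in range(WAYS + 1)]

apps = {"hungryA": hungry, "hungryB": hungry,
        "streamA": streaming, "streamB": streaming}

# Greedy pairing: repeatedly colocate the two apps whose best split yields
# the fewest combined misses. Hungry apps end up paired with streaming
# ones, which cede most of the cache.
names = list(apps)
while names:
    (misses, w), (a, b) = min(
        (best_split(apps[a], apps[b]), (a, b))
        for a, b in combinations(names, 2))
    print(f"colocate {a}+{b}: {w} ways to {a}, {WAYS - w} to {b} "
          f"({misses} misses)")
    names.remove(a)
    names.remove(b)
```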


Archive | 2009

Self-Aware Computing

Anant Agarwal; Jason E. Miller; Jonathan Eastep; David Wentzlaff; Harshad Kasture


Archive | 2014

PIKA: A Network Service for Multikernel Operating Systems

Nathan Beckmann; Charles Gruenwald; Christopher R. Johnson; Harshad Kasture; Filippo Sironi; Anant Agarwal; M. Frans Kaashoek; Nickolai Zeldovich


Archive | 2011

Fleets: Scalable Services in a Factored Operating System

David Wentzlaff; Charles Gruenwald; Nathan Beckmann; Adam Belay; Harshad Kasture; Kevin Modzelewski; Lamia Youseff; Jason E. Miller; Anant Agarwal


High-Performance Computer Architecture (HPCA) | 2018

KPart: A Hybrid Cache Partitioning-Sharing Technique for Commodity Multicores

Nosayba El-Sayed; Anurag Mukkara; Po-An Tsai; Harshad Kasture; Xiaosong Ma; Daniel Sanchez

Collaboration


Dive into Harshad Kasture's collaborations.

Top Co-Authors

Nathan Beckmann (Massachusetts Institute of Technology)

Charles Gruenwald (Massachusetts Institute of Technology)

Daniel Sanchez (Massachusetts Institute of Technology)

Jason E. Miller (Massachusetts Institute of Technology)

Jonathan Eastep (Massachusetts Institute of Technology)

Nosayba El-Sayed (Massachusetts Institute of Technology)

Xiaosong Ma (North Carolina State University)