Kshitij Sudan
University of Utah
Publications
Featured research published by Kshitij Sudan.
International Conference on Parallel Architectures and Compilation Techniques | 2010
Manu Awasthi; David W. Nellans; Kshitij Sudan; Rajeev Balasubramonian; Al Davis
Modern processors such as Tilera's Tile64, Intel's Nehalem, and AMD's Opteron are migrating memory controllers (MCs) on-chip, while maintaining a large, flat memory address space. This trend to utilize multiple MCs will likely continue and a core or socket will consequently need to route memory requests to the appropriate MC via an inter- or intra-socket interconnect fabric similar to AMD's HyperTransport™, or Intel's Quick-Path Interconnect™. Such systems are therefore subject to non-uniform memory access (NUMA) latencies because of the time spent traveling to remote MCs. Each MC will act as the gateway to a particular piece of the physical memory. Data placement will therefore become increasingly critical in minimizing memory access latencies. To date, no prior work has examined the effects of data placement among multiple MCs in such systems. Future chip-multiprocessors are likely to comprise multiple MCs and an even larger number of cores. This trend will increase the memory access latency variation in these systems. Proper allocation of workload data to the appropriate MC will be important in reducing the latency of memory service requests. The allocation strategy will need to be aware of queuing delays, on-chip latencies, and row-buffer hit-rates for each MC. In this paper, we propose dynamic mechanisms that take these factors into account when placing data in appropriate slices of the physical memory. We introduce adaptive first-touch page placement, and dynamic page-migration mechanisms to reduce DRAM access delays for multi-MC systems. These policies yield average performance improvements of 17% for adaptive first-touch page-placement, and 35% for a dynamic page-migration policy.
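The placement policy described above can be pictured as a small cost model evaluated at page-fault time. The following is a minimal Python sketch of adaptive first-touch placement; the statistics tracked, the weights, and all names (MemoryController, placement_cost, first_touch_place) are illustrative assumptions, not the paper's implementation.

```python
class MemoryController:
    def __init__(self, mc_id):
        self.mc_id = mc_id
        self.queue_len = 0        # outstanding requests: a proxy for queuing delay
        self.row_hit_rate = 0.0   # observed row-buffer hit rate, in [0, 1]
        self.pages = set()        # physical pages currently homed at this MC

def placement_cost(mc, hops_from_core, w_queue=1.0, w_hops=1.0, w_rowhit=1.0):
    """Lower is better: penalize queuing delay and on-chip distance,
    reward a high row-buffer hit rate."""
    return (w_queue * mc.queue_len
            + w_hops * hops_from_core
            - w_rowhit * mc.row_hit_rate)

def first_touch_place(page, core, mcs, hops):
    """On the first access (page fault) from `core`, home `page` at the
    MC with the lowest estimated cost, rather than at a fixed,
    address-interleaved MC."""
    best = min(mcs, key=lambda mc: placement_cost(mc, hops[core][mc.mc_id]))
    best.pages.add(page)
    return best.mc_id
```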
High-Performance Computer Architecture | 2009
Manu Awasthi; Kshitij Sudan; Rajeev Balasubramonian; John B. Carter
In future multi-cores, large amounts of delay and power will be spent accessing data in large L2/L3 caches. It has been recently shown that OS-based page coloring allows a non-uniform cache architecture (NUCA) to provide low latencies and not be hindered by complex data search mechanisms. In this work, we extend that concept with mechanisms that dynamically move data within caches. The key innovation is the use of a shadow address space to allow hardware control of data placement in the L2 cache while being largely transparent to the user application and off-chip world. These mechanisms allow the hardware and OS to dynamically manage cache capacity per thread as well as optimize placement of data shared by multiple threads. We show an average IPC improvement of 10-20% for multi-programmed workloads with capacity allocation policies and an average IPC improvement of 8% for multi-threaded workloads with policies for shared page placement.
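A minimal sketch of the shadow-address idea follows, assuming 4 KB pages and cache-color bits sitting just above the page offset; the field widths, the recolor_table structure, and the function names are illustrative assumptions rather than the paper's hardware design.

```python
PAGE_BITS = 12                        # 4 KB pages (assumption)
COLOR_BITS = 6                        # 64 cache colors/banks (assumption)
COLOR_MASK = (1 << COLOR_BITS) - 1

recolor_table = {}                    # hardware table: page number -> new color

def recolor(page_number, new_color):
    """Policy decision (OS or hardware): move a page to another L2 slice."""
    recolor_table[page_number] = new_color & COLOR_MASK

def to_shadow(phys_addr):
    """Applied on the way into the L2: substitute the new color bits so
    the cache indexes the line into the desired bank. The TLB and the
    off-chip world continue to see the original physical address."""
    page = phys_addr >> PAGE_BITS
    if page not in recolor_table:
        return phys_addr
    offset = phys_addr & ((1 << PAGE_BITS) - 1)
    shadow_page = (page & ~COLOR_MASK) | recolor_table[page]
    return (shadow_page << PAGE_BITS) | offset
```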
High-Performance Computer Architecture | 2012
Manu Awasthi; Manjunath Shevgoor; Kshitij Sudan; Bipin Rajendran; Rajeev Balasubramonian; Viji Srinivasan
Many memory cell technologies are being considered as possible replacements for DRAM and Flash technologies, both of which are nearing their scaling limits. While these new cells (PCM, STT-RAM, FeRAM, etc.) promise high density, better scaling, and non-volatility, they introduce new challenges. Solutions at the architecture level can help address some of these problems; e.g., prior research has proposed wear-leveling and hard error tolerance mechanisms to overcome the limited write endurance of PCM cells. In this paper, we focus on the soft error problem in PCM, a topic that has received little attention in the architecture community. Soft errors in DRAM memories are typically addressed by having SECDED support and a scrub mechanism. The scrub mechanism scans the memory looking for a single-bit error and corrects it before the line experiences a second uncorrectable error. However, PCM (and other emerging memories) are prone to new sources of soft errors. In particular, multi-level cell (MLC) PCM devices will suffer from resistance drift, which increases the soft error rate and incurs high overheads for the scrub mechanism. This paper is the first to study the design of architectural scrub mechanisms, especially when tailored to the drift phenomenon in MLC PCM. Many of our solutions will also apply to other soft-error-prone emerging memories. We first show that scrub overheads can be reduced with support for strong ECC codes and a lightweight error detection operation. We then design different scrub algorithms that can adaptively trade off soft and hard errors. Using an approach that combines all proposed solutions, our scrub mechanism yields a 96.5% reduction in uncorrectable errors, a 24.4× decrease in scrub-related writes, and a 37.8% reduction in scrub energy, relative to a basic scrub algorithm used in modern DRAM systems.
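The detect-cheaply, correct-rarely structure of such a scrub mechanism can be sketched as a loop. This Python sketch assumes caller-supplied read_and_detect and strong_ecc_correct hooks, and the drift-aware rate adaptation shown is an illustrative assumption, not the paper's exact algorithm.

```python
import time

def scrub(memory_lines, read_and_detect, strong_ecc_correct,
          base_interval_s=1.0):
    """Walk memory periodically. A cheap detection pass touches every
    line; the expensive strong-ECC correction (and its write-back) runs
    only on lines that actually report an error, cutting scrub writes."""
    interval = base_interval_s
    while True:
        errors = 0
        for line in memory_lines:
            if read_and_detect(line):       # lightweight check
                strong_ecc_correct(line)    # costly path, taken rarely
                errors += 1
        # Drift-aware adaptation (assumption): scrub faster when the
        # observed error rate rises, back off when memory is quiet.
        interval = max(0.1, interval / 2) if errors else min(60.0, interval * 2)
        time.sleep(interval)
```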
International Conference on Parallel Architectures and Compilation Techniques | 2012
Kshitij Sudan; Sadagopan Srinivasan; Rajeev Balasubramonian; Ravi R. Iyer
Co-location of applications is a proven technique to improve hardware utilization. Recent advances in virtualization have made co-location of independent applications on shared hardware a common scenario in datacenters. Co-location while maintaining Quality-of-Service (QoS) for each application is a complex problem that is fast gaining relevance in these datacenters. The problem is exacerbated by the need for effective resource utilization at datacenter scales. In this work, we show that the memory system is a primary bottleneck in many workloads and is a more effective focal point when enforcing QoS. We examine four different memory system levers to enforce QoS: two that have been previously proposed, and two novel levers. We compare the effectiveness of each lever in minimizing power and resource needs while enforcing QoS guarantees. We also evaluate the effectiveness of combining various levers and show that this combined approach can yield power reductions.
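A lever-based QoS scheme of this kind can be pictured as a feedback loop that spends power only when a latency target is missed. The sketch below is a simplified illustration: the FrequencyLever class, the boost/cheapen interface, and the controller policy are all assumptions for exposition, not the four levers evaluated in the paper.

```python
class FrequencyLever:
    """Illustrative lever (assumption): steps the DRAM bus frequency."""
    def __init__(self, steps=(400, 667, 800), level=2):
        self.steps, self.level = steps, level
    def cheapen(self):                      # save power, lose performance
        if self.level > 0:
            self.level -= 1
            return True
        return False
    def boost(self):                        # restore performance
        if self.level < len(self.steps) - 1:
            self.level += 1
            return True
        return False

def enforce_qos(measure_latency, target_latency, levers):
    """One controller step: spend power only when the high-priority
    application misses its latency target, harvest power otherwise."""
    if measure_latency() > target_latency:
        for lever in levers:                # restore performance
            if lever.boost():
                break
    else:
        for lever in reversed(levers):      # harvest power while QoS holds
            if lever.cheapen():
                break
```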
International Symposium on Computer Architecture | 2010
David W. Nellans; Kshitij Sudan; Erik Brunvand; Rajeev Balasubramonian
Modern and future server-class processors will incorporate many cores. Some studies have suggested that it may be worthwhile to dedicate some of the many cores for specific tasks such as operating system execution. OS off-loading has two main benefits: improved performance due to better cache utilization and improved power efficiency due to smarter use of heterogeneous cores. However, OS off-loading is a complex process that involves balancing the overheads of off-loading against the potential benefit, which is unknown while making the off-loading decision. In prior work, OS off-loading has been implemented by first profiling system call behavior and then manually instrumenting some OS routines (out of hundreds) to support off-loading. We propose a hardware-based mechanism to help automate the off-load decision-making process, and provide high-quality dynamic decisions via performance feedback. Our mechanism dynamically estimates the off-load requirements of the application and relies on a run-length predictor for the upcoming OS system call invocation. The resulting hardware-based off-loading policy yields a throughput improvement of up to 18% over a baseline without off-loading, 13% over a static software-based policy, and 23% over a dynamic software-based policy.
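The off-load decision reduces to comparing a predicted OS run length against the cost of migrating execution to the OS core. A minimal sketch follows, with the migration cost and benefit-per-cycle figures as purely illustrative assumptions:

```python
MIGRATION_COST_CYCLES = 2_000     # round trip to the OS core (assumption)
CACHE_BENEFIT_PER_CYCLE = 0.1     # benefit of warm OS caches (assumption)

def should_offload(predicted_run_length):
    """Off-load a system call only if its predicted OS run length is long
    enough that the cache/energy benefit outweighs the migration cost."""
    expected_benefit = predicted_run_length * CACHE_BENEFIT_PER_CYCLE
    return expected_benefit > MIGRATION_COST_CYCLES
```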
High-Performance Computer Architecture | 2013
Kshitij Sudan; Saisanthosh Balakrishnan; Sean Lie; Min Xu; Dhiraj Mallick; Gary Lauterbach; Rajeev Balasubramonian
Large web-scale applications typically use a distributed platform, like clusters of commodity servers, to achieve scalable and low-cost processing. The Map-Reduce framework and its open-source implementation, Hadoop, are commonly used to program these applications. Since these applications scale well with an increased number of servers, the cluster size is an important parameter. Cluster size, however, is constrained by power consumption. In this paper we present a system that uses low-power CPUs to increase the cluster size within a fixed power budget. Using low-power CPUs leads to the situation where the majority of a server's power is now consumed by the I/O sub-system. To overcome this, we develop a virtualized I/O sub-system where multiple servers share I/O resources. An ASIC-based high-bandwidth interconnect fabric and FPGA-based I/O cards implement this virtualized I/O. The resulting system is the first production-quality implementation of a cluster-in-a-box that uses low-power CPUs. The unique design demonstrates a way to build systems using low-power CPUs, allowing a much larger number of servers in a cluster within the same power envelope. We also discuss operating system optimizations necessary to overcome software inefficiency and increase the utilization of virtualized disk bandwidth. We built hardware based on these ideas, and experiments on this system show a 3X average improvement in performance-per-Watt-hour compared to a commodity cluster with the same power budget.
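The core sizing argument is simple arithmetic: under a fixed power envelope, cheaper nodes mean more nodes. The wattages below are assumptions chosen only for illustration, not measurements from the paper.

```python
BUDGET_W = 10_000            # rack power budget (assumption)
COMMODITY_SERVER_W = 250     # typical commodity node (assumption)
LOW_POWER_SERVER_W = 25      # low-power node with shared, virtualized I/O (assumption)

commodity_nodes = BUDGET_W // COMMODITY_SERVER_W   # 40 nodes
low_power_nodes = BUDGET_W // LOW_POWER_SERVER_W   # 400 nodes
print(f"{commodity_nodes} commodity vs {low_power_nodes} low-power servers "
      f"in the same {BUDGET_W} W envelope")
```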
International Journal of Parallel Programming | 2012
Manu Awasthi; David W. Nellans; Kshitij Sudan; Rajeev Balasubramonian; Al Davis
Modern processors such as Tilera’s Tile64, Intel’s Nehalem, and AMD’s Opteron are migrating memory controllers (MCs) on-chip, while maintaining a large, flat memory address space. This trend to utilize multiple MCs will likely continue and a core or socket will consequently need to route memory requests to the appropriate MC via an inter- or intra-socket interconnect fabric similar to AMD’s HyperTransport™, or Intel’s Quick-Path Interconnect™. Such systems are therefore subject to non-uniform memory access (NUMA) latencies because of the time spent traveling to remote MCs. Each MC will act as the gateway to a particular region of the physical memory. Data placement will therefore become increasingly critical in minimizing memory access latencies. Increased competition for memory resources will also increase the memory access latency variation in future systems. Proper allocation of workload data to the appropriate MC will be important in decreasing the variation and average latency when servicing memory requests. The allocation strategy will need to be aware of queuing delays, on-chip latencies, and row-buffer hit-rates for each MC. In this paper, we propose dynamic mechanisms that take these factors into account when placing data in appropriate slices of physical memory. We introduce adaptive first-touch page placement, and dynamic page-migration mechanisms to reduce DRAM access delays for multi-MC systems. We also introduce policies that can handle data placement in memory systems that have regions with heterogeneous properties. The proposed policies yield average performance improvements of 6.5% for adaptive first-touch page-placement, and 8.9% for a dynamic page-migration policy for a system with homogeneous DRAM DIMMs. We also show improvements in systems that contain DIMMs with different performance characteristics.
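Where the sketch after the 2010 PACT entry above illustrated first-touch placement, the migration half of the proposal can be sketched as epoch-based counting: track which MC's cores touch a page, and move the page when a remote MC dominates. The counters, threshold, and function names here are illustrative assumptions.

```python
from collections import defaultdict

IMBALANCE = 2.0    # migrate when a remote MC sees 2x the home traffic (assumption)

# page -> (MC nearest the requesting core) -> access count, this epoch
access_counts = defaultdict(lambda: defaultdict(int))

def record_access(page, nearest_mc):
    access_counts[page][nearest_mc] += 1

def end_of_epoch(page, home_mc, migrate):
    """Called once per epoch: if another MC's cores dominate accesses to
    `page`, migrate it there (the OS remaps, hardware copies the data)."""
    counts = access_counts[page]
    if counts:
        busiest = max(counts, key=counts.get)
        if busiest != home_mc and counts[busiest] > IMBALANCE * counts.get(home_mc, 0):
            migrate(page, busiest)
            home_mc = busiest
        counts.clear()
    return home_mc
```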
International Symposium on Performance Analysis of Systems and Software | 2010
David W. Nellans; Kshitij Sudan; Rajeev Balasubramonian; Erik Brunvand
In the past ten years, computer architecture has seen a paradigm shift from emphasizing single-thread performance to energy-efficient, throughput-oriented chip multiprocessors. Several studies have suggested that it may be worthwhile to off-load execution of the operating system (OS) to one or more of these cores, or reconfigure hardware during OS execution. To be effective, these techniques must balance the cost of off-loading or re-configuration against the potential benefits, which are typically unknown at decision time. These decision points are typically implemented by manually instrumenting a few OS routines (out of hundreds). Such manual effort cannot be sustained across several operating systems and hardware configurations. We argue that decisions made in software are often sub-optimal because they are expensive in terms of run-time overhead and because applications vary in their use of OS features. We propose that these decision mechanisms should be supported through a hardware-based OS run-length predictor that removes the onus from OS developers. Our final design results in a 95% prediction accuracy for OS-intensive applications, while requiring only 2 KB of storage.
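A run-length predictor fitting in 2 KB can be sketched as a small last-value table with smoothing; the 256-entry geometry, the hash, and the smoothing constant are illustrative assumptions, not the paper's final design.

```python
ENTRIES = 256              # 256 entries x 8 bytes = 2 KB of state (assumption)
table = [0] * ENTRIES      # predicted OS run length per entry, in cycles

def index(syscall_id):
    return syscall_id % ENTRIES          # trivial hash (assumption)

def predict(syscall_id):
    return table[index(syscall_id)]

def update(syscall_id, actual_run_length):
    """Exponential smoothing toward the observed run length, so a single
    outlier invocation does not whipsaw the prediction."""
    i = index(syscall_id)
    table[i] = (table[i] * 3 + actual_run_length) // 4
```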
International Conference on Parallel Architectures and Compilation Techniques | 2011
Gagandeep S. Sachdev; Kshitij Sudan; Mary W. Hall; Rajeev Balasubramonian
Future scalable multi-core chips are expected to implement a shared last-level cache (LLC) with banks distributed on chip, forcing a core to incur non-uniform access latencies to each bank. Consequently, high performance and energy efficiency depend on whether a thread's data is placed in local or nearby banks. Using compiler and programmer support, we aim to find an alternative to existing high-overhead designs. In this paper, we take existing parallel programs written in Pthreads, and show the performance gap between current static mapping schemes, costly migration schemes, and idealized static and dynamic best-case scenarios.
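The static-mapping end of the spectrum studied here can be sketched as a one-shot, locality-first assignment: pin each thread's pages to the LLC bank nearest its core. The topology helpers and page-granularity mapping below are illustrative assumptions.

```python
def nearest_bank(core_id, core_coords, bank_coords):
    """Pick the LLC bank with the smallest Manhattan distance to the core."""
    cx, cy = core_coords[core_id]
    return min(bank_coords,
               key=lambda b: abs(bank_coords[b][0] - cx) + abs(bank_coords[b][1] - cy))

def static_map(thread_pages, thread_core, core_coords, bank_coords):
    """Compiler/programmer-guided placement: every page the thread owns
    goes to its local bank, trading global balance for locality."""
    bank = nearest_bank(thread_core, core_coords, bank_coords)
    return {page: bank for page in thread_pages}
```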
Architectural Support for Programming Languages and Operating Systems | 2010
Kshitij Sudan; Niladrish Chatterjee; David W. Nellans; Manu Awasthi; Rajeev Balasubramonian; Al Davis