Publication


Featured research published by Abhishek Bhattacharjee.


International Symposium on Computer Architecture | 2009

Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors

Abhishek Bhattacharjee; Margaret Martonosi

With the shift towards chip multiprocessors (CMPs), exploiting and managing parallelism has become a central problem in computing systems. Many issues of parallelism management boil down to discerning which running threads or processes are critical, or slowest, versus which are non-critical. If one can accurately predict critical threads in a parallel program, then one can respond in a variety of ways. Possibilities include running the critical thread at a faster clock rate, applying load-balancing techniques to offload work onto currently non-critical threads, or giving the critical thread more on-chip resources to execute faster. This paper proposes and evaluates simple but effective thread criticality predictors for parallel applications. We show that accurate predictors can be built using counters that are typically already available on-chip. Our predictor, based on memory hierarchy statistics, identifies thread criticality with an average accuracy of 93% across a range of architectures. We also demonstrate two applications of our predictor. First, we show how Intel's Threading Building Blocks (TBB) parallel runtime system can benefit from task-stealing techniques that use our criticality predictor to reduce load imbalance. Using criticality prediction to guide TBB's task-stealing decisions improves performance by 13-32% for TBB-based PARSEC benchmarks running on a 32-core CMP. As a second application, criticality prediction guides dynamic energy optimizations in barrier-based applications. By running the predicted critical thread at the full clock rate and frequency-scaling non-critical threads, this approach achieves average energy savings of 15% while negligibly degrading performance for SPLASH-2 and PARSEC benchmarks.
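The predictor's core idea lends itself to a compact illustration. Below is a minimal Python sketch of a counter-based criticality score, assuming hypothetical per-thread L1/L2 miss counts and illustrative weights; the paper's actual predictor and its tuning are more involved.

```python
# Minimal sketch of a memory-hierarchy-based thread criticality predictor.
# Counter names and weights are illustrative, not the paper's exact formula:
# deeper misses stall a thread longer, so they are weighted more heavily.

L1_WEIGHT, L2_WEIGHT = 1, 10  # hypothetical relative miss penalties

def criticality_score(l1_misses: int, l2_misses: int) -> int:
    """Higher score => thread is predicted to be more critical (slower)."""
    return L1_WEIGHT * l1_misses + L2_WEIGHT * l2_misses

def predict_critical_thread(counters: dict) -> int:
    """counters maps thread id -> (L1 misses, L2 misses) for the last interval."""
    return max(counters, key=lambda tid: criticality_score(*counters[tid]))

# Example: thread 2 suffers the most L2 misses, so it is flagged as critical
# and could be given a higher clock rate or relieved of stolen tasks.
sample = {0: (1200, 40), 1: (900, 25), 2: (1500, 210), 3: (1000, 60)}
print(predict_critical_thread(sample))  # -> 2
```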


International Symposium on Microarchitecture | 2012

CoScale: Coordinating CPU and Memory System DVFS in Server Systems

Qingyuan Deng; David Meisner; Abhishek Bhattacharjee; Thomas F. Wenisch; Ricardo Bianchini

Recent work has introduced memory system dynamic voltage and frequency scaling (DVFS), and has suggested that balanced scaling of both CPU and the memory system is the most promising approach for conserving energy in server systems. In this paper, we first demonstrate that CPU and memory system DVFS often conflict when performed independently by separate controllers. In response, we propose CoScale, the first method for effectively coordinating these mechanisms under performance constraints. CoScale relies on execution profiling of each core via (existing and new) performance counters, and models of core and memory performance and power consumption. CoScale explores the set of possible frequency settings in such a way that it efficiently minimizes the full-system energy consumption within the performance bound. Our results demonstrate that, by effectively coordinating CPU and memory power management, CoScale conserves a significant amount of system energy compared to existing approaches, while consistently remaining within the prescribed performance bounds. The results also show that CoScale conserves almost as much system energy as an offline, idealized approach.
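As a rough illustration of the coordination problem, here is a hedged Python sketch that searches CPU/memory frequency pairs for the lowest modeled energy within a performance bound. The frequency lists and the performance/energy models are toy stand-ins; CoScale itself uses counter-driven models and an efficient exploration of the setting space rather than the brute-force search shown here.

```python
# Hedged sketch of CoScale-style coordinated frequency selection: search the
# cross-product of CPU and memory frequencies and keep the lowest-energy
# setting whose predicted runtime stays within the performance bound.

CPU_FREQS = [1.0, 1.5, 2.0, 2.6]   # GHz (illustrative)
MEM_FREQS = [0.8, 1.066, 1.333]    # GHz (illustrative)

def predicted_runtime(cpu_f, mem_f, cpu_work=1.0, mem_work=0.5):
    # Toy model: runtime has a CPU-bound and a memory-bound component.
    return cpu_work / cpu_f + mem_work / mem_f

def predicted_energy(cpu_f, mem_f):
    # Toy model: dynamic power grows superlinearly with frequency.
    power = 2.0 * cpu_f ** 2 + 1.0 * mem_f ** 2
    return power * predicted_runtime(cpu_f, mem_f)

def coscale_pick(slack=1.20):  # allow a 20% slowdown versus full speed
    baseline = predicted_runtime(max(CPU_FREQS), max(MEM_FREQS))
    feasible = [(predicted_energy(c, m), c, m)
                for c in CPU_FREQS for m in MEM_FREQS
                if predicted_runtime(c, m) <= slack * baseline]
    return min(feasible)  # (energy, cpu_freq, mem_freq)

# Under this toy model the memory-bound component keeps memory at full speed
# while the CPU is scaled down -- the kind of joint decision independent
# controllers would miss.
print(coscale_pick())
```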


High-Performance Computer Architecture | 2011

Shared last-level TLBs for chip multiprocessors

Abhishek Bhattacharjee; Daniel Lustig; Margaret Martonosi

Translation Lookaside Buffers (TLBs) are critical to processor performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as chip multiprocessors (CMPs) become ubiquitous, TLB design must be re-evaluated. This paper is the first to propose and evaluate shared last-level (SLL) TLBs as an alternative to the commercial norm of private, per-core L2 TLBs. SLL TLBs eliminate 7–79% of system-wide misses for parallel workloads. This is an average of 27% better than conventional private, per-core L2 TLBs, translating to notable runtime gains. SLL TLBs also provide benefits comparable to recently-proposed Inter-Core Cooperative (ICC) TLB prefetchers, but with considerably simpler hardware. Furthermore, unlike these prefetchers, SLL TLBs can aid sequential applications, eliminating 35–95% of the TLB misses for various multiprogrammed combinations of sequential applications. This corresponds to a 21% average increase in TLB miss eliminations compared to private, per-core L2 TLBs. Because of their benefits for parallel and sequential applications, and their readily-implementable hardware, SLL TLBs hold great promise for CMPs.
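A minimal sketch of the organization follows, assuming dict-based TLB structures with FIFO eviction in place of real set-associative hardware; capacities and the fill policy are illustrative.

```python
# Sketch of private L1 TLBs backed by a shared last-level (SLL) TLB: a miss
# by one core can hit on a translation brought in by another core.

from collections import OrderedDict

class Tlb:
    def __init__(self, capacity):
        self.capacity, self.store = capacity, OrderedDict()
    def lookup(self, vpn):
        return self.store.get(vpn)
    def fill(self, vpn, ppn):
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)  # evict the oldest entry (FIFO)
        self.store[vpn] = ppn

class SllTlbHierarchy:
    def __init__(self, cores=4):
        self.l1 = [Tlb(64) for _ in range(cores)]  # private per-core L1 TLBs
        self.sll = Tlb(512)                        # one L2 TLB shared by all cores

    def translate(self, core, vpn, walk_page_table):
        ppn = self.l1[core].lookup(vpn)
        if ppn is None:
            ppn = self.sll.lookup(vpn)   # shared level catches inter-core reuse
            if ppn is None:
                ppn = walk_page_table(vpn)
                self.sll.fill(vpn, ppn)
            self.l1[core].fill(vpn, ppn)
        return ppn

h = SllTlbHierarchy()
walk = lambda vpn: vpn + 0x1000              # stand-in page-table walker
h.translate(0, 42, walk)                     # core 0 walks and fills the SLL TLB
print(h.translate(1, 42, walk) == walk(42))  # core 1 hits in the shared level
```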


International Conference on Parallel Architectures and Compilation Techniques | 2009

Characterizing the TLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors

Abhishek Bhattacharjee; Margaret Martonosi

Translation Lookaside Buffers (TLBs) are a staple in modern computer systems and have a significant impact on overall system performance. Numerous prior studies have addressed TLB designs to lower access times and miss rates; these, however, have been targeted towards uniprocessor architectures. As the computer industry embraces chip multiprocessor (CMP) architectures, it is important to study the TLB behavior of emerging parallel workloads. This work presents the first full-system characterization of the TLB behavior of emerging parallel applications on real-system CMPs. Using the PARSEC benchmarks, representative of emerging RMS workloads, we show that TLB misses can hinder system performance significantly. We also evaluate TLB miss stream patterns and show that multiple threads of a parallel execution experience a large number of redundant and predictable misses. For our evaluated benchmarks, 30% to 95% of the total misses fall under this category. Our results point to the need for novel TLB designs encouraging inter-core cooperation, either through hierarchically shared TLBs or through inter-core TLB prediction mechanisms.
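The notion of an inter-core redundant miss can be made concrete with a small trace-processing sketch; the trace format and classification rule below are illustrative, not the paper's exact methodology.

```python
# Count a D-TLB miss as inter-core "redundant" if another core already
# missed on the same virtual page earlier in the trace.

def classify_misses(trace):
    """trace: iterable of (core_id, vpn) TLB-miss records in program order."""
    seen_by = {}            # vpn -> set of cores that already missed on it
    redundant = total = 0
    for core, vpn in trace:
        owners = seen_by.setdefault(vpn, set())
        if owners - {core}:          # some *other* core missed on vpn before
            redundant += 1
        owners.add(core)
        total += 1
    return redundant, total

trace = [(0, 7), (1, 7), (2, 7), (0, 9), (3, 9), (1, 8)]
r, t = classify_misses(trace)
print(f"{r}/{t} misses were inter-core redundant")  # -> 3/6
```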


Architectural Support for Programming Languages and Operating Systems | 2014

Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces

Bharath Pichai; Lisa Hsu; Abhishek Bhattacharjee

The proliferation of heterogeneous compute platforms, of which CPU/GPU is a prevalent example, necessitates a manageable programming model to ensure widespread adoption. A key component of this is a shared unified address space between the heterogeneous units to obtain the programmability benefits of virtual memory. To this end, we are the first to explore GPU Memory Management Units (MMUs) consisting of Translation Lookaside Buffers (TLBs) and page table walkers (PTWs) for address translation in unified heterogeneous systems. We show the performance challenges posed by GPU warp schedulers on TLBs accessed in parallel with L1 caches, which provide many well-known programmability benefits. In response, we propose modest TLB and PTW augmentations that recover most of the performance lost by introducing L1 parallel TLB access. We also show that a little TLB-awareness can make other GPU performance enhancements (e.g., cache-conscious warp scheduling and dynamic warp formation on branch divergence) feasible in the face of cache-parallel address translation, bringing overheads in the range deemed acceptable for CPUs (10-15% of runtime). We presume this initial design leaves room for improvement but anticipate that our bigger insight, that a little TLB-awareness goes a long way in GPUs, will spur further work in this fruitful area.
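One augmentation in this spirit, shown below as a hypothetical simplification rather than the paper's exact hardware, is coalescing a warp's per-lane addresses into unique page probes before they reach the TLB, so a divergent warp does not issue dozens of redundant translation requests. The page size and warp width are illustrative.

```python
# Coalesce a warp's 32 per-lane virtual addresses into one TLB probe per
# distinct page, recording which lanes wait on each translation.

PAGE_SHIFT = 12  # 4 KB pages
WARP_SIZE = 32

def coalesce_warp_translations(lane_addrs):
    """Map each lane's virtual address to a shared per-page TLB probe."""
    probes = {}  # vpn -> list of lanes waiting on that translation
    for lane, addr in enumerate(lane_addrs):
        probes.setdefault(addr >> PAGE_SHIFT, []).append(lane)
    return probes

# A strided access pattern: 32 lanes touch only 2 distinct pages, so the TLB
# sees 2 probes instead of 32.
addrs = [0x10000 + i * 256 for i in range(WARP_SIZE)]
print(len(coalesce_warp_translations(addrs)))  # -> 2
```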


Architectural Support for Programming Languages and Operating Systems | 2010

Inter-core cooperative TLB for chip multiprocessors

Abhishek Bhattacharjee; Margaret Martonosi

Translation Lookaside Buffers (TLBs) are commonly employed in modern processor designs and have considerable impact on overall system performance. A number of past works have studied TLB designs to lower access times and miss rates, specifically for uniprocessors. With the growing dominance of chip multiprocessors (CMPs), it is necessary to examine TLB performance in the context of parallel workloads. This work is the first to present TLB prefetchers that exploit commonality in TLB miss patterns across cores in CMPs. We propose and evaluate two Inter-Core Cooperative (ICC) TLB prefetching mechanisms, assessing their effectiveness at eliminating TLB misses both individually and together. Our results show these approaches require at most modest hardware and can collectively eliminate 19% to 90% of data TLB (D-TLB) misses across the surveyed parallel workloads. We also compare performance improvements across a range of hardware and software implementation possibilities. We find that while a fully-hardware implementation results in average performance improvements of 8-46% for a range of TLB sizes, a hardware/software approach yields improvements of 4-32%. Overall, our work shows that TLB prefetchers exploiting inter-core correlations can effectively eliminate TLB misses.
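A minimal sketch of the leader-follower flavor of inter-core cooperative prefetching follows, under simplifying assumptions: unbounded TLB storage, a push-to-all-cores policy, and a small FIFO prefetch buffer per core.

```python
# When a core misses and walks the page table, push the resulting translation
# into the other cores' prefetch buffers, betting that threads of the same
# parallel program will soon touch the same pages.

from collections import OrderedDict

class Core:
    def __init__(self):
        self.tlb = {}                      # unbounded stand-in for a real TLB
        self.prefetch_buf = OrderedDict()  # small FIFO of pushed translations

    def receive_prefetch(self, vpn, ppn, capacity=16):
        if len(self.prefetch_buf) >= capacity:
            self.prefetch_buf.popitem(last=False)
        self.prefetch_buf[vpn] = ppn

def access(cores, core_id, vpn, walk_page_table):
    me = cores[core_id]
    if vpn in me.tlb:
        return me.tlb[vpn]
    if vpn in me.prefetch_buf:             # prefetched by a leader: no walk
        me.tlb[vpn] = me.prefetch_buf.pop(vpn)
        return me.tlb[vpn]
    ppn = walk_page_table(vpn)             # genuine miss: walk, then share
    me.tlb[vpn] = ppn
    for other in cores:
        if other is not me:
            other.receive_prefetch(vpn, ppn)
    return ppn

cores = [Core() for _ in range(4)]
walk = lambda vpn: vpn ^ 0xABC
access(cores, 0, 5, walk)          # core 0 walks and broadcasts the mapping
print(5 in cores[1].prefetch_buf)  # -> True: core 1 will avoid a walk
```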


International Symposium on Microarchitecture | 2012

CoLT: Coalesced Large-Reach TLBs

Binh Pham; Viswanathan Vaidyanathan; Aamer Jaleel; Abhishek Bhattacharjee

Translation Lookaside Buffers (TLBs) are critical to system performance, particularly as applications demand larger working sets and with the adoption of virtualization. Architectural support for super pages has previously been proposed to improve TLB performance. By allocating contiguous physical pages to contiguous virtual pages, the operating system (OS) constructs super pages which need just one TLB entry rather than the hundreds required for the constituent base pages. While this greatly reduces TLB misses, these gains are often offset by the implementation difficulties of generating and managing ample contiguity for super pages. We show, however, that basic OS memory allocation mechanisms such as buddy allocators and memory compaction naturally assign contiguous physical pages to contiguous virtual pages. Our real-system experiments show that while usually insufficient for super pages, these intermediate levels of contiguity exist under various system conditions and even under high load. In response, we propose Coalesced Large-Reach TLBs (CoLT), which leverage this intermediate contiguity to coalesce multiple virtual-to-physical page translations into single TLB entries. We show that CoLT implementations eliminate 40% to 58% of TLB misses on average, improving performance by 14%. Overall, we demonstrate that the OS naturally generates page allocation contiguity. CoLT exploits this contiguity to eliminate TLB misses for next-generation, big-data applications with low-overhead implementations.
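The coalescing idea reduces to range-style TLB entries. Below is a hedged Python sketch with an illustrative maximum run length and a linear entry scan standing in for real associative lookup.

```python
# One TLB entry covers a run of contiguous virtual-to-physical mappings
# (base VPN, base PPN, run length), so modest OS allocation contiguity
# collapses several entries into one.

MAX_RUN = 8  # illustrative cap on how many pages one entry may coalesce

class ColtTlb:
    def __init__(self):
        self.entries = []  # list of (base_vpn, base_ppn, run_length)

    def lookup(self, vpn):
        for base_vpn, base_ppn, run in self.entries:
            if base_vpn <= vpn < base_vpn + run:
                return base_ppn + (vpn - base_vpn)  # offset within the run
        return None

    def fill(self, vpn, ppn):
        # Try to extend an existing entry whose run ends right before vpn.
        for i, (bv, bp, run) in enumerate(self.entries):
            if run < MAX_RUN and vpn == bv + run and ppn == bp + run:
                self.entries[i] = (bv, bp, run + 1)
                return
        self.entries.append((vpn, ppn, 1))

tlb = ColtTlb()
for page in range(4):                 # buddy-allocator-style contiguity:
    tlb.fill(100 + page, 700 + page)  # virtual 100..103 -> physical 700..703
print(len(tlb.entries), tlb.lookup(103))  # -> 1 entry, translates to 703
```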


International Symposium on Microarchitecture | 2013

Large-reach memory management unit caches

Abhishek Bhattacharjee

Within the ever-important memory hierarchy, little research is devoted to Memory Management Unit (MMU) caches, implemented in modern processors to accelerate Translation Lookaside Buffer (TLB) misses. MMU caches play a critical role in determining system performance. This paper presents a measurement study quantifying the size of that role, and describes two novel optimizations to improve the performance of this structure on a range of sequential and parallel big-data workloads. The first is a software/hardware optimization that requires modest operating system (OS) and hardware support. In this approach, the OS allocates page table pages in ways that make them amenable for coalescing in MMU caches, increasing their hit rates. The second is a readily-implementable hardware-only approach, replacing standard per-core MMU caches with a single shared MMU cache of the same total area. Despite the shared cache's longer access latency, its lower miss rate greatly improves performance. The approaches are orthogonal; together, they achieve performance close to ideal MMU caches. Overall, this paper addresses the paucity of research on MMU caches. Our insights will assist the development of high-performance address translation support for systems running big-data applications.
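To make the structure concrete, here is a hedged sketch of a shared MMU cache for x86-64-style four-level walks. The 9-bit index fields match x86-64 paging, but the dict-based storage and fill interface are illustrative.

```python
# An MMU cache holds entries keyed by the upper virtual-address bits that
# return a pointer to a lower page-table level, letting a TLB miss skip
# memory references. Here one cache is shared by all cores, echoing the
# paper's second (hardware-only) proposal.

PAGE_SHIFT, INDEX_BITS = 12, 9

def walk_indices(vaddr):
    """Split a virtual address into its four page-table indices (L4..L1)."""
    vpn = vaddr >> PAGE_SHIFT
    return [(vpn >> (INDEX_BITS * lvl)) & 0x1FF for lvl in (3, 2, 1, 0)]

class SharedMmuCache:
    def __init__(self):
        self.cache = {}  # (depth, index-prefix tuple) -> next-level pointer

    def longest_prefix_hit(self, vaddr):
        """Return how many upper levels of the walk the cache can skip."""
        idx = walk_indices(vaddr)
        for depth in (3, 2, 1):  # prefer skipping L4+L3+L2, then fewer levels
            if (depth, tuple(idx[:depth])) in self.cache:
                return depth
        return 0

    def fill(self, vaddr, depth, next_level_ptr):
        self.cache[(depth, tuple(walk_indices(vaddr)[:depth]))] = next_level_ptr

mmu = SharedMmuCache()
mmu.fill(0x7f1234567000, depth=3, next_level_ptr=0xDEAD000)  # filled by core 0
# Any core touching a nearby address now skips 3 of the 4 memory references.
print(mmu.longest_prefix_hit(0x7f1234568000))  # -> 3
```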


International Symposium on Low Power Electronics and Design | 2012

MultiScale: memory system DVFS with multiple memory controllers

Qingyuan Deng; David Meisner; Abhishek Bhattacharjee; Thomas F. Wenisch; Ricardo Bianchini

The fraction of server energy consumed by the memory system has been increasing rapidly and is now on par with that consumed by processors. Recent work demonstrates that substantial memory energy can be saved with only a small, tightly-controlled performance degradation using memory Dynamic Voltage and Frequency Scaling (DVFS). Prior studies consider only servers with a single memory controller (MC); however, multicore server processors have begun to incorporate multiple MCs. We propose MultiScale, the first technique to coordinate DVFS across multiple MCs, memory channels, and memory devices. Under operating system control, MultiScale monitors application bandwidth requirements across MCs. It then uses a heuristic algorithm to select and apply a frequency combination that will minimize the overall system energy within user-specified per-application performance constraints. Our results demonstrate that MultiScale reduces system energy consumption significantly, compared to prior approaches, while respecting the user-specified performance constraints.
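A hedged sketch of the control loop's flavor: measured per-MC bandwidth demands drive a greedy choice of the lowest adequate frequency. The headroom rule and all numbers are illustrative simplifications; MultiScale's actual heuristic works against per-application performance constraints rather than raw bandwidth headroom.

```python
# Periodically read each memory controller's observed bandwidth demand and
# pick the lowest frequency that still leaves headroom.

MC_FREQS_GHZ = [0.8, 1.066, 1.333, 1.6]  # available settings (illustrative)
PEAK_GBPS_AT_MAX = 12.8                  # per-MC peak at the top frequency

def pick_mc_frequency(demand_gbps, headroom=1.25):
    """Lowest frequency whose scaled peak bandwidth still covers demand."""
    for f in MC_FREQS_GHZ:               # ascending order: greedy
        peak = PEAK_GBPS_AT_MAX * f / MC_FREQS_GHZ[-1]
        if peak >= demand_gbps * headroom:
            return f
    return MC_FREQS_GHZ[-1]

# Per-MC demands measured over the last OS interval (illustrative numbers):
demands = {0: 2.1, 1: 9.5, 2: 0.4, 3: 5.0}
print({mc: pick_mc_frequency(d) for mc, d in demands.items()})
# -> lightly loaded MCs drop to 0.8 GHz; the busy MC stays at 1.6 GHz
```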


High-Performance Computer Architecture | 2014

Increasing TLB reach by exploiting clustering in page translations

Binh Pham; Abhishek Bhattacharjee; Yasuko Eckert; Gabriel H. Loh

Steadily increasing main memory capacities require corresponding increases in the processor's translation lookaside buffer (TLB) resources to avoid performance bottlenecks. Large operating system page sizes can mitigate the bottleneck with a smaller TLB, but most OSs and applications do not fully utilize the large-page support in current hardware. Recent work has shown that, while not guaranteed, some virtual-to-physical page mappings exhibit “contiguous” spatial locality in which consecutive virtual pages map to consecutive physical pages. Such locality provides opportunities to coalesce “adjacent” TLB entries for increased reach. We observe that beyond simple adjacent-entry coalescing, many more translations exhibit “clustered” spatial locality in which a group or cluster of nearby virtual pages map to a similarly clustered set of physical pages. In this work, we provide a detailed characterization of the spatial locality among the virtual-to-physical translations. Based on this characterization, we present a multi-granular TLB organization that significantly increases its effective reach and reduces miss rates substantially while requiring no additional OS support. Our evaluation shows that the multi-granular design outperforms conventional TLBs and the recently proposed coalesced TLBs technique.
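A hedged sketch of a cluster-granularity TLB entry follows, assuming 8-page clusters and dict-based storage; the paper's multi-granular organization is more elaborate.

```python
# One entry covers a cluster of 2^K consecutive virtual pages, and a small
# per-page field holds each page's physical page number, so mappings need
# not be strictly contiguous to share an entry.

CLUSTER_BITS = 3  # 8-page clusters (illustrative)

class ClusterTlb:
    def __init__(self):
        self.entries = {}  # virtual cluster number -> {sub-page index: ppn}

    def fill(self, vpn, ppn):
        vcn, sub = vpn >> CLUSTER_BITS, vpn & ((1 << CLUSTER_BITS) - 1)
        self.entries.setdefault(vcn, {})[sub] = ppn

    def lookup(self, vpn):
        vcn, sub = vpn >> CLUSTER_BITS, vpn & ((1 << CLUSTER_BITS) - 1)
        return self.entries.get(vcn, {}).get(sub)

tlb = ClusterTlb()
# Clustered but non-contiguous mapping: virtual pages 64..67 land on nearby,
# slightly shuffled physical pages, yet all share one cluster entry.
for vpn, ppn in [(64, 900), (65, 903), (66, 901), (67, 902)]:
    tlb.fill(vpn, ppn)
print(len(tlb.entries), tlb.lookup(66))  # -> 1 entry, translates to 901
```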

Collaboration


Top co-authors of Abhishek Bhattacharjee:

Jan Vesely (Advanced Micro Devices)


Mark Oskin (University of Washington)
