Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Arkaprava Basu is active.

Publication


Featured research published by Arkaprava Basu.


ACM SIGARCH Computer Architecture News | 2011

The gem5 simulator

Nathan L. Binkert; Bradford M. Beckmann; Gabriel Black; Steven K. Reinhardt; Ali G. Saidi; Arkaprava Basu; Joel Hestness; Derek R. Hower; Tushar Krishna; Somayeh Sardashti; Rathijit Sen; Korey Sewell; Muhammad Shoaib; Nilay Vaish; Mark D. Hill; David A. Wood

The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, makes gem5 a valuable full-system simulation tool.
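
To give a concrete feel for the "highly configurable simulation framework", here is a minimal syscall-emulation (SE) mode script in gem5's Python configuration API. It is a sketch in the style of the public "learning gem5" tutorial; class, port, and path names follow recent gem5 releases and differ in older ones.

```python
# Minimal gem5 SE-mode config: one timing CPU, a memory bus, one DDR3
# channel, running a hello-world binary from the gem5 source tree.
# Run with: build/X86/gem5.opt this_script.py
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock='1GHz',
                                   voltage_domain=VoltageDomain())
system.mem_mode = 'timing'
system.mem_ranges = [AddrRange('512MB')]

system.cpu = TimingSimpleCPU()          # one of gem5's several CPU models
system.membus = SystemXBar()
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports
system.system_port = system.membus.cpu_side_ports

# x86 SE mode also needs the interrupt controller wired to the bus.
system.cpu.createInterruptController()
system.cpu.interrupts[0].pio = system.membus.mem_side_ports
system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports

system.mem_ctrl = MemCtrl(dram=DDR3_1600_8x8(range=system.mem_ranges[0]))
system.mem_ctrl.port = system.membus.mem_side_ports

binary = 'tests/test-progs/hello/bin/x86/linux/hello'
system.workload = SEWorkload.init_compatible(binary)
system.cpu.workload = Process(cmd=[binary])
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
event = m5.simulate()
print('Exited @ tick %d because %s' % (m5.curTick(), event.getCause()))
```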


International Symposium on Microarchitecture | 2013

Heterogeneous system coherence for integrated CPU-GPU systems

Jason Power; Arkaprava Basu; Junli Gu; Sooraj Puthoor; Bradford M. Beckmann; Mark D. Hill; Steven K. Reinhardt; David A. Wood

Many future heterogeneous systems will integrate CPUs and GPUs physically on a single chip and logically connect them via shared memory to avoid explicit data copying. Making this shared memory coherent facilitates programming and fine-grained sharing, but throughput-oriented GPUs can overwhelm CPUs with coherence requests not well-filtered by caches. Meanwhile, region coherence has been proposed for CPU-only systems to reduce snoop bandwidth by obtaining coherence permissions for large regions. This paper develops Heterogeneous System Coherence (HSC) for CPU-GPU systems to mitigate the coherence bandwidth effects of GPU memory requests. HSC replaces a standard directory with a region directory and adds a region buffer to the L2 cache. These structures allow the system to move bandwidth from the coherence network to the high-bandwidth direct-access bus without sacrificing coherence. Evaluation results with a subset of Rodinia benchmarks and the AMD APP SDK show that HSC can improve performance compared to a conventional directory protocol by an average of more than 2× and a maximum of more than 4.5×. Additionally, HSC reduces the bandwidth to the directory by an average of 94% and by more than 99% for four of the analyzed benchmarks.
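
The two structures HSC adds are easy to picture. The toy model below is my own sketch, not AMD's design: the 1 KiB region size and the dictionary-based structures are illustrative assumptions. It shows how a region directory plus a per-L2 region buffer let an agent pay one coherence transaction per region and route subsequent block accesses over the direct-access bus.

```python
# Toy model of region-level coherence in the spirit of HSC.
REGION_BITS = 10                        # 1 KiB regions (assumed size)

def region_of(addr):
    return addr >> REGION_BITS

class RegionDirectory:
    """Replaces a block-grain directory: tracks one holder per region."""
    def __init__(self):
        self.holder = {}                # region id -> agent name

    def request(self, agent, addr):
        r = region_of(addr)
        if self.holder.get(r) in (None, agent):
            self.holder[r] = agent
            return True                 # region permission granted
        return False                    # conflict: stay at block grain

class RegionBuffer:
    """Sits beside an agent's L2 and caches regions already owned."""
    def __init__(self, directory, agent):
        self.directory, self.agent = directory, agent
        self.owned = set()

    def access(self, addr):
        r = region_of(addr)
        if r in self.owned:
            return 'direct-access bus'  # no coherence traffic at all
        if self.directory.request(self.agent, addr):
            self.owned.add(r)
        return 'coherence network'      # one transaction per region

directory = RegionDirectory()
gpu_l2 = RegionBuffer(directory, 'GPU')
routes = [gpu_l2.access(addr) for addr in range(0, 16384, 64)]
print(routes.count('direct-access bus'), 'of', len(routes),
      'block accesses bypassed the coherence network')  # 240 of 256
```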


International Symposium on Computer Architecture | 2012

Reducing memory reference energy with opportunistic virtual caching

Arkaprava Basu; Mark D. Hill; Michael M. Swift

Most modern cores perform a highly-associative translation lookaside buffer (TLB) lookup on every memory access. These designs often hide the TLB lookup latency by overlapping it with L1 cache access, but this overlap does not hide the power dissipated by TLB lookups; it can even exacerbate power dissipation by requiring a higher-associativity L1 cache. With today's concern for power dissipation, designs could instead adopt a virtual L1 cache, wherein TLB access power is dissipated only after L1 cache misses. Unfortunately, virtual caches have compatibility issues, such as supporting writeable synonyms and x86's physical page table walker. This work proposes an Opportunistic Virtual Cache (OVC) that exposes virtual caching as a dynamic optimization by allowing some memory blocks to be cached with virtual addresses and others with physical addresses. OVC relies on small OS changes to signal which pages can use virtual caching (e.g., those with no writeable synonyms), but defaults to physical caching for compatibility. We show OVC's promise with an analysis finding that virtual-cache problems exist but are dynamically rare. We change 240 lines in Linux 2.6.28 to enable OVC. In experiments with PARSEC and commercial workloads, the resulting system saves 94-99% of TLB lookup energy and nearly 23% of L1 cache dynamic lookup energy.
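
A back-of-envelope model shows where the savings come from. The energy unit and the 5% L1 miss rate below are assumptions for illustration, not the paper's measurements, though the result lands in the reported 94-99% range.

```python
# Toy energy model of opportunistic virtual caching: physically cached
# pages pay a TLB lookup on every access; OS-flagged virtually cacheable
# pages pay it only when the (virtually indexed) L1 misses.
TLB_LOOKUP_ENERGY = 1.0                 # arbitrary units (assumed)

def tlb_energy(accesses, l1_miss_rate, virt_cacheable):
    if virt_cacheable:
        return accesses * l1_miss_rate * TLB_LOOKUP_ENERGY
    return accesses * TLB_LOOKUP_ENERGY

accesses, miss_rate = 1_000_000, 0.05   # assumed workload parameters
baseline = tlb_energy(accesses, miss_rate, virt_cacheable=False)
ovc = tlb_energy(accesses, miss_rate, virt_cacheable=True)
print('TLB lookup energy saved: %.0f%%' % (100 * (1 - ovc / baseline)))
# -> TLB lookup energy saved: 95%
```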


International Conference on Supercomputing | 2011

Karma: scalable deterministic record-replay

Arkaprava Basu; Jayaram Bobba; Mark D. Hill

Recent research in deterministic record-replay seeks to ease debugging, security, and fault tolerance on otherwise nondeterministic multicore systems. The important challenge of handling shared memory races (which can occur on any memory reference) can be made more efficient with hardware support. Recent proposals record how long threads run in isolation on top of snooping coherence (IMRR), implicit transactions (DeLorean), or directory coherence (Rerun). As core counts scale, Rerun's directory-based parallel record gets more attractive, but its nearly sequential replay becomes unacceptably slow. This paper proposes Karma for both scalable recording and replay. Karma builds an episodic memory race recorder using a conventional directory cache coherence protocol and records the order of the episodes as a directed acyclic graph. Karma also enables extension of episodes even after some conflicts. During replay, Karma uses wakeup messages to trigger a partially ordered parallel episode replay. Results with several commercial workloads on a 16-core system show that Karma can achieve replay speed (a) within 19%-28% of native execution speed without record-replay and (b) four times faster than even an idealized Rerun replay. Additional results explore tradeoffs between log size and replay speed.
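
The episode DAG and the partially ordered replay can be sketched in a few lines. The data structures and episode names below are mine for illustration, not Karma's hardware formats: recording logs an edge per cross-thread conflict, and replay releases an episode once all of its predecessor episodes have finished.

```python
# Episodic record-replay as a DAG: record (episode, predecessors) pairs,
# then replay in parallel "waves" of episodes whose predecessors are done.
log = []                                  # (episode, [predecessor episodes])

def record(episode, conflicts):
    log.append((episode, list(conflicts)))

# Example: T0.e0 and T1.e0 run concurrently; later episodes conflict.
record('T0.e0', [])
record('T1.e0', [])
record('T1.e1', ['T0.e0'])                # T1's 2nd episode saw T0.e0's data
record('T0.e1', ['T1.e0'])

def replay(log):
    pending = {ep: set(preds) for ep, preds in log}
    done, waves = set(), []
    while pending:
        # Every episode whose predecessors finished can replay in parallel.
        ready = [ep for ep, preds in pending.items() if preds <= done]
        waves.append(ready)
        for ep in ready:
            del pending[ep]
            done.add(ep)
    return waves

print(replay(log))  # [['T0.e0', 'T1.e0'], ['T1.e1', 'T0.e1']]
```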


International Symposium on Microarchitecture | 2014

Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks

Jayneel Gandhi; Arkaprava Basu; Mark D. Hill; Michael M. Swift

Virtualization provides value for many workloads, but its cost rises for workloads with poor memory access locality. This overhead comes from translation lookaside buffer (TLB) misses where the hardware performs a 2D page walk (up to 24 memory references on x86-64) rather than a native TLB miss (up to only 4 memory references). The first dimension translates guest virtual addresses to guest physical addresses, while the second translates guest physical addresses to host physical addresses. This paper proposes new hardware using direct segments with three new virtualized modes of operation that significantly speed up virtualized address translation. Further, this paper proposes two novel techniques to address important limitations of the original direct segments. First, self-ballooning reduces fragmentation in physical memory and addresses the architectural input/output (I/O) gap in x86-64. Second, an escape filter provides alternate translations for exceptional pages within a direct segment (e.g., physical pages with permanent hard faults). We emulate the proposed hardware and prototype the software in Linux with KVM on x86-64. One mode -- VMM Direct -- reduces address translation overhead to near-native without guest application or OS changes (2% slower than native on average), while a more aggressive mode -- Dual Direct -- on big-memory workloads performs better than native with near-zero translation overhead.
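
The 24-versus-4 arithmetic, and the direct-segment fast path, are worth making explicit. The sketch below is illustrative: page_walk() stands in for a conventional walker, and the base/limit/offset parameters mirror the direct-segment idea.

```python
LEVELS = 4                                # x86-64 radix page-table levels
native_walk = LEVELS                      # up to 4 memory references

# Each of the 4 guest page-table reads is at a guest-physical address that
# needs its own 4-reference host walk, and the final guest-physical data
# address needs one more host walk: 4*(4+1) + 4 = 24 references.
nested_walk = LEVELS * (LEVELS + 1) + LEVELS
print(native_walk, nested_walk)           # 4 24

def page_walk(va):
    raise NotImplementedError('conventional nested walk goes here')

def translate(va, base, limit, offset):
    """Direct-segment fast path: two comparisons and an add."""
    if base <= va < limit:
        return va + offset                # no page walk at all
    return page_walk(va)                  # fall back to paging

print(hex(translate(0x5000, base=0x4000, limit=0x9000, offset=0x10000)))
# -> 0x15000
```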


International Symposium on Performance Analysis of Systems and Software | 2016

Observations and opportunities in architecting shared virtual memory for heterogeneous systems

Jan Vesely; Arkaprava Basu; Mark Oskin; Gabriel H. Loh; Abhishek Bhattacharjee

Computing is becoming increasingly heterogeneous, with accelerators like GPUs tightly integrated with CPUs on the same die. Extending the CPU's virtual addressing mechanism to these accelerators is a key step in making accelerators easily programmable. In this work, we analyze, using real-system measurements, shared virtual memory across the CPU and an integrated GPU. We make several key observations and highlight consequent research opportunities: (1) servicing a TLB miss from the GPU can be an order of magnitude slower than servicing one from the CPU, so it is imperative to enable many concurrent TLB misses to hide this larger latency; (2) divergence in memory accesses impacts the GPU's address translation more than the rest of the memory hierarchy, and research into address translation mechanisms tolerant to this effect is imperative; and (3) page faults from the GPU are considerably slower than those from the CPU, and software-hardware co-design is essential for efficient implementation of page faults from throughput-oriented accelerators like GPUs. We present a detailed measurement study of a commercially available integrated APU that illustrates these effects and motivates future research opportunities.
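
Observation (1) is essentially Little's law: the number of in-flight misses needed equals miss latency times the demanded miss throughput. The latencies below are assumed round numbers, not the paper's measurements.

```python
# Little's law applied to TLB-miss concurrency: a 10x longer GPU miss
# latency needs roughly 10x more concurrent misses to sustain the same
# translation throughput. All numbers here are illustrative assumptions.
cpu_miss_latency_ns, gpu_miss_latency_ns = 100, 1000
demanded_misses_per_ns = 0.05
print('CPU in-flight misses:', cpu_miss_latency_ns * demanded_misses_per_ns)
print('GPU in-flight misses:', gpu_miss_latency_ns * demanded_misses_per_ns)
# -> 5.0 vs. 50.0 concurrent misses
```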


ACM SIGARCH Computer Architecture News | 2014

BadgerTrap: a tool to instrument x86-64 TLB misses

Jayneel Gandhi; Arkaprava Basu; Mark D. Hill; Michael M. Swift

The overheads of memory management units (MMUs) have gained importance in today's systems. Detailed simulators may be too slow to gain insights into micro-architectural techniques that improve MMU efficiency. To address this issue, we propose a novel tool, BadgerTrap, which allows online instrumentation of TLB misses. It allows first-order analysis of new hardware techniques to improve MMU efficiency. The tool helps to create and analyze x86-64 TLB miss traces. We describe example studies to show various ways this tool can be applied to gain new research insights.
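
As an example of the first-order analysis such a trace enables, the sketch below computes TLB misses per kilo-instruction and the page-level footprint of the misses. The trace format and addresses are assumptions for illustration, not BadgerTrap's actual output.

```python
# Post-process a list of TLB-miss virtual addresses: compute MPKI and
# count distinct 4 KiB pages vs. 2 MiB regions touched, a quick proxy
# for how much a larger page size would help.
def analyze(miss_vaddrs, instructions):
    mpki = 1000.0 * len(miss_vaddrs) / instructions
    pages = {va >> 12 for va in miss_vaddrs}      # 4 KiB frames
    regions = {va >> 21 for va in miss_vaddrs}    # 2 MiB frames
    print(f'TLB MPKI: {mpki:.2f}')
    print(f'{len(pages)} distinct 4 KiB pages, '
          f'{len(regions)} distinct 2 MiB regions')

analyze([0x7f2a00001000, 0x7f2a00042000, 0x7f2a00201000,
         0x7f2a00042000], instructions=10_000)
# -> TLB MPKI: 0.40
# -> 3 distinct 4 KiB pages, 2 distinct 2 MiB regions
```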


International Conference on Computer Design | 2013

FreshCache: Statically and dynamically exploiting dataless ways

Arkaprava Basu; Derek R. Hower; Mark D. Hill; Michael M. Swift

Last level caches (LLCs) account for a substantial fraction of the area and power budget in many modern processors. Two recent trends - dwindling die yield that falls off sharply with larger chips and increasing static power - make a strong case for a fresh look at LLC design. Inclusive caches are particularly interesting because many commercially successful processors use inclusion to ease coherence at a cost of some data being stale or redundant. Prior work has demonstrated that LLC designs can be improved through static (design-time) or dynamic (runtime) use of “dataless ways”. Static dataless ways remove the data, but not the tags, from some cache ways to save energy and area without complicating inclusive-LLC coherence. A dynamic version (dynamic dataless ways) can dynamically turn off data, but not tags, effectively adapting the classic selective cache ways idea to save energy in the LLC, though not area. We find that (a) all our benchmarks benefit from dataless ways, but (b) the best number of dataless ways varies by workload. Thus, a purely static dataless design leaves energy-saving opportunity on the table, while a purely dynamic dataless design misses the area-saving opportunity. To surpass both pure approaches, we develop the FreshCache LLC design, which both statically and dynamically exploits dataless ways, including a predictor to adapt the number of dynamic dataless ways as well as detailed cache management policies. Results show that FreshCache saves more energy than static dataless ways alone (e.g., 72% vs. 9% of LLC energy) and more area than dynamic dataless ways alone (e.g., 8% vs. 0% of LLC area).
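
The sketch below illustrates the core invariant; the structures and fill policy are mine, not FreshCache's RTL. In an inclusive LLC, a block that also resides in an inner cache needs only its tag at the LLC to preserve inclusion, so it can occupy a way whose data array is absent (static) or powered down (dynamic).

```python
# One LLC set with a dynamic budget of tag-only ("dataless") ways.
class LLCSet:
    def __init__(self, ways, dataless_ways):
        self.ways = ways
        self.dataless = dataless_ways     # dynamic knob: 0..ways
        self.tags = {}                    # tag -> has_data

    def fill(self, tag, cached_in_inner):
        # Prefer a dataless way when an inner cache holds the data,
        # so inclusion is kept without storing the data twice.
        tag_only_in_use = sum(1 for d in self.tags.values() if not d)
        use_dataless = cached_in_inner and tag_only_in_use < self.dataless
        self.tags[tag] = not use_dataless

s = LLCSet(ways=16, dataless_ways=4)
for tag in range(8):                      # blocks 0-3 also live in the L1
    s.fill(tag, cached_in_inner=(tag < 4))
print(sum(1 for has_data in s.tags.values() if not has_data),
      'of 8 blocks stored tag-only')      # -> 4 of 8 blocks stored tag-only
```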


Proceedings of the Second International Symposium on Memory Systems | 2016

Challenges of Programming a System with Heterogeneous Memories and Heterogeneous Processors: A Programmer's View

Shuai Che; Arkaprava Basu; Jonathan Gallmeier

Recently there has been significant development and innovation on both the heterogeneous-memory and heterogeneous-compute frontiers. This paper summarizes the challenges, surveys related work, and proposes possible research directions for exploiting both heterogeneous memory and compute resources in a computer system. We focus our discussion on the memory system and also touch on issues related to heterogeneous compute.


Proceedings of the Second International Symposium on Memory Systems | 2016

Software Assisted Hardware Cache Coherence for Heterogeneous Processors

Arkaprava Basu; Sooraj Puthoor; Shuai Che; Bradford M. Beckmann

Current trends suggest that future computing platforms will be increasingly heterogeneous. While these heterogeneous processors physically integrate disparate computing elements like CPUs and GPUs on a single chip, their programmability critically depends upon the ability to efficiently support cache coherence and shared virtual memory across tightly-integrated CPUs and GPUs. However, throughput-oriented GPUs easily overwhelm the existing hardware coherence mechanisms that have long kept the cache hierarchies in multi-core CPUs coherent. This paper proposes a novel solution called Software Assisted Hardware Coherence (SAHC) to scale cache coherence to future heterogeneous processors. We observe that the system software (operating system and runtime) often has semantic knowledge about the sharing patterns of data across the CPU and the GPU. This high-level knowledge can be utilized to effectively provide cache coherence across throughput-oriented GPUs and latency-sensitive CPUs in a heterogeneous processor. SAHC thus proposes a hybrid software-hardware mechanism that judiciously uses hardware coherence only when needed, using the software's knowledge to filter out most of the unnecessary coherence traffic. Our evaluation suggests that SAHC can often eliminate up to 98-100% of hardware coherence lookups, resulting in up to a 49% reduction in runtime.
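
The filtering idea can be captured in a few lines. In the sketch below (page states and names are mine, following the paper's high-level idea rather than its exact design), the system software tags pages as private or shared, and only accesses to shared pages pay for a hardware directory lookup; with mostly private data, the filter rate approaches the reported 98-100%.

```python
# Software-assisted coherence filter: OS/runtime-maintained page states
# gate whether a memory access needs a hardware coherence lookup.
PRIVATE, SHARED = 'private', 'shared'
page_state = {}                           # page number -> state, set by OS

def needs_coherence_lookup(vaddr):
    # Unknown pages conservatively default to shared.
    return page_state.get(vaddr >> 12, SHARED) == SHARED

# GPU scratch data tagged private; one page truly shared with the CPU.
for page in range(100):
    page_state[page] = PRIVATE
page_state[42] = SHARED

accesses = [page << 12 for page in range(100) for _ in range(10)]
lookups = sum(needs_coherence_lookup(a) for a in accesses)
print('filtered %.0f%% of coherence lookups'
      % (100 * (1 - lookups / len(accesses))))  # -> filtered 99%
```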

Collaboration


Dive into Arkaprava Basu's collaborations.

Top Co-Authors

Mark D. Hill
University of Wisconsin-Madison

Michael M. Swift
University of Wisconsin-Madison

Jayneel Gandhi
University of Wisconsin-Madison

David A. Wood
University of Wisconsin-Madison

Derek R. Hower
University of Wisconsin-Madison