
Publication


Featured research published by Indrani Paul.


International Symposium on Computer Architecture | 2013

Cooperative boosting: needy versus greedy power management

Indrani Paul; Srilatha Manne; Manish Arora; W. Lloyd Bircher; Sudhakar Yalamanchili

This paper examines the interaction between thermal management techniques and power boosting in a state-of-the-art heterogeneous processor consisting of a set of CPU and GPU cores. We show that for classes of applications that utilize both the CPU and the GPU, modern boost algorithms that greedily seek to convert thermal headroom into performance can interact with thermal coupling effects between the CPU and the GPU to degrade performance. We first examine the causes of this behavior and explain the interaction between thermal coupling, performance coupling, and workload behavior. Then we propose a dynamic power-management approach called cooperative boosting (CB) to allocate power dynamically between CPU and GPU in a manner that balances thermal coupling against the needs of performance coupling to optimize performance under a given thermal constraint. Through real hardware-based measurements, we evaluate CB against a state-of-the-practice boost algorithm and show that overall application performance and power savings increase by 10% and 8% (up to 52% and 34%), respectively, resulting in average energy efficiency improvement of 25% (up to 76%) over a wide range of benchmarks.
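
To make the needy-versus-greedy trade-off concrete, below is a minimal sketch of a cooperative-boosting style control loop in Python. It assumes simple per-device temperature and utilization telemetry; the field names, 95 C limit, and 1 W step are illustrative placeholders, not the paper's actual algorithm or values.

# One control interval of a cooperative-boosting style policy: shift power
# toward the device the workload is performance-coupled to, and shed power
# from the greedy device under thermal pressure. Values are illustrative.
def cooperative_boost_step(state, t_limit_c=95.0, step_w=1.0):
    gpu_bound = state["gpu_utilization"] > state["cpu_utilization"]
    needy, greedy = ("gpu", "cpu") if gpu_bound else ("cpu", "gpu")
    # Rebalance budget toward the performance-coupled (needy) device.
    state[greedy + "_budget_w"] -= step_w
    state[needy + "_budget_w"] += step_w
    if max(state["cpu_temp_c"], state["gpu_temp_c"]) >= t_limit_c:
        # Under thermal pressure, also shed total power from the greedy
        # side so thermal coupling stops stealing the needy device's headroom.
        state[greedy + "_budget_w"] -= step_w
    return state

state = {"cpu_temp_c": 96.0, "gpu_temp_c": 80.0,
         "cpu_utilization": 0.30, "gpu_utilization": 0.90,
         "cpu_budget_w": 20.0, "gpu_budget_w": 25.0}
print(cooperative_boost_step(state))  # GPU-bound and hot: CPU 18 W, GPU 26 W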


International Performance Computing and Communications Conference | 2012

Performance impact of virtual machine placement in a datacenter

Indrani Paul; Sudhakar Yalamanchili; Lizy Kurian John

In virtualized systems, several Virtual Machines (VMs) running on a single hardware platform share and compete for hardware resources such as memory, disk, and network IO in order to meet certain Quality of Service (QoS) requirements. It is critical to characterize and understand how the different workloads running in the VMs interact and share these resources, so that they can be mapped efficiently onto processor cores and server hosts for optimal performance. This is especially important for resources such as memory controllers and the on-chip or inter-socket networks, for which there is currently no software control. In this paper, we present a measurement-based performance analysis of server virtualization workloads on a real system, using virtual machines from VMmark, a popular industry-standard server consolidation benchmark. First, we characterize the relative resource contention and interference impact of VMs when multiple virtual workloads are run together. Second, we study the effects of co-locating different types of VMs under various VM-to-core placement schemes and discover the best placement for performance. We observe performance variations of 25-65% for database servers and 7-40% for file servers, relative to a standalone VM, depending on the placement of the VMs onto cores and the degree of resource sharing. Finally, we propose an interference metric and regression model for the worst set of co-located VMs in our study. Across the different VM placement schemes, we show that overall server consolidation performance on a virtualized host can be improved by 8% when VMs are placed effectively.
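
The paper's interference metric and regression coefficients are not reproduced here, but the general shape of such a model, an ordinary least-squares fit predicting a VM's slowdown from resource-sharing features, can be sketched as follows. The feature set and all numbers are invented for illustration.

import numpy as np

# Hypothetical features per placement: [shares LLC, shares memory
# controller, co-runner IO intensity]; targets are measured slowdowns.
X = np.array([[1, 1, 0.8],
              [1, 0, 0.2],
              [0, 1, 0.5],
              [0, 0, 0.1]], dtype=float)
y = np.array([0.62, 0.25, 0.33, 0.07])  # fractional slowdown vs. standalone

# Ordinary least squares with an intercept column appended.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

candidate = np.array([1.0, 1.0, 0.4, 1.0])  # a proposed placement + intercept
print("predicted slowdown: %.2f" % (candidate @ coef))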


IEEE Micro | 2015

Achieving Exascale Capabilities through Heterogeneous Computing

Michael J. Schulte; Mike Ignatowski; Gabriel H. Loh; Bradford M. Beckmann; William C. Brantley; Sudhanva Gurumurthi; Nuwan Jayasena; Indrani Paul; Steven K. Reinhardt; Gregory Rodgers

This article provides an overview of AMD's vision for exascale computing and, in particular, how heterogeneity will play a central role in realizing this vision. Exascale computing requires high levels of performance while staying within stringent power budgets. Using hardware optimized for specific functions is much more energy efficient than implementing those functions with general-purpose cores. However, supercomputer customers strongly prefer not to pay for custom components designed only for high-end high-performance computing systems. High-volume GPU technology therefore becomes a natural choice for energy-efficient data-parallel computing. To fully realize the GPU's capabilities, the authors envision exascale computing nodes that combine integrated CPUs and GPUs (that is, accelerated processing units) with the hardware and software support needed to let scientists run their experiments effectively on an exascale system. The authors discuss the hardware and software challenges in building a heterogeneous exascale system and describe ongoing research efforts at AMD to realize their exascale vision.


International Symposium on Computer Architecture | 2015

Harmonia: balancing compute and memory power in high-performance GPUs

Indrani Paul; Wei Huang; Manish Arora; Sudhakar Yalamanchili

In this paper, we address the problem of efficiently managing the relative power demands of a high-performance GPU and its memory subsystem. We develop a management approach that dynamically tunes the hardware operating configurations to maintain balance between the power dissipated in compute and in memory access across GPGPU application phases. Our goal is to reduce power with minimal performance degradation. Accordingly, we construct predictors that assess the online sensitivity of applications to three hardware tunables: compute frequency, number of active compute units, and memory bandwidth. Using these sensitivity predictors, we propose a two-level coordinated power-management scheme, Harmonia, which coordinates the hardware power states of the GPU and the memory system. Through hardware measurements on a commodity GPU, we evaluate Harmonia against a state-of-the-practice commodity GPU power-management scheme, as well as an oracle scheme. Results show that Harmonia improves measured energy-delay squared (ED^2) by up to 36% (12% on average) with negligible performance loss across representative GPGPU workloads, and on average is within 3% of the oracle scheme.
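
For reference, the ED^2 figure of merit used here weights delay quadratically, so a power-management scheme must save proportionally more energy than it gives up in runtime. A small worked example with invented numbers:

def ed2(energy_joules, runtime_seconds):
    # Energy-delay-squared: lower is better; runtime is penalized quadratically.
    return energy_joules * runtime_seconds ** 2

baseline = ed2(energy_joules=120.0, runtime_seconds=10.0)  # 12000 J*s^2
tuned = ed2(energy_joules=100.0, runtime_seconds=10.3)     # ~17% less energy, 3% slower
print("ED^2 improvement: %.1f%%" % (100.0 * (1.0 - tuned / baseline)))  # ~11.6%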


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Coordinated energy management in heterogeneous processors

Indrani Paul; Vignesh T. Ravi; Srilatha Manne; Manish Arora; Sudhakar Yalamanchili

This paper examines energy management in a heterogeneous processor consisting of an integrated CPU-GPU for high-performance computing (HPC) applications. Energy management for HPC applications is challenged by their uncompromising performance requirements and complicated by the need to coordinate energy management across distinct core types, a new and less understood problem. We examine the intra-node CPU-GPU frequency sensitivity of HPC applications on tightly coupled CPU-GPU architectures as the first step in understanding power and performance optimization for a heterogeneous multi-node HPC system. The insights from this analysis form the basis of a coordinated energy management scheme, called DynaCo, for integrated CPU-GPU architectures. We implement DynaCo on a modern heterogeneous processor and compare its performance to a state-of-the-art power- and performance-management algorithm. DynaCo improves the measured average energy-delay squared (ED^2) product by up to 30% with less than 2% average performance loss across several exascale and other HPC workloads.
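
The frequency-sensitivity idea can be illustrated with a back-of-the-envelope calculation: if performance barely changes when a device's clock changes, that device can be slowed to save energy with little HPC performance loss. The sample points below are invented; DynaCo's actual predictors are built from hardware measurements.

def frequency_sensitivity(perf_hi, perf_lo, freq_hi, freq_lo):
    # Relative performance change per relative frequency change:
    # ~1.0 means compute-bound, ~0.0 means insensitive (e.g. memory-bound).
    dperf = (perf_hi - perf_lo) / perf_hi
    dfreq = (freq_hi - freq_lo) / freq_hi
    return dperf / dfreq

cpu_s = frequency_sensitivity(perf_hi=95.0, perf_lo=93.0, freq_hi=3.6, freq_lo=2.4)
gpu_s = frequency_sensitivity(perf_hi=95.0, perf_lo=70.0, freq_hi=1.0, freq_lo=0.7)
print("slow down the", "CPU" if cpu_s < gpu_s else "GPU", "first")  # -> CPU here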


Scientific Programming | 2014

Coordinated Energy Management in Heterogeneous Processors

Indrani Paul; Vignesh T. Ravi; Srilatha Manne; Manish Arora; Sudhakar Yalamanchili

This paper examines energy management in a heterogeneous processor consisting of an integrated CPU–GPU for high-performance computing (HPC) applications. Energy management for HPC applications is challenged by their uncompromising performance requirements and complicated by the need for coordinating energy management across distinct core types – a new and less understood problem. We examine the intra-node CPU–GPU frequency sensitivity of HPC applications on tightly coupled CPU–GPU architectures as the first step in understanding power and performance optimization for a heterogeneous multi-node HPC system. The insights from this analysis form the basis of a coordinated energy management scheme, called DynaCo, for integrated CPU–GPU architectures. We implement DynaCo on a modern heterogeneous processor and compare its performance to a state-of-the-art power- and performance-management algorithm. DynaCo improves measured average energy-delay squared (ED2) product by up to 30% with less than 2% average performance loss across several exascale and other HPC workloads.


High-Performance Computer Architecture | 2017

Design and Analysis of an APU for Exascale Computing

Thiruvengadam Vijayaraghavan; Yasuko Eckert; Gabriel H. Loh; Michael J. Schulte; Mike Ignatowski; Bradford M. Beckmann; William C. Brantley; Joseph L. Greathouse; Wei Huang; Arun Karunanithi; Onur Kayiran; Mitesh R. Meswani; Indrani Paul; Matthew Poremba; Steven E. Raasch; Steven K. Reinhardt; Greg Sadowski; Vilas Sridharan

Pushing computing to exaflop levels is challenging given the desired targets for memory capacity, memory bandwidth, power efficiency, reliability, and cost. This paper presents a vision for an architecture that can be used to construct exascale systems. We describe a conceptual Exascale Node Architecture (ENA), the computational building block for an exascale supercomputer. The ENA consists of an Exascale Heterogeneous Processor (EHP) coupled with an advanced memory system. The EHP provides a high-performance accelerated processing unit (CPU+GPU), in-package high-bandwidth 3D memory, and aggressive use of die-stacking and chiplet technologies to meet the requirements for exascale computing in a balanced manner. We present initial experimental analysis to demonstrate the promise of our approach, and we discuss remaining open research challenges for the community.


IEEE International Symposium on Workload Characterization | 2015

A Taxonomy of GPGPU Performance Scaling

Abhinandan Majumdar; Gene Y. Wu; Kapil Dev; Joseph L. Greathouse; Indrani Paul; Wei Huang; Arjun-Karthik Venugopal; Leonardo Piga; Chip Freitag; Sooraj Puthoor

Graphics processing units (GPUs) range from small, embedded designs to large, high-powered discrete cards. While the performance of graphics workloads is generally understood, there has been little study of the performance of GPGPU applications across a variety of hardware configurations. This work presents performance scaling data gathered for 267 GPGPU kernels from 97 programs run on 891 hardware configurations of a modern GPU. We study the performance of these kernels across a 5× change in core frequency, 8.3× change in memory bandwidth, and 11× difference in compute units. We illustrate that many kernels scale in intuitive ways, such as those that scale directly with added computational capabilities or memory bandwidth. We also find a number of kernels that scale in non-obvious ways, such as losing performance when more processing units are added or plateauing as frequency and bandwidth are increased. In addition, we show that a number of current benchmark suites do not scale to modern GPU sizes, implying that either new benchmarks or new inputs are warranted.
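
A toy classifier conveys the flavor of such a taxonomy: compare a kernel's speedup against the factor by which a resource was scaled. The 10% tolerance and bucket names below are illustrative choices, not the paper's definitions.

def classify_scaling(speedup, resource_factor, tol=0.10):
    # Bucket a kernel by how its speedup tracks a scaled resource.
    if speedup < 1.0 - tol:
        return "negative scaling"   # slower despite added resources
    if speedup >= resource_factor * (1.0 - tol):
        return "linear scaling"     # tracks the added resources
    if speedup <= 1.0 + tol:
        return "plateau"            # insensitive to this resource
    return "sublinear scaling"

# e.g. doubling compute units yields only a 1.05x speedup -> plateau
print(classify_scaling(speedup=1.05, resource_factor=2.0))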


Measurement and Modeling of Computer Systems | 2014

A comparison of core power gating strategies implemented in modern hardware

Manish Arora; Srilatha Manne; Yasuko Eckert; Indrani Paul; Nuwan Jayasena; Dean M. Tullsen

Idle power is a significant contributor to overall energy consumption in modern multi-core processors. Cores can enter a full-sleep state, also known as C6, to reduce idle power; however, entering C6 incurs performance and power overheads. Since power gating can result in negative savings, hardware vendors implement various algorithms to manage C6 entry. In this paper, we examine state-of-the-art C6 entry algorithms and present a comparative analysis in the context of consumer and CPU-GPU benchmarks.
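
The core trade-off admits a simple break-even calculation: C6 pays off only if the core stays idle long enough for the gated-power savings to repay the entry/exit energy. The numbers below are invented; real overheads are implementation-specific.

def c6_break_even_us(idle_power_w, c6_power_w, entry_exit_energy_uj):
    # 1 W equals 1 uJ/us, so dividing uJ by W yields microseconds directly.
    return entry_exit_energy_uj / (idle_power_w - c6_power_w)

threshold = c6_break_even_us(idle_power_w=1.5, c6_power_w=0.1,
                             entry_exit_energy_uj=200.0)
predicted_idle_us = 300.0  # e.g. from an idle-duration predictor
print("break-even: %.0f us" % threshold)  # ~143 us
print("enter C6" if predicted_idle_us > threshold else "stay in shallow idle")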


IEEE International Symposium on Workload Characterization | 2016

Measuring and modeling on-chip interconnect power on real hardware

Vignesh Adhinarayanan; Indrani Paul; Joseph L. Greathouse; Wei Huang; Ashutosh Pattnaik; Wu-chun Feng

On-chip data movement is a major source of power consumption in modern processors, and future technology nodes will exacerbate this problem. Properly understanding the power that applications expend moving data is vital for inventing mitigation strategies. Previous studies combined data movement energy, which is required to move information across the chip, with data access energy, which is used to read or write on-chip memories. This combination can hide the severity of the problem, as memories and interconnects will scale differently to future technology nodes. Thus, increasing the fidelity of our energy measurements is of paramount concern. We propose to use physical data movement distance as a mechanism for separating movement energy from access energy. We then use this mechanism to design microbenchmarks to ascertain data movement energy on a real modern processor. Using these microbenchmarks, we study the following parameters that affect interconnect power: (i) distance, (ii) interconnect bandwidth, (iii) toggle rate, and (iv) voltage and frequency. We conduct our study on an AMD GPU built in 28nm technology and validate our results against industrial estimates for energy/bit/millimeter. We then construct an empirical model based on our characterization and use it to evaluate the interconnect power of 22 real-world applications. We show that up to 14% of the dynamic power in some applications can be consumed by the interconnect and present a range of mitigation strategies.
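
An empirical model in this spirit might scale a per-bit-per-millimeter energy by distance, achieved bandwidth, toggle rate, and supply voltage squared (frequency enters through the achieved bit rate). The 0.1 pJ/bit/mm coefficient below is a placeholder, not the paper's validated value.

def interconnect_power_w(distance_mm, gbits_per_s, toggle_rate,
                         volts, v_nom=1.0, pj_per_bit_mm=0.1):
    # Dynamic wire power: energy per bit scales with distance and V^2;
    # multiplying by the toggling bit rate converts energy to power.
    energy_per_bit_j = pj_per_bit_mm * 1e-12 * distance_mm * (volts / v_nom) ** 2
    return energy_per_bit_j * gbits_per_s * 1e9 * toggle_rate

# e.g. 10 mm average distance, 256 GB/s (2048 Gbit/s), 25% toggle rate:
print("%.2f W" % interconnect_power_w(10.0, 2048.0, 0.25, volts=1.0))  # ~0.51 W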

Collaboration


Dive into Indrani Paul's collaborations.

Top Co-Authors

Wei Huang, Advanced Micro Devices
Sudhakar Yalamanchili, Georgia Institute of Technology
Wayne Burleson, University of Massachusetts Amherst