Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Biswabandan Panda is active.

Publication


Featured researches published by Biswabandan Panda.


Symposium on Computer Architecture and High Performance Computing | 2012

CSHARP: Coherence and SHaring Aware Cache Replacement Policies for Parallel Applications

Biswabandan Panda; Shankar Balachandran

Parallel applications are becoming mainstream, and architectural techniques for multicores that target these applications are the need of the hour. Sharing of data by multiple threads and issues due to data coherence are unique to parallel applications. We propose CSHARP, a hardware framework that brings coherence and sharing awareness to any shared last-level cache replacement policy. We use the degree of sharing of cache lines and the information present in coherence vectors to make replacement decisions. We apply CSHARP to a state-of-the-art cache replacement policy called TA-DRRIP to show its effectiveness. Our experiments on a simulated four-core system show that applying CSHARP to TA-DRRIP gives an extra 10% reduction in miss rate at the LLC. Compared to the LRU policy, CSHARP on TA-DRRIP shows an 18% miss-rate reduction and a 7% performance boost. We also show the scalability of our proposal by studying the hardware overhead and performance on an 8-core system.
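
The core idea, biasing a base policy's victim selection with the sharing degree from the coherence vector, can be sketched as follows. This is a minimal illustration assuming an RRIP-style base policy and a 4-core sharer vector; the structure and the "protect shared lines" rule are illustrative, not the paper's exact hardware.

```cpp
// Hypothetical sketch of sharing-aware victim selection in the spirit of
// CSHARP layered over an RRIP-style base policy (e.g., TA-DRRIP).
#include <bitset>
#include <cstdint>
#include <vector>

struct CacheLine {
    uint8_t rrpv;            // re-reference prediction value from the base policy
    std::bitset<4> sharers;  // coherence vector: one bit per core
    bool valid;
};

// Prefer evicting lines with distant re-reference AND few sharers; widely
// shared lines are evicted only as a last resort. Terminates because aging
// eventually drives some line's RRPV to max_rrpv.
int select_victim(std::vector<CacheLine>& set, uint8_t max_rrpv) {
    while (true) {
        int fallback = -1;
        for (int i = 0; i < static_cast<int>(set.size()); ++i) {
            if (!set[i].valid) return i;
            if (set[i].rrpv == max_rrpv) {
                if (set[i].sharers.count() <= 1) return i;  // unshared: evict
                if (fallback < 0) fallback = i;             // shared: last resort
            }
        }
        if (fallback >= 0) return fallback;
        for (auto& line : set) ++line.rrpv;  // age the set and retry
    }
}
```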


International Symposium on Microarchitecture | 2016

Dictionary sharing: an efficient cache compression scheme for compressed caches

Biswabandan Panda; André Seznec

The effectiveness of a compressed cache depends on three features: (i) the compression scheme, (ii) the compaction scheme, and (iii) the cache layout of the compressed cache. Skewed compressed cache (SCC) and yet another compressed cache (YACC) are two recently proposed compressed cache layouts that feature minimal storage and latency overheads while achieving performance comparable to more complex compressed cache layouts. Both SCC and YACC use compression techniques to compress individual cache blocks, and then a compaction technique to compact multiple contiguous compressed blocks into a single data entry. The primary attribute used by these techniques for compaction is the compression factor of the cache blocks, and in this process they waste cache space. In this paper, we propose dictionary sharing (DISH), a dictionary-based cache compression scheme that reduces this wastage. DISH compresses a cache block keeping in mind that the block is a potential candidate for the compaction process. DISH encodes a cache block with a dictionary that stores the distinct 4-byte chunks of the block, and the dictionary is shared among multiple neighboring cache blocks. The simple encoding scheme of DISH also provides a single-cycle decompression latency, and it does not change the cache layout of compressed caches. Compressed cache layouts that use DISH outperform compression schemes such as BDI and CPACK+Z in terms of compression ratio (from 1.7X to 2.3X), system performance (from 7.2% to 14.3%), and energy efficiency (from 6% to 16%).
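
A rough sketch of the encoding step, under assumed parameters (16 four-byte chunks per 64-byte block, an 8-entry dictionary; the paper's actual parameters may differ): distinct chunks go into a small dictionary and the block is stored as per-chunk indices. Neighboring blocks that encode successfully against the same dictionary can then be compacted into a single data entry.

```cpp
// Minimal sketch of dictionary-based block encoding in the spirit of DISH.
// Dictionary capacity and chunk counts are illustrative assumptions.
#include <array>
#include <cstdint>
#include <optional>
#include <vector>

constexpr size_t kChunks = 16;      // 64-byte block as 4-byte chunks
constexpr size_t kDictEntries = 8;  // assumed dictionary capacity

struct Encoded {
    std::array<uint8_t, kChunks> index;  // 3-bit indices in hardware
};

// Try to encode `block` against `dict`, appending new chunks if room remains.
// Returns std::nullopt if the block has too many distinct chunks, i.e., it is
// incompressible with this (possibly shared) dictionary.
std::optional<Encoded> encode(const std::array<uint32_t, kChunks>& block,
                              std::vector<uint32_t>& dict) {
    Encoded out{};
    for (size_t i = 0; i < kChunks; ++i) {
        size_t j = 0;
        while (j < dict.size() && dict[j] != block[i]) ++j;
        if (j == dict.size()) {
            if (dict.size() == kDictEntries) return std::nullopt;
            dict.push_back(block[i]);
        }
        out.index[i] = static_cast<uint8_t>(j);
    }
    return out;
}
```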


IEEE Computer Architecture Letters | 2016

Expert Prefetch Prediction: An Expert Predicting the Usefulness of Hardware Prefetchers

Biswabandan Panda; Shankar Balachandran

Hardware prefetching improves system performance by hiding and tolerating the latencies of lower levels of cache and off-chip DRAM. An accurate prefetcher improves system performance, whereas an inaccurate prefetcher can cause cache pollution and consume additional bandwidth. Prefetch address filtering techniques improve prefetch accuracy by predicting the usefulness of a prefetch address; based on the outcome of the prediction, the prefetcher decides whether or not to issue a prefetch request. Existing techniques use only one signature to predict the usefulness of a prefetch, but no single predictor works well across all applications. In this work, we propose the weighted-majority filter, an expert way of predicting the usefulness of prefetch addresses. The proposed filter is adaptive in nature and uses the prediction of the best predictor(s) from a pool of predictors. Our filter is orthogonal to the underlying prefetching algorithm. We evaluate the effectiveness of our technique on 22 SPEC-2000/2006 applications. On average, when employed with three state-of-the-art prefetchers, AMPM, SMS, and GHB-PC/DC, our filter provides performance improvements of 8.1, 9.3, and 11 percent, respectively.
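
The filter can be understood as the classic weighted-majority algorithm applied to prefetch usefulness prediction. Below is an illustrative sketch; the predictor interface and the halving update are assumptions in that spirit, not the paper's exact design.

```cpp
// Illustrative weighted-majority filter over a pool of usefulness predictors.
#include <cstdint>
#include <utility>
#include <vector>

struct Predictor {
    virtual bool predict_useful(uint64_t prefetch_addr) = 0;
    virtual ~Predictor() = default;
};

class WeightedMajorityFilter {
    std::vector<Predictor*> experts;
    std::vector<double> weight;
public:
    explicit WeightedMajorityFilter(std::vector<Predictor*> e)
        : experts(std::move(e)), weight(experts.size(), 1.0) {}

    // Issue the prefetch only if the weighted vote predicts it is useful.
    bool allow(uint64_t addr) {
        double yes = 0, no = 0;
        for (size_t i = 0; i < experts.size(); ++i)
            (experts[i]->predict_useful(addr) ? yes : no) += weight[i];
        return yes >= no;
    }

    // On feedback (was the prefetched line actually used?), halve the weight
    // of every expert that predicted wrongly. Re-querying here assumes the
    // predictors are deterministic between prediction and feedback.
    void update(uint64_t addr, bool was_useful) {
        for (size_t i = 0; i < experts.size(); ++i)
            if (experts[i]->predict_useful(addr) != was_useful)
                weight[i] *= 0.5;
    }
};
```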


International Conference on Parallel Architectures and Compilation Techniques | 2013

TCPT: thread criticality-driven prefetcher throttling

Biswabandan Panda; Shankar Balachandran

A single parallel application running on a multicore system shows sub-linear speedup because of the slow progress of one or more threads, known as critical threads. Identifying critical threads and accelerating them can improve system performance. One of the metrics that correlates to thread criticality is the number of cache misses and the penalty associated with them. This paper proposes a throttling mechanism called TCPT, which throttles hardware prefetchers by changing the prefetch degree based on thread criticality.
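
A hedged sketch of degree throttling driven by per-thread miss penalty follows; the criticality estimate (accumulated miss-penalty cycles per epoch) and the degree bounds are illustrative assumptions, not the paper's exact mechanism.

```cpp
// Sketch of criticality-driven prefetch-degree throttling in the spirit of TCPT.
#include <algorithm>
#include <cstdint>
#include <vector>

struct ThreadStats {
    uint64_t miss_penalty_cycles;  // accumulated cache-miss penalty this epoch
};

// Raise the prefetch degree for the most critical (slowest) thread and
// lower it for the least critical one, within [1, max_degree].
void throttle(const std::vector<ThreadStats>& stats, std::vector<int>& degree,
              int max_degree) {
    size_t crit = 0, fast = 0;
    for (size_t i = 1; i < stats.size(); ++i) {
        if (stats[i].miss_penalty_cycles > stats[crit].miss_penalty_cycles) crit = i;
        if (stats[i].miss_penalty_cycles < stats[fast].miss_penalty_cycles) fast = i;
    }
    degree[crit] = std::min(degree[crit] + 1, max_degree);
    degree[fast] = std::max(degree[fast] - 1, 1);
}
```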


IEEE Transactions on Computers | 2016

SPAC: A Synergistic Prefetcher Aggressiveness Controller for Multi-Core Systems

Biswabandan Panda

In multi-core systems, prefetch requests of one core interfere with the demand and prefetch requests of other cores at the shared resources, causing prefetcher-caused interference. Prefetcher aggressiveness controllers play an important role in minimizing this interference. State-of-the-art controllers such as hierarchical prefetcher aggressiveness control (HPAC) select appropriate throttling levels that can lead to improvements in system performance. However, HPAC does not consider the interactions between the throttling decisions of multiple prefetchers, and it loses the opportunity to improve system performance further. For multi-core systems, state-of-the-art prefetcher aggressiveness controllers control aggressiveness based on prefetch metrics such as accuracy, bandwidth consumption, and cache pollution. We propose a synergistic prefetcher aggressiveness controller (SPAC), which explores the interactions between the throttling decisions of prefetchers and throttles the prefetchers based on the improvement in the fair speedup of multi-core systems.
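
The objective SPAC optimizes for is fair speedup, the harmonic mean of per-core speedups relative to each core running alone. A minimal sketch, assuming some estimator supplies predicted per-core speedups for each candidate joint throttling decision:

```cpp
// Fair-speedup objective and joint-decision selection, illustrative only.
#include <cstddef>
#include <vector>

// speedup[i] = IPC_shared[i] / IPC_alone[i]
double fair_speedup(const std::vector<double>& speedup) {
    double inv_sum = 0;
    for (double s : speedup) inv_sum += 1.0 / s;
    return speedup.size() / inv_sum;
}

// Among candidate joint throttling decisions, pick the one whose predicted
// per-core speedups maximize fair speedup (hypothetical estimator interface).
size_t best_decision(const std::vector<std::vector<double>>& predicted) {
    size_t best = 0;
    for (size_t d = 1; d < predicted.size(); ++d)
        if (fair_speedup(predicted[d]) > fair_speedup(predicted[best])) best = d;
    return best;
}
```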


International Conference on Parallel Architectures and Compilation Techniques | 2014

XStream: cross-core spatial streaming based MLC prefetchers for parallel applications in CMPs

Biswabandan Panda; Shankar Balachandran

Hardware prefetchers are commonly used to hide and tolerate off-chip memory latency. Prefetching techniques in the literature are designed for multiple independent sequential applications running on a multicore system. In contrast to multiple independent applications, a single parallel application running on a multicore system exhibits different behavior. In the case of a parallel application, cores share and communicate data and code among themselves, and there is commonality in the demand miss streams across multiple cores. This gives an opportunity to predict the demand miss streams and communicate the predicted streams from one core to another, which we refer to as cross-core stream communication. We propose cross-core spatial streaming (XStream), a practical and storage-efficient cross-core prefetching technique. XStream detects and predicts the cross-core spatial streams at the private mid-level caches (MLCs) and sends the predicted streams in advance to the MLC prefetchers of the predicted cores. We compare the effectiveness of XStream with an ideal cross-core spatial streamer. Experimental results demonstrate that, on average (geomean), compared to state-of-the-art spatial memory streaming, storage-efficient XStream reduces execution time by 11.3% (as high as 24%) and 9% (as high as 29.09%) for 4-core and 8-core systems, respectively.
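
A simplified sketch of the mechanism follows: one core records the spatial pattern of its MLC misses within a region, and the completed pattern is forwarded so another core can prefetch those blocks. Region size, pattern width, and the messaging interface are assumptions for illustration.

```cpp
// Simplified cross-core spatial stream forwarding in the spirit of XStream.
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

constexpr uint64_t kRegionBits = 11;  // 2 KB spatial region (assumed)

struct SpatialStream {
    uint64_t region_base;
    std::bitset<32> pattern;  // which 64 B blocks in the region missed
};

class XStreamTable {
    std::unordered_map<uint64_t, std::bitset<32>> live;  // per-region pattern
public:
    // Record an MLC demand miss in the region's block-pattern bit vector.
    void record_miss(uint64_t addr) {
        uint64_t region = addr >> kRegionBits;
        live[region].set((addr >> 6) & 31);
    }
    // Retire a region; the completed stream is what gets sent to the
    // predicted core's MLC prefetcher.
    SpatialStream retire(uint64_t region) {
        SpatialStream s{region << kRegionBits, live[region]};
        live.erase(region);
        return s;
    }
};

// On the receiving core, prefetch the blocks the sender's pattern predicts.
void prefetch_from(const SpatialStream& s, void (*issue_prefetch)(uint64_t)) {
    for (size_t b = 0; b < 32; ++b)
        if (s.pattern.test(b)) issue_prefetch(s.region_base + (b << 6));
}
```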


Design, Automation, and Test in Europe | 2014

Introducing Thread Criticality awareness in Prefetcher Aggressiveness Control

Biswabandan Panda; Shankar Balachandran

A single parallel application running on a multi-core system shows sub-linear speedup because of the slow progress of one or more threads, known as critical threads. Some of the reasons for the slow progress of threads are (1) load imbalance, (2) frequent cache misses, and (3) the effect of synchronization primitives. Identifying critical threads and minimizing their cache miss latencies can improve the overall execution time of a program. One way to hide and tolerate cache misses is through hardware prefetching, one of the most commonly used memory latency hiding techniques. Previous studies have shown the effectiveness of hardware prefetchers for multiprogrammed workloads (multiple sequential applications running independently on different cores). In contrast to multiprogrammed workloads, the performance of a single parallel application depends on the progress of slow-progress (critical) threads. This paper introduces a prefetcher aggressiveness control mechanism called Thread Criticality-aware Prefetcher Aggressiveness Control (TCPAC). TCPAC controls the aggressiveness of the prefetchers at the L2 prefetching controllers (TCPAC-P), the DRAM controller (TCPAC-D), and the Last Level Cache (LLC) controller (TCPAC-C) using prefetch accuracy and thread progress. Each TCPAC sub-technique outperforms the respective state-of-the-art techniques such as HPAC [2], PADC [4], and PACMan [3], and the combination of all the TCPAC sub-techniques, named TCPAC-PDC, outperforms the combination of HPAC, PADC, and PACMan. On average, on an 8-core system, in terms of improvement in execution time, TCPAC-PDC outperforms the combination of HPAC, PADC, and PACMan by 7.61%. For 12 and 16 cores, TCPAC-PDC beats the state-of-the-art combinations by 7.21% and 8.32%, respectively.
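
The combination of the two feedback signals can be sketched as a per-core degree update: critical threads with accurate prefetchers get boosted, while non-critical or inaccurate ones are backed off. The thresholds below are illustrative, not the paper's tuned values.

```cpp
// Minimal sketch of accuracy- and progress-driven throttling in the spirit
// of TCPAC's per-controller decisions.
#include <algorithm>

struct CoreFeedback {
    double prefetch_accuracy;  // useful prefetches / issued prefetches
    double progress;           // this thread's progress vs. its peers
};

// Returns a new prefetch degree in [0, max_degree].
int tcpac_degree(const CoreFeedback& f, int cur, int max_degree) {
    const double kAccHi = 0.75, kAccLo = 0.40, kCritical = 0.90;  // assumed
    bool critical = f.progress < kCritical;  // lagging behind peers
    if (critical && f.prefetch_accuracy >= kAccHi)
        return std::min(cur + 1, max_degree);  // help the critical thread
    if (!critical || f.prefetch_accuracy < kAccLo)
        return std::max(cur - 1, 0);           // yield bandwidth to others
    return cur;
}
```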


International Conference on Parallel Architectures and Compilation Techniques | 2012

Hardware prefetchers for emerging parallel applications

Biswabandan Panda; Shankar Balachandran

Hardware prefetching has been studied in the past for multiprogrammed workloads as well as GPUs. Efficient hardware prefetchers, like stream-based or GHB-based ones, work well for multiprogrammed workloads because different programs get mapped to different cores and run independently. Parallel applications, however, pose a different set of challenges. Multiple threads of a parallel application share data with each other, which brings in coherency issues. Also, local prefetchers do not understand the irregular spread of misses across threads. In this paper, we propose a hardware prefetching framework for the L1 D-Cache that targets parallel applications. We show how to make efficient prefetch requests to the L2 cache by studying and classifying the patterns of L1 misses across all the threads. Our preliminary results show an average improvement of 7% in execution time on the PARSEC benchmark suite.
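
One way to picture the classification step is deciding, per miss, whether a thread is following a private stride or touching a region other threads have also missed on. The taxonomy and region granularity below are assumed simplifications for illustration, not the paper's classifier.

```cpp
// Toy classification of L1-D miss streams across threads.
#include <cstdint>
#include <unordered_set>

enum class MissClass { StridePrivate, SharedRegion, Irregular };

struct ThreadMissHistory {
    uint64_t last_addr = 0;
    int64_t last_stride = 0;
};

MissClass classify(ThreadMissHistory& h, uint64_t addr,
                   const std::unordered_set<uint64_t>& other_threads_regions) {
    int64_t stride = static_cast<int64_t>(addr) - static_cast<int64_t>(h.last_addr);
    bool stride_hit = (stride == h.last_stride && stride != 0);
    h.last_stride = stride;
    h.last_addr = addr;
    if (stride_hit) return MissClass::StridePrivate;
    if (other_threads_regions.count(addr >> 12))  // same 4 KB region missed elsewhere
        return MissClass::SharedRegion;
    return MissClass::Irregular;
}
```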


IEEE Transactions on Computers | 2016

PBC: Prefetched Blocks Compaction

Kanakagiri Raghavendra; Biswabandan Panda; Madhu Mutyam

Cache compression improves the performance of a multi-core system by storing more cache blocks in a compressed format. Compression is achieved by exploiting data patterns present within a block. For a given cache space, compression increases the effective cache capacity. However, this increase is limited by the number of tags that can be accommodated at the cache. Prefetching is another technique that improves system performance by fetching cache blocks into the cache ahead of time and hiding the off-chip latency. Commonly used hardware prefetchers, such as stream and stride, fetch multiple contiguous blocks into the cache. In this paper, we propose prefetched blocks compaction (PBC), wherein we exploit the data patterns present across these prefetched blocks. PBC compacts the prefetched blocks into a single block with a single tag, effectively increasing the cache capacity. We also modify the cache organization to access these multiple cache blocks residing in a single block without any need for extra tag look-ups. PBC improves system performance by 11.1 percent, with a maximum of 43.4 percent, on a four-core system.
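
An illustrative compaction check: four contiguous prefetched blocks are kept under one tag if, treated as one stream of 8-byte words, they fit a common base plus narrow deltas (a BDI-like pattern applied across blocks). The pattern choice and parameters are our assumptions, not necessarily PBC's exact encoding.

```cpp
// Cross-block compaction check in the spirit of PBC, illustrative only.
#include <array>
#include <cstdint>
#include <optional>

constexpr size_t kWordsPerBlock = 8;  // 64 B block, 8-byte words
constexpr size_t kBlocks = 4;         // contiguous prefetched blocks

struct Compacted {
    uint64_t base;
    std::array<int16_t, kWordsPerBlock * kBlocks> delta;  // narrow deltas
};

std::optional<Compacted> try_compact(
        const std::array<std::array<uint64_t, kWordsPerBlock>, kBlocks>& blk) {
    Compacted out{blk[0][0], {}};
    for (size_t b = 0; b < kBlocks; ++b)
        for (size_t w = 0; w < kWordsPerBlock; ++w) {
            int64_t d = static_cast<int64_t>(blk[b][w] - out.base);
            if (d < INT16_MIN || d > INT16_MAX) return std::nullopt;
            out.delta[b * kWordsPerBlock + w] = static_cast<int16_t>(d);
        }
    return out;  // one tag now covers all four prefetched blocks
}
```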


ACM Transactions on Architecture and Code Optimization | 2015

CAFFEINE: A Utility-Driven Prefetcher Aggressiveness Engine for Multicores

Biswabandan Panda; Shankar Balachandran

Aggressive prefetching improves system performance by hiding and tolerating off-chip memory latency. However, on a multicore system, prefetchers of different cores contend for shared resources and aggressive prefetching can degrade the overall system performance. The role of a prefetcher aggressiveness engine is to select appropriate aggressiveness levels for each prefetcher such that shared resource contention caused by prefetchers is reduced, thereby improving system performance. State-of-the-art prefetcher aggressiveness engines monitor metrics such as prefetch accuracy, bandwidth consumption, and last-level cache pollution. They use carefully tuned thresholds for these metrics, and when the thresholds are crossed, they trigger aggressiveness control measures. These engines have three major shortcomings: (1) thresholds are dependent on the system configuration (cache size, DRAM scheduling policy, and cache replacement policy) and have to be tuned appropriately, (2) there is no single threshold that works well across all the workloads, and (3) thresholds are oblivious to the phase change of applications. To overcome these shortcomings, we propose CAFFEINE, a model-based approach that analyzes the effectiveness of a prefetcher and uses a metric called net utility to control the aggressiveness. Our metric provides net processor cycles saved because of prefetching by approximating the cycles saved across the memory subsystem, from last-level cache to DRAM. We evaluate CAFFEINE across a wide range of workloads and compare it with the state-of-the-art prefetcher aggressiveness engine. Experimental results demonstrate that, on average (geomean), CAFFEINE achieves 9.5% (as much as 38.29%) and 11% (as much as 20.7%) better performance than the best-performing aggressiveness engine for four-core and eight-core systems, respectively.
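
A hedged sketch of a net-utility computation in CAFFEINE's spirit: cycles saved by useful prefetches minus cycles lost to prefetcher-caused pollution and extra DRAM traffic. The latency constants and stat definitions below are assumptions, not the paper's calibrated model.

```cpp
// Net utility of prefetching, measured in processor cycles (illustrative).
#include <cstdint>

struct PrefetchStats {
    uint64_t useful_prefetches;    // prefetched lines hit by demand accesses
    uint64_t polluted_misses;      // demand misses caused by prefetch evictions
    uint64_t extra_dram_requests;  // prefetch traffic beyond demand traffic
};

// Positive net utility -> raise aggressiveness; negative -> throttle down.
int64_t net_utility_cycles(const PrefetchStats& s) {
    const int64_t kDramLatency   = 200;  // assumed cycles saved per useful prefetch
    const int64_t kPollutionCost = 200;  // assumed cycles per pollution miss
    const int64_t kBusCost       = 20;   // assumed queueing cost per extra request
    return static_cast<int64_t>(s.useful_prefetches) * kDramLatency
         - static_cast<int64_t>(s.polluted_misses) * kPollutionCost
         - static_cast<int64_t>(s.extra_dram_requests) * kBusCost;
}
```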

Collaboration


Dive into Biswabandan Panda's collaborations.

Top Co-Authors

Shankar Balachandran
Indian Institute of Technology Madras

Madhu Mutyam
Indian Institute of Technology Madras

Kanakagiri Raghavendra
Indian Institute of Technology Madras