
Publication


Featured research published by Shirshendu Das.


Microprocessors and Microsystems | 2014

Victim retention for reducing cache misses in tiled chip multiprocessors

Shirshendu Das; Hemangee K. Kapoor

This paper presents CMP-VR (Chip Multiprocessor with Victim Retention), an approach to improve cache performance by reducing the number of off-chip memory accesses. The objective is to retain chosen victim cache blocks on the chip for as long as possible. Some sets of the CMP's last-level cache (LLC) may be heavily used while others are not. In CMP-VR, a number of ways in every set serve as reserved storage, allowing a victim block from a heavily used set to be stored in the reserve space of another set. In this way the load of heavily used sets is distributed among the underused sets, logically increasing the associativity of the heavily used sets without increasing the actual associativity or size of the cache. Experimental evaluation using full-system simulation shows that CMP-VR has a lower off-chip miss rate than a baseline tiled CMP. Results are presented for different cache sizes and associativities for both CMP-VR and the baseline configuration. The best improvements obtained are 45.5% in miss rate and 14% in cycles per instruction (CPI) for a 4 MB, 4-way set-associative LLC. Reducing both CPI and miss rate together guarantees a performance improvement.
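The reserve-way mechanism described in the abstract can be sketched as a toy model. The class name, the set/way sizes, and the first-free-reserve placement rule below are illustrative assumptions, not details from the paper:

```python
class CMPVRCache:
    """Toy model of CMP-VR-style victim retention: each set has normal
    ways plus reserved ways that may hold victims evicted from *other*
    sets, so a hot set borrows capacity from underused ones."""

    def __init__(self, num_sets=4, normal_ways=3, reserve_ways=1):
        self.normal = [[] for _ in range(num_sets)]   # per-set normal ways
        self.reserve = [[] for _ in range(num_sets)]  # per-set reserved ways
        self.normal_ways = normal_ways
        self.reserve_ways = reserve_ways

    def lookup(self, set_idx, tag):
        # A block may live in its home set, or in any set's reserve space.
        if tag in self.normal[set_idx]:
            return True
        return any(tag in r for r in self.reserve)

    def insert(self, set_idx, tag):
        if self.lookup(set_idx, tag):
            return
        ways = self.normal[set_idx]
        if len(ways) < self.normal_ways:
            ways.append(tag)
            return
        victim = ways.pop(0)        # evict the oldest block of the home set
        ways.append(tag)
        # Retain the victim on chip in the first set with free reserve space.
        for r in self.reserve:
            if len(r) < self.reserve_ways:
                r.append(victim)
                return
        # No reserve space free: the victim leaves the chip.
```

Filling a set beyond its normal associativity then still leaves the victim findable on chip, which is the effect the paper measures as a lower off-chip miss rate.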


International Symposium on Electronic System Design | 2013

Dynamic Associativity Management Using Fellow Sets

Shirshendu Das; Hemangee K. Kapoor

The memory accesses of today's applications are non-uniformly distributed across the cache sets; as a result, some sets are heavily used while others remain underutilized. This paper presents CMP-SVR, an approach to dynamically increase the associativity of heavily used sets without increasing the cache size. It divides the last-level cache (LLC) into two sections: normal storage (NT) and reserve storage (RT). A fraction of the ways (25% to 50%) in each set is reserved for RT and the remaining ways belong to NT. The sets are divided into groups called fellow-groups, and a set can use the reserve ways of its fellow sets to increase its associativity during execution. An additional tag array (SA-TGS) for RT makes searching easier and less expensive. SA-TGS behaves much like an N-way set-associative cache whose associativity depends on the number of reserve ways per set and the number of sets in a fellow-group. CMP-SVR has lower storage, area and energy overheads than other techniques proposed for dynamic associativity management. Full-system simulation shows an average improvement of 8% in cycles per instruction (CPI) and 28% in miss rate.
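The fellow-group idea can be illustrated with a minimal sketch: sets are grouped by index range, and a lookup that misses in a set's own normal ways falls back to the reserve ways of its fellows. The group size and the NT/RT layout below are assumed values, not the paper's:

```python
def fellow_sets(set_idx, group_size=4):
    """Sets in the same fixed-size index range form one fellow-group."""
    base = (set_idx // group_size) * group_size
    return [s for s in range(base, base + group_size) if s != set_idx]

def search(tag, set_idx, nt, rt, group_size=4):
    """Check the set's normal ways (NT) first, then the reserve ways (RT)
    of every fellow set, mimicking the SA-TGS fallback search.
    Returns the set that holds the block, or None on a miss."""
    if tag in nt[set_idx]:
        return set_idx
    for s in fellow_sets(set_idx, group_size):
        if tag in rt[s]:
            return s
    return None
```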


International Conference on VLSI Design | 2015

Exploration of Migration and Replacement Policies for Dynamic NUCA over Tiled CMPs

Shirshendu Das; Hemangee K. Kapoor

Multicore processors have proliferated across several domains, ranging from small-scale embedded systems to large data centers, making tiled CMPs (TCMPs) the essential next-generation scalable architecture. More processors need a proportionally larger cache to support concurrent applications, and NUCA architectures help manage the capacity and access time of such large cache designs. Static NUCA (SNUCA) has a fixed address-mapping policy, whereas dynamic NUCA (DNUCA) allows blocks to relocate nearer to the processing cores at runtime. DNUCA architectures are well explored for systems with centralized cache banks and processors along the periphery; SNUCA is well understood and explored for tiled CMPs, but the same is not true of DNUCA. Toward exploring implementations of DNUCA for tiled CMPs, this paper presents an architecture along with migration and replacement policies. We discuss the differences and challenges of TCMP-based DNUCA compared to the original DNUCA, and present results for different configurations of migration and replacement policies. Simulation results show improvements over TCMP-based SNUCA of 44.7% in miss rate and 9.6% in cycles per instruction (CPI).
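A common DNUCA migration policy moves a block one bank closer to the requesting core on each hit. A minimal sketch of one such step on a mesh of banks follows; the XY movement order and the 4-wide mesh are assumptions for illustration, not the paper's evaluated policy:

```python
def migrate_one_step(bank, core_bank, mesh_width=4):
    """Move a block's bank one hop toward the requester's bank on a
    mesh, resolving the X coordinate before the Y coordinate."""
    bx, by = bank % mesh_width, bank // mesh_width
    cx, cy = core_bank % mesh_width, core_bank // mesh_width
    if bx != cx:
        bx += 1 if cx > bx else -1
    elif by != cy:
        by += 1 if cy > by else -1
    return by * mesh_width + bx   # unchanged if already at the requester
```

Repeated hits therefore drag a hot block hop by hop toward the core that uses it, which is the gradual relocation DNUCA exploits.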


ACM Symposium on Applied Computing | 2015

Static energy reduction by performance linked cache capacity management in tiled CMPs

Hemangee K. Kapoor; Shirshendu Das; Shounak Chakraborty

With the rapid growth in semiconductor technology, modern processor chips contain multiple processor cores with multi-level on-chip caches. Recent studies of chip power consumption indicate that a principal share of chip power is consumed by the on-chip caches, and this power can be divided into two major parts: dynamic power and static power. Dynamic power is consumed when the cache is accessed, while static power is generally referred to as the leakage power of the cache. Increased chip power consumption raises the chip temperature, which in turn increases on-chip leakage power. In this paper we attempt to reduce static power consumption by intelligently powering off cache banks and mapping their requests to other active banks. We use a performance-based criterion for the shutdown decision, and the bank to be powered off is chosen based on usage statistics. The remapping of requests for a powered-off bank is done at the L2 controller, so the approach is transparent to the L1 caches. Thus, depending on the application's workload and data distribution, a controlled number of banks can be dynamically shut down, saving leakage power. Experimental analysis shows a 43% reduction in static power and a 19% reduction in EDP.
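The shutdown-and-remap flow can be sketched as below. The usage metric, the performance-budget check, and the single-target remap table are simplifying assumptions standing in for the paper's criteria:

```python
def choose_bank_to_shutdown(usage, perf_loss, perf_budget):
    """Pick the least-used active bank for power-off, but only while the
    estimated performance loss stays within the allowed budget."""
    if perf_loss >= perf_budget or not usage:
        return None
    return min(usage, key=usage.get)

def remap(bank, target_of):
    """At the L2 controller, forward a request for a powered-off bank to
    its recorded target bank; the L1 caches never see the redirection."""
    return target_of.get(bank, bank)
```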


The Journal of Supercomputing | 2013

Design and formal verification of a hierarchical cache coherence protocol for NoC based multiprocessors

Hemangee K. Kapoor; Praveen Kanakala; Malti Verma; Shirshendu Das

Advances in semiconductor technology allow packing more and more processing cores on a single die, and scalable directory-based protocols are needed to maintain cache coherence. Most currently available directory-based protocols are designed for mesh-based topologies and suffer from delay and scalability problems. A cluster-based coherence protocol is a better option than a flat directory-based protocol, but the problems of the mesh topology still exist. A tree-based topology, on the other hand, requires fewer hop counts than a mesh. In this paper we give a hierarchical cache coherence protocol based on a tree topology. We divide the processing cores into clusters, with each cluster sharing a higher-level cache. At the next level we form clusters of caches connected to yet another higher-level cache, and this continues up to the top-level cache/memory. We give various architectural placements that can benefit from the protocol, a hop-count comparison, and the memory overhead requirements. Finally, we formally verify the protocol using the Murϕ tool.
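The hop-count advantage of a tree over a mesh can be checked with a small sketch. Distances are idealized assumptions: Manhattan hops on the mesh, and up-and-down hops through the lowest common ancestor of a k-ary tree whose leaves are the cores:

```python
def mesh_hops(a, b, width):
    """Manhattan distance between two nodes on a `width`-wide mesh."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)

def tree_hops(a, b, arity):
    """Hops between two leaves of a k-ary tree: up to the lowest common
    ancestor and back down."""
    up = 0
    while a != b:
        a //= arity
        b //= arity
        up += 1
    return 2 * up
```

For 16 cores, opposite corners of a 4x4 mesh are 6 hops apart, while the same pair in a 4-ary tree is 4 hops apart, illustrating the paper's motivation.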


VDAT | 2013

Random-LRU: A Replacement Policy for Chip Multiprocessors

Shirshendu Das; Nagaraju Polavarapu; Prateek D. Halwe; Hemangee K. Kapoor

As the number of cores and the associativity of the last-level cache (LLC) on a chip multiprocessor increase, the role of replacement policies becomes more vital. Although the pure least recently used (LRU) policy has some issues, it has generally been believed that variants of LRU perform better than other policies, and a lot of work has been proposed to improve the performance of LRU-based policies. However, true LRU imposes additional complexity and area overheads when implemented on highly associative LLCs, and most LRU-based work is motivated more by performance improvement than by reducing the area and hardware overhead of the true LRU scheme. In this paper we propose an LRU-based cache replacement policy, especially for the LLC, that improves the performance of LRU while reducing the area and hardware cost of pure LRU by more than half. We use a combination of random and LRU replacement for each cache set: instead of applying LRU to the entire set, we apply it only to some of the ways within the set. Experiments conducted on a full-system simulator show 36% and 11% improvements in miss rate and CPI respectively.
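The hybrid policy can be sketched as follows: true LRU ordering is kept only for a few ways per set, and a random way is evicted from the rest. The eviction rule below (always prefer a random untracked way) is an illustrative choice, not necessarily the paper's exact policy:

```python
import random

def choose_victim(ways, lru_tracked, rng=None):
    """`ways` holds one set's blocks; the first `lru_tracked` entries are
    kept oldest-first by a small LRU stack, the rest are untracked.
    Evict a random untracked way when one exists (cheap, no per-way age
    bits), otherwise fall back to the LRU entry of the tracked region."""
    rng = rng or random.Random(0)
    untracked = ways[lru_tracked:]
    if untracked:
        return rng.choice(untracked)
    return ways[0]
```

Tracking age for only part of the set is what cuts the LRU state and comparison hardware roughly in proportion to the untracked fraction.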


International Parallel and Distributed Processing Symposium | 2015

Performance Constrained Static Energy Reduction Using Way-Sharing Target-Banks

Shounak Chakraborty; Shirshendu Das; Hemangee K. Kapoor

Most chip multiprocessors share a common, large last-level cache (LLC). In non-uniform cache access (NUCA) architectures, the LLC is divided into multiple banks that are accessed independently. It has been observed that a principal share of chip power in a CMP is consumed by the LLC banks, and this power can be divided into two major parts: dynamic and static. Techniques have been proposed to reduce the static power of the LLC by powering off less-utilized banks and forwarding their requests to other active banks (target banks). Once a bank is powered off, all future requests arrive at its controller and are forwarded to the target bank. Such bank shutdown saves static power but reduces LLC performance; with multiple banks shut down, the target banks may become overloaded, and the request forwarding increases on-chip traffic. In this paper, we improve the performance of the target banks by dynamically managing their associativity, and we optimize the cost of request forwarding by considering network distance as an additional metric for target selection. These two strategies help reduce performance degradation. Experimental analysis shows a 43% reduction in static energy and a 23% reduction in EDP for a 4 MB LLC under a performance constraint of 3%.
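Target-bank selection that weighs both load and network distance can be sketched with a combined score. The `alpha` weight and the normalization are assumed knobs for illustration, not values from the paper:

```python
def pick_target_bank(candidates, loads, dist_from, alpha=0.5):
    """Choose the active bank minimizing a blend of normalized load and
    normalized network distance from the powered-off bank, so requests
    avoid both overloaded and far-away targets."""
    max_l = max(loads.values()) or 1
    max_d = max(dist_from.values()) or 1
    return min(candidates,
               key=lambda b: alpha * loads[b] / max_l
                             + (1 - alpha) * dist_from[b] / max_d)
```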


IEEE International Conference on Dependable, Autonomic and Secure Computing | 2013

Towards a Better Cache Utilization Using Controlled Cache Partitioning

Prateek D. Halwe; Shirshendu Das; Hemangee K. Kapoor

Many multi-core processors nowadays employ a shared last-level cache (LLC), and partitioning the LLC becomes more important as it is shared among the cores. Past research has demonstrated that the traditional least recently used (LRU) partitioning-cum-replacement policy has adverse effects on parameters such as instructions per cycle (IPC), miss rate and speedup, leading to poor performance when multiple cores compete for one global LLC. Applications enjoying locality of reference clearly benefit from LRU; however, LRU fails for applications whose working set size (WSS) is larger than the LLC. In this work, we propose a scheme that allows cores to steal or donate lines up to a threshold, giving them a chance to adjust their partition when there is a miss. Instead of maintaining a strict target partitioning, we introduce a flexible threshold window. Our evaluation with multiprogrammed workloads shows significant performance improvement.
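The steal/donate decision with a flexible threshold window might look like the sketch below; the donor-selection rule (take from the core furthest above its target) is an illustrative assumption:

```python
def resolve_miss(core, occupancy, target, window):
    """On a miss by `core`, decide whether it may steal a line and from
    which core: a core below target + window may steal, and any other
    core above target - window may donate. Returns the donor core, or
    None if the miss must be handled within the core's own partition."""
    if occupancy[core] >= target[core] + window:
        return None   # already over the flexible quota: replace locally
    donors = [c for c in occupancy
              if c != core
              and occupancy[c] > target[c] - window
              and occupancy[c] > 0]
    if not donors:
        return None
    return max(donors, key=lambda c: occupancy[c] - target[c])
```

The window keeps partitions elastic: a core's occupancy may drift around its target by up to `window` lines before stealing or donating is blocked.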


ACM Transactions on Design Automation of Electronic Systems | 2016

A Framework for Block Placement, Migration, and Fast Searching in Tiled-DNUCA Architecture

Shirshendu Das; Hemangee K. Kapoor

Multicore processors have proliferated across several domains, ranging from small-scale embedded systems to large data centers, making tiled CMPs (TCMPs) the essential next-generation scalable architecture. NUCA architectures help manage the capacity and access time of such large cache designs by dividing the last-level cache (LLC) into multiple banks connected through an on-chip network. Static NUCA (SNUCA) has a fixed address-mapping policy, whereas dynamic NUCA (DNUCA) allows blocks to relocate nearer to the processing cores at runtime. To allow this, DNUCA divides the banks into multiple banksets, and a block can be placed in any bank within a particular bankset; the entire bankset may need to be searched to access a block, so efficient bankset-searching mechanisms are essential to obtain the benefits of DNUCA. This article proposes a DNUCA-based TCMP architecture called TLD-NUCA. It reduces the LLC access time of a TCMP and also allows a heavily loaded bank to distribute its load among underused banks. Unlike other DNUCA designs, TLD-NUCA considers larger banksets; this relaxation results in a more uniform load distribution than the existing DNUCA-based TCMP (T-DNUCA) achieves. Larger banksets improve the utilization factor, but T-DNUCA cannot implement them because of its expensive searching mechanism. TLD-NUCA instead uses a centralized directory, called the TLD, to locate a block among all the banks, and the proposed block placement policy reduces the instances in which the central TLD must be contacted, avoiding the expensive simultaneous search that T-DNUCA requires. Better cache utilization and reduced LLC access time improve the miss rate as well as the average memory access time (AMAT), which in turn improves cycles per instruction (CPI). Experimental analysis found that TLD-NUCA improves performance by 6.5% compared to T-DNUCA, and by 13% compared to the SNUCA-based TCMP design.
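A centralized tag-location directory of the kind the article calls the TLD can be modeled minimally as an address-to-bank map consulted instead of searching every bank in the bankset. The API below is an illustrative reconstruction, not the article's actual structure:

```python
class TLD:
    """Toy tag-location directory: records which bank currently holds
    each block, so one directory lookup replaces a simultaneous search
    of every bank in a large bankset."""

    def __init__(self):
        self.where = {}           # block address -> bank id

    def place(self, addr, bank):
        self.where[addr] = bank   # block filled into the LLC

    def migrate(self, addr, new_bank):
        if addr in self.where:    # block relocated nearer a core
            self.where[addr] = new_bank

    def locate(self, addr):
        return self.where.get(addr)   # None means an LLC miss
```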


ACM Symposium on Applied Computing | 2015

Dynamic associativity management using utility based way-sharing

Shirshendu Das; Hemangee K. Kapoor

The non-uniform distribution of memory accesses in today's applications affects the performance of cache memories: due to such non-uniform accesses, some sets of a large cache are used heavily while others are used lightly. This paper presents WS-DAM, a technique to dynamically increase the associativity of the heavily used sets without increasing the cache size. The heavily used sets can use the idle ways of the lightly used sets to distribute their load; a limited number of ways from every lightly used set is reserved for the heavily used sets. To find a block mapped to a heavily used set, both the set and the entire reserve area are searched, and an additional tag array is used to reduce the cost of searching the reserve storage. During execution the sets are re-categorized at intervals. The proposed technique needs much less storage, area and power than other, similar techniques. It improves miss rate and CPI by 14.46% and 6.63% respectively compared to an existing technique called V-Way, and by 9% and 4.20% in terms of miss rate and CPI respectively compared to another existing proposal, CMP-VR.
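Interval-based re-categorization of sets into heavy (borrowers) and light (lenders) can be sketched as below; the miss-count threshold is an assumed utility metric, not the paper's exact criterion:

```python
def categorize_sets(miss_counts, threshold):
    """At the end of each interval, split sets by observed misses:
    'heavy' sets may borrow reserved ways, 'light' sets lend them."""
    heavy = [s for s, m in miss_counts.items() if m > threshold]
    light = [s for s, m in miss_counts.items() if m <= threshold]
    return heavy, light
```

Re-running this at intervals lets the borrowed capacity follow phase changes in the workload instead of fixing the heavy/light split once.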

Collaboration


Shirshendu Das's collaborators and their affiliations.

Top Co-Authors

Hemangee K. Kapoor (Indian Institute of Technology Guwahati)
Shounak Chakraborty (Indian Institute of Technology Guwahati)
Prateek D. Halwe (Indian Institute of Technology Guwahati)
Ka Lok Man (Xi'an Jiaotong-Liverpool University)
B. Venkateswarlu Naik (Indian Institute of Technology Guwahati)
Dhantu Buragohain (Indian Institute of Technology Guwahati)
Dipika Deb (Indian Institute of Technology Guwahati)
Kartheek Vanapalli (Indian Institute of Technology Guwahati)
M. Lakshmi Prasad (Indian Institute of Technology Guwahati)
Malti Verma (Indian Institute of Technology Guwahati)