Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hemangee K. Kapoor is active.

Publication


Featured research published by Hemangee K. Kapoor.


Information Processing Letters | 2004

Modelling and verification of delay-insensitive circuits using CCS and the concurrency workbench

Hemangee K. Kapoor; Mark B. Josephs

The modelling of delay-insensitive asynchronous circuits in the process calculus CCS is addressed. MUST-testing (rather than bisimulation) is found to support verification both of the property of delay-insensitivity and of design by stepwise refinement. Automated verification is possible with a well-known tool, the Edinburgh Concurrency Workbench.


Microprocessors and Microsystems | 2014

Victim retention for reducing cache misses in tiled chip multiprocessors

Shirshendu Das; Hemangee K. Kapoor

This paper presents CMP-VR (Chip-Multiprocessor with Victim Retention), an approach to improve cache performance by reducing the number of off-chip memory accesses. The objective of this approach is to retain the chosen victim cache blocks on the chip for as long as possible. Some sets of the CMP's last-level cache (LLC) may be heavily used while others are not. In CMP-VR, some number of ways from every set are used as reserved storage. This allows a victim block from a heavily used set to be stored in the reserve space of another set. In this way the load of heavily used sets is distributed among the underused sets, logically increasing the associativity of the heavily used sets without increasing the actual associativity or size of the cache. Experimental evaluation using full-system simulation shows that CMP-VR has a lower off-chip miss rate than the baseline tiled CMP. Results are presented for different cache sizes and associativities for CMP-VR and the baseline configuration. The best improvements obtained are 45.5% and 14% in terms of miss rate and cycles per instruction (CPI), respectively, for a 4 MB, 4-way set-associative LLC. Reduction in CPI and miss rate together guarantees performance improvement.
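The retention idea can be illustrated with a toy sketch (all names, parameters, and the eviction policy below are hypothetical simplifications, not the paper's actual CMP-VR mechanism): each set keeps a few reserved ways, and a victim evicted from a busy set is parked in another set's reserve before being dropped off-chip.

```python
from collections import OrderedDict

class VictimRetentionCache:
    """Toy sketch of victim retention: each set has normal ways plus
    reserved ways that may hold victims evicted from *other* sets."""

    def __init__(self, num_sets=4, normal_ways=2, reserve_ways=2):
        self.num_sets = num_sets
        self.normal = [OrderedDict() for _ in range(num_sets)]   # tag -> True, LRU order
        self.reserve = [OrderedDict() for _ in range(num_sets)]
        self.normal_ways = normal_ways
        self.reserve_ways = reserve_ways

    def _set_index(self, addr):
        return addr % self.num_sets

    def access(self, addr):
        """Return True on hit (normal ways or any reserve), False on miss."""
        s = self._set_index(addr)
        if addr in self.normal[s]:
            self.normal[s].move_to_end(addr)         # LRU update
            return True
        # a retained victim may live in any set's reserve ways
        for r in self.reserve:
            if addr in r:
                del r[addr]                          # promote back to its home set
                self._insert(addr)
                return True
        self._insert(addr)
        return False

    def _insert(self, addr):
        s = self._set_index(addr)
        self.normal[s][addr] = True
        if len(self.normal[s]) > self.normal_ways:
            victim, _ = self.normal[s].popitem(last=False)   # evict LRU block
            self._retain(victim)

    def _retain(self, victim):
        # park the victim in the least-loaded reserve space instead of
        # dropping it off-chip immediately
        target = min(self.reserve, key=len)
        if len(target) >= self.reserve_ways:
            target.popitem(last=False)               # reserve full: truly evict
        target[victim] = True
```

In this sketch, three addresses mapping to the same set overflow its two normal ways, yet the evicted block still hits because it was retained in another set's reserve.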


international symposium on electronic system design | 2013

Dynamic Associativity Management Using Fellow Sets

Shirshendu Das; Hemangee K. Kapoor

The memory accesses of today's applications are non-uniformly distributed across the cache sets; as a result, some sets of the cache are heavily used while others remain underutilized. This paper presents CMP-SVR, an approach to dynamically increase the associativity of heavily used sets without increasing the cache size. It divides the last-level cache (LLC) into two sections: normal storage (NT) and reserve storage (RT). Some number of ways (25% to 50%) from each set are reserved for RT and the remaining ways belong to NT. The sets are divided into groups called fellow-groups, and a set can use the reserve ways of its fellow sets to increase its associativity during execution. An additional tag-array (SA-TGS) for RT is used to make searching easier and less expensive. SA-TGS is almost like an N-way set-associative cache whose associativity depends on the number of reserve ways per set and the number of sets in a fellow-group. CMP-SVR has lower storage, area and energy overheads than other techniques proposed for dynamic associativity management. Full-system simulation shows an average improvement of 8% in cycles per instruction (CPI) and 28% in miss rate.
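The fellow-group arithmetic can be sketched as follows (a minimal illustration under assumed parameters; the grouping of consecutive set indices and the function names are hypothetical, not taken from the paper):

```python
def fellow_group(set_index, sets_per_group):
    """Sets in the same fellow-group may lend each other their reserved
    ways (hypothetical grouping: consecutive set indices)."""
    start = (set_index // sets_per_group) * sets_per_group
    return list(range(start, start + sets_per_group))

def effective_associativity(normal_ways, reserve_ways, sets_per_group):
    """Best case for a hot set: its own normal ways plus the reserve
    ways of every set in its fellow-group (including its own)."""
    return normal_ways + reserve_ways * sets_per_group
```

For example, with 8-way sets where 2 ways (25%) are reserved and fellow-groups of 4 sets, a hot set can grow from 6 usable ways up to 6 + 2 x 4 = 14.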


international conference on vlsi design | 2015

Exploration of Migration and Replacement Policies for Dynamic NUCA over Tiled CMPs

Shirshendu Das; Hemangee K. Kapoor

Multicore processors have proliferated across several domains, ranging from small-scale embedded systems to large data-centers, making tiled CMPs (TCMP) the essential next-generation scalable architecture. More processors need a proportionally larger cache to support concurrent applications. NUCA architectures help in managing the capacity and access time of such larger cache designs. Static NUCA (SNUCA) has a fixed address-mapping policy, whereas dynamic NUCA (DNUCA) allows blocks to relocate nearer to the processing cores at runtime. DNUCA architectures are well explored for systems with centralized cache banks and processors along the periphery. SNUCA is well understood and explored for tiled CMPs, whereas the same is not the case for DNUCA. Towards exploring various implementations of DNUCA for tiled CMPs, this paper presents an architecture along with migration and replacement policies for tiled CMPs. Experimental results show improvements with respect to TCMP-based SNUCA. We discuss the differences and challenges of TCMP-based DNUCA compared to the original DNUCA, and present results for different configurations of migration and replacement policies. Simulation results show improvements over TCMP-based SNUCA of 44.7% and 9.6% in terms of miss rate and cycles per instruction (CPI), respectively.
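A single DNUCA-style migration step can be sketched like this (a toy model under assumed conventions: row-major tile ids on a 2D mesh, moving one hop row-first; this is not the paper's actual migration policy):

```python
def migrate_toward(block_tile, core_tile, mesh_cols=4):
    """Move a cache block one hop toward the requesting core on a 2D
    mesh of tiles (tile ids are row-major)."""
    br, bc = divmod(block_tile, mesh_cols)   # block's (row, col)
    cr, cc = divmod(core_tile, mesh_cols)    # core's (row, col)
    if br != cr:
        br += 1 if cr > br else -1           # step in the row direction first
    elif bc != cc:
        bc += 1 if cc > bc else -1           # then in the column direction
    return br * mesh_cols + bc
```

Repeated application moves a frequently accessed block gradually into a bank adjacent to the core that uses it, which is the intuition behind dynamic placement.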


acm symposium on applied computing | 2015

Static energy reduction by performance linked cache capacity management in tiled CMPs

Hemangee K. Kapoor; Shirshendu Das; Shounak Chakraborty

With the rapid growth in semiconductor technology, modern processor chips have multiple processor cores with multi-level on-chip caches. Recent studies of chip power consumption indicate that a principal share of chip power is consumed by the on-chip caches, and this power can be divided into two major parts: dynamic power and static power. Dynamic power is consumed when the cache is accessed, while static power is generally referred to as the leakage power of the cache. Increased chip power consumption raises chip temperature, which in turn increases on-chip leakage power. In this paper we attempt to reduce static power consumption by intelligently powering off cache banks and mapping their requests to other active cache banks. We use a performance-based criterion for the shutdown decision, and the bank to be powered off is chosen based on usage statistics. The remapping of requests for a powered-off cache bank is done at the L2 controller, so the L1 caches are transparent to this approach. Thus, depending on the application's workload and data distribution, a controlled number of banks can be dynamically shut down, saving on leakage power dissipation. Experimental analysis shows a 43% reduction in static power and a 19% reduction in EDP.
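The two decisions described above (which bank to power off, and where to send its requests) can be sketched as follows; the candidate-selection and remapping rules here are hypothetical stand-ins for the paper's usage-statistics policy:

```python
def choose_bank_to_shutdown(bank_accesses, active):
    """Pick the least-used active bank as the shutdown candidate
    (toy usage-statistics criterion)."""
    return min(active, key=lambda b: bank_accesses[b])

def remap(bank, active, num_banks):
    """Redirect requests for a powered-off bank to an active bank.
    Hypothetical policy: nearest higher-numbered active bank, wrapping."""
    for step in range(1, num_banks):
        candidate = (bank + step) % num_banks
        if candidate in active:
            return candidate
    raise RuntimeError("no active bank left")
```

Because the redirect happens at the bank-selection step (the L2 controller in the paper), the L1 caches never observe which physical bank served the request.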


IEEE Transactions on Industrial Informatics | 2013

Formal Approach for DVS-Based Power Management for Multiple Server System in Presence of Server Failure and Repair

Lalit Chandnani; Hemangee K. Kapoor

The paper presents a DVS-based power management policy for multiprocessor systems. The aim is to optimize power consumption while keeping the job-loss probability as a system-wide constraint. Optimal values for the service rate are computed using an ideal setting where speed can change continuously. As real processors have discrete speed levels, we switch between two nearby speeds to achieve the optimal rate. We develop a formal model of such a system using the probabilistic model checker PRISM and prove properties satisfied by the system. We demonstrate the applicability of the policy on multiple servers and under both kinds of deadlines: DBS and DES. For a constraint value of 25%, the DBS model achieved power savings of 29.46% in theoretical, 8.75% in actual, and 7.23% in leakage power. The DES model achieved power savings of 30% in theoretical, 11.9% in actual, and 8.7% in leakage power. For robustness, a repair facility was used, whose number of repairmen can vary from one to the total number of servers.
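The two-speed switching idea has a simple closed form: to emulate an optimal continuous speed s between adjacent discrete levels s_lo and s_hi, run a fraction (s - s_lo) / (s_hi - s_lo) of the time at s_hi and the rest at s_lo, so the time-average equals s. A small sketch (the function name and level set are illustrative, not from the paper):

```python
def speed_split(s_opt, levels):
    """Return (s_lo, s_hi, frac_lo, frac_hi): the two adjacent discrete
    speeds bracketing s_opt and the time fractions at each, chosen so
    the average speed equals s_opt."""
    levels = sorted(levels)
    for lo, hi in zip(levels, levels[1:]):
        if lo <= s_opt <= hi:
            frac_hi = (s_opt - lo) / (hi - lo) if hi != lo else 0.0
            return lo, hi, 1.0 - frac_hi, frac_hi
    raise ValueError("optimal speed outside supported range")
```

For example, an optimal rate of 1.3 with levels {0.8, 1.0, 1.2, 1.4} splits time equally between 1.2 and 1.4.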


design automation conference | 2004

Decomposing specifications with concurrent outputs to resolve state coding conflicts in asynchronous logic synthesis

Hemangee K. Kapoor; Mark B. Josephs

Synthesis of asynchronous logic using the tool Petrify requires a state graph with a complete state coding. It is common for specifications to exhibit concurrent outputs, but Petrify is sometimes unable to resolve the state coding conflicts that arise as a result, and hence cannot synthesise a circuit. A pair of decomposition heuristics (expressed in the language of Delay-Insensitive Sequential Processes) is given that helps one to obtain a synthesisable specification. The second heuristic has been successfully applied to a set of nine benchmarks to obtain significant reductions both in area and in synthesis time, compared with synthesis performed on the original specifications.


The Journal of Supercomputing | 2013

Design and formal verification of a hierarchical cache coherence protocol for NoC based multiprocessors

Hemangee K. Kapoor; Praveen Kanakala; Malti Verma; Shirshendu Das

Advancement in semiconductor technology is making it possible to pack more and more processing cores on a single die, and scalable directory-based protocols are needed for maintaining cache coherence. Most of the currently available directory-based protocols are designed for mesh-based topologies and suffer from delay and scalability problems. A cluster-based coherence protocol is a better option than a flat directory-based protocol, but the problem of the mesh-based topology still exists. A tree-based topology, on the other hand, requires fewer hops than a mesh-based topology. In this paper we give a hierarchical cache coherence protocol based on a tree topology. We divide the processing cores into clusters, and each cluster shares a higher-level cache. At the next level we form clusters of caches connected to yet another higher-level cache. This is continued up to the top-level cache/memory. We give various architectural placements that can benefit from the protocol, a hop-count comparison, and memory overhead requirements. Finally, we formally verify the protocol using the Murϕ tool.
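The hop-count argument can be made concrete with a toy comparison (assumed conventions: row-major tile ids on a 2D mesh, and a balanced tree where messages climb to the lowest common ancestor and back down; this is not the paper's exact model):

```python
def mesh_hops(src, dst, cols):
    """Manhattan hop count between two tiles on a 2D mesh (row-major ids)."""
    sr, sc = divmod(src, cols)
    dr, dc = divmod(dst, cols)
    return abs(sr - dr) + abs(sc - dc)

def tree_hops(src_core, dst_core, fanout):
    """Hops between two leaves of a balanced tree: climb to the lowest
    common ancestor, then descend the same distance."""
    up = 0
    a, b = src_core, dst_core
    while a != b:
        a //= fanout
        b //= fanout
        up += 1
    return 2 * up
```

With 16 cores, opposite corners of a 4x4 mesh are 6 hops apart, while the same pair under a fanout-4 tree communicates in 4 hops via their common ancestor.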


software engineering and formal methods | 2006

Formal Modelling and Verification of an Asynchronous DLX Pipeline

Hemangee K. Kapoor

A five-stage pipeline of an asynchronous DLX processor is modelled and its control flow verified. The model is built using an asynchronous pipeline of latches separated by processing logic. We model two versions of this pipeline: one using four-phase semi-decoupled latch controllers and the other using the fully-decoupled protocol. All the processing units are modelled as processes in the PROMELA language of the Spin tool. The model is verified in Spin by means of assertions, LTL properties and progress labels. A useful observation obtained from the study is that, although the semi-decoupled protocol has the potential to hold a data item in every latch, in the presence of processing logic at most alternate stages can execute at a given time. The implication is that no pipeline stalls are necessary in the case of control and data hazards. In the fully-decoupled version, all stages could execute valid instructions at the same time. All the models were verified to be free from deadlock.


VDAT | 2013

Random-LRU: A Replacement Policy for Chip Multiprocessors

Shirshendu Das; Nagaraju Polavarapu; Prateek D. Halwe; Hemangee K. Kapoor

As the number of cores and the associativity of the last-level cache (LLC) on a chip multiprocessor increase, the role of replacement policies becomes more vital. Although the pure least recently used (LRU) policy has some issues, it is generally believed that variants of LRU perform better than other policies, and a lot of work has been proposed to improve the performance of LRU-based policies. However, it has been shown that true LRU imposes additional complexity and area overheads when implemented on highly associative LLCs. Most LRU-based work is motivated more by performance improvement than by reducing the area and hardware overhead of the true LRU scheme. In this paper we propose an LRU-based cache replacement policy, especially for the LLC, that improves the performance of LRU while reducing the area and hardware cost of pure LRU by more than half. We use a combination of random and LRU replacement for each cache set: instead of applying LRU to the entire set, we apply it only to some number of ways within the set. Experiments conducted on a full-system simulator show 36% and 11% improvements in miss rate and CPI, respectively.
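The hybrid idea can be sketched for a single set (a toy model under assumed details: recency is tracked only for a few ways, demoted blocks fall into a randomly replaced pool, and hits promote blocks back; the class name and promotion rule are hypothetical):

```python
import random

class RandomLRUSet:
    """One cache set with Random-LRU style replacement: LRU order is
    tracked only for lru_ways blocks; the rest are replaced at random."""

    def __init__(self, ways=8, lru_ways=2, rng=None):
        self.ways = ways
        self.lru_ways = lru_ways
        self.rng = rng or random.Random(0)       # seeded for reproducibility
        self.lru_part = []                       # most recent at the end
        self.rand_part = []                      # unordered, random eviction

    def access(self, tag):
        """Return True on hit, False on miss; update replacement state."""
        hit = tag in self.lru_part or tag in self.rand_part
        if tag in self.rand_part:
            self.rand_part.remove(tag)           # promote hot block
        if tag in self.lru_part:
            self.lru_part.remove(tag)
        self.lru_part.append(tag)                # MRU position
        if len(self.lru_part) > self.lru_ways:
            demoted = self.lru_part.pop(0)       # LRU of the tracked part
            self.rand_part.append(demoted)
            if len(self.rand_part) > self.ways - self.lru_ways:
                victim = self.rng.randrange(len(self.rand_part))
                self.rand_part.pop(victim)       # random eviction
        return hit
```

Only the small tracked portion needs recency bits, which is where the hardware saving over full LRU comes from in this sketch.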

Collaboration


Dive into Hemangee K. Kapoor's collaborations.

Top Co-Authors

Shirshendu Das, Indian Institute of Technology Guwahati

Shounak Chakraborty, Indian Institute of Technology Guwahati

Mark B. Josephs, London South Bank University

Sukarn Agarwal, Indian Institute of Technology Guwahati

Ka Lok Man, Xi'an Jiaotong-Liverpool University

Chi-Un Lei, University of Hong Kong

Arnab Sarkar, Indian Institute of Technology Guwahati

Dhantu Buragohain, Indian Institute of Technology Guwahati

Dipika Deb, Indian Institute of Technology Guwahati