S Gurunarayanan
Birla Institute of Technology and Science
Publications
Featured research published by S Gurunarayanan.
The Journal of Supercomputing | 2015
Nitin Chaturvedi; Arun Subramaniyan; S Gurunarayanan
Most of today’s chip multiprocessors implement last-level shared caches as non-uniform cache architectures. A major problem faced by such multicore architectures is cache line placement, especially in scenarios where multiple cores compete for line usage in the single non-uniform shared L2 cache. Block migration has been suggested to overcome the problem of optimum placement of cache blocks. Previous research, however, shows that an uncontrolled block migration scheme leads to scenarios where a cache line ‘ping-pongs’ between two requesting cores, resulting in higher access latency for both requestors and greater power dissipation. To address this problem, this paper first proposes a mechanism to dynamically profile data block usage from different cores on the chip. We then propose an adaptive migration–replication (AMR) scheme for shared last-level non-uniform cache architectures that adapts between selectively replicating frequently used cache lines near the requesting cores and migrating a cache line towards the requesting core when requests are few. AMR eliminates ‘ping-ponging’ of cache lines between the banks of the requesting cores. However, any mechanism that dynamically adapts between migration and replication at runtime is bound to have a complex search scheme to locate data blocks. To simplify the data lookup policy, this work also presents an efficient data access mechanism for non-uniform cache architectures. Our proposal relies on low-overhead and highly accurate in-hardware pointers to keep track of the on-chip location of each cache block. We show that our proposed scheme reduces completion time by 12.25, 8.1 and 3 % on average, and energy consumption by 11.65, 8.5 and 2.1 %, when compared to the state-of-the-art last-level cache management schemes S-NUCA, D-NUCA and HK-NUCA, respectively. SPEC and PARSEC benchmarks were used to thoroughly evaluate our proposal.
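The core decision in the abstract above can be illustrated with a minimal sketch. The per-block counter structure, the threshold value, and the two-hot-core rule are illustrative assumptions, not the authors' actual hardware design:

```python
from collections import defaultdict

REPLICATE_THRESHOLD = 4  # assumed: accesses per core before a line counts as "hot"


class CacheLineProfile:
    """Hypothetical per-line usage profile in the spirit of the AMR idea."""

    def __init__(self):
        self.requests = defaultdict(int)  # core id -> access count

    def record_access(self, core_id):
        self.requests[core_id] += 1

    def decide(self):
        """Choose an action for this line based on profiled usage."""
        hot_cores = [c for c, n in self.requests.items() if n >= REPLICATE_THRESHOLD]
        if len(hot_cores) >= 2:
            # Multiple heavy requestors: replicate near each to stop the
            # line 'ping-ponging' between their banks.
            return ("replicate", sorted(hot_cores))
        if self.requests:
            # Few requests: migrate the single copy toward the busiest core.
            busiest = max(self.requests, key=self.requests.get)
            return ("migrate", [busiest])
        return ("stay", [])
```

A line accessed heavily by two cores is replicated; a line touched by a single core simply migrates toward it, which is the trade-off the abstract describes.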
2013 International Conference on Advanced Electronic Systems (ICAES) | 2013
Nitin Chaturvedi; Pawan Sharma; S Gurunarayanan
Next-generation multicore processors and their applications will process massive amounts of data with significant sharing. Data movement between cores and the shared cache hierarchy, and its management, impacts memory access latency and consumes power. The efficiency of high-performance shared-memory multicore processors depends on the design of the on-chip cache hierarchy and the coherence protocol. Current multicore cache hierarchies use a fixed cache block size in the cache organization and in the design of the coherence protocol. The fixed block size is chosen to match the average spatial-locality requirements across a range of applications, but it also wastes bandwidth through unnecessary coherence traffic for shared data. This additional bandwidth has a direct impact on overall energy consumption. In this paper, we present a new adaptable and implementable cache design, with a novel cache coherence protocol, that eliminates unnecessary coherence traffic and matches data movement to an application's spatial locality.
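The idea of matching data movement to spatial locality can be sketched at a high level. The block/word granularity and the "refetch only previously used words" policy below are illustrative assumptions, not the paper's actual protocol:

```python
WORDS_PER_BLOCK = 8  # assumed words per cache block


class SpatialLocalityTracker:
    """Hypothetical tracker of which words inside a block were actually used."""

    def __init__(self):
        self.used = {}  # block address -> set of word offsets touched

    def touch(self, block, word):
        self.used.setdefault(block, set()).add(word)

    def fetch_mask(self, block):
        """On a refetch, bring in only the words the block used last time;
        fall back to the whole block the first time it is seen."""
        words = self.used.get(block)
        if not words:
            return set(range(WORDS_PER_BLOCK))
        return set(words)
```

Transferring only the previously used words is one way the coherence traffic for shared data could shrink to match an application's actual locality.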
International Journal of Computer Applications | 2010
Nitin Chaturvedi; Jithin Thomas; S Gurunarayanan
This paper proposes a novel, efficient Non-Uniform Cache Architecture (NUCA) scheme for the Last-Level Cache (LLC) to reduce average on-chip access latency and improve core isolation in Chip Multiprocessors (CMP). The proposed architecture is expected to improve upon the various NUCA schemes proposed so far, such as S-NUCA, D-NUCA and SP-NUCA [9][10][5], in terms of average access latency without a significant reduction in hit rate. The complete set of L2 banks is divided into zones. Each core belongs to the zone closest to it; consequently, adjacent cores are grouped into the same zone. Each zone individually follows the SP-NUCA scheme [5] for maintaining core isolation and sharing common blocks. However, blocks that need to be shared by cores belonging to different zones are replicated. This scheme is much more scalable than SP-NUCA and bounds the maximum on-chip access latency to a lower value as the number of cores increases. This paper merely details the proposed scheme; the claims made regarding its benefits shall be substantiated through simulations and a detailed comparative study in the future. The intended simulation methodology and architectural framework are also described.
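The zone partitioning above can be modeled in a few lines. The cores-per-zone grouping is an illustrative assumption:

```python
CORES_PER_ZONE = 2  # assumed grouping of adjacent cores into one zone of L2 banks


def zone_of(core_id):
    """Map a core to the zone of L2 banks closest to it (adjacent cores share a zone)."""
    return core_id // CORES_PER_ZONE


def needs_replication(requesting_cores):
    """A block shared by cores in different zones is replicated, one copy per zone."""
    zones = {zone_of(c) for c in requesting_cores}
    return len(zones) > 1, zones
```

Sharing within a zone is handled by the zone's own SP-NUCA policy; only cross-zone sharing triggers replication, which is what bounds the worst-case access distance as core counts grow.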
international conference on computer communications | 2017
Divya Suneja; Nitin Chaturvedi; S Gurunarayanan
As technology has scaled into the nanometer regime, operating voltages and device dimensions have shrunk accordingly. Reducing them can greatly boost energy efficiency, but it also increases design challenges. To deal with the limitations imposed by low overdrive voltage and the intrinsic read-stability/write-margin trade-off, large-scale SRAM arrays rely heavily on assist techniques. These techniques preserve the functionality of the 6T SRAM cell by improving its read and write margins. In this paper, we present a comprehensive analysis of the effectiveness of several assist methods. The paper presents a margin sensitivity analysis of assist techniques to assess how effective each method is and to investigate its direct impact on voltage-sensitive yield. In addition, the effects of temperature and process variation have also been analyzed.
international conference on computer communications | 2017
Suvi Jain; Nitin Chaturvedi; S Gurunarayanan
Using FinFETs to design SRAM cells offers significant advantages over planar bulk devices, owing to the additional gate control and fully depleted behavior. Improvements have been noted in subthreshold slope, drive current, short-channel effects and mismatch. As memories become denser, the stability of SRAM cells becomes a major concern, which calls for assist circuitry to improve the reliability and stability of the cells. In this work, a write assist technique is discussed to improve device stability. The design drastically decreases WLCRIT and reduces the write delay of the cell. Simulations were carried out in HSPICE with 32 nm FinFET PTM libraries.
international conference on computing analytics and security trends | 2016
Nikunj Bhimsaria; Nitin Chaturvedi; S Gurunarayanan
This paper investigates the application of Spin-Transfer Torque Magnetic Tunnel Junctions (STT-MTJ) in nonvolatile memory design. MTJs are favored in NVM design as they provide indefinite data retention and very high read/write speeds. In this work, we present the design and analysis of a non-volatile, low-power Muller C-element with almost-zero leakage current and near-instantaneous back-up and wake-up times. The simulation results of the C-element built with CMOS FD-SOI transistors and Spin-Transfer Torque MTJs are compared with those of a design in which FinFETs are used instead of the FD-SOI transistors. The two implementations are compared on the basis of idle power consumption, energy required for read and write operations, and output delay, in addition to a scalability analysis of both technologies.
Microprocessors and Microsystems | 2015
Nitin Chaturvedi; S Gurunarayanan
Most of today's multi-core processors feature a shared last-level L2 cache. A major problem faced by such multi-core architectures is cache contention, where multiple cores compete for usage of the single shared L2 cache. Previous research shows that uncontrolled sharing leads to scenarios where one core evicts useful L2 cache content belonging to another core. To address this problem, the paper first presents a cache miss classification scheme, CII (Compulsory, Inter-processor and Intra-processor misses), for CMPs with shared caches, and compares it to the 3C miss classification for a traditional uniprocessor, to provide a better understanding of the interactions between memory references of different processors at the shared-cache level in a CMP. We then propose a novel approach, called block pinning, for eliminating inter-processor misses and reducing intra-processor misses in a shared cache. Furthermore, we show that an adaptive block pinning scheme improves on the benefits of both the block pinning and set pinning schemes by significantly reducing the number of off-chip accesses. This work also proposes two different schemes for relinquishing ownership of a block, to avoid domination of ownership by a few active cores in the multi-core system, which would otherwise degrade performance. Extensive analysis of these approaches with SPEC and PARSEC benchmarks is performed using a full-system simulator.
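The CII classification can be sketched with a simple shadow directory. Real hardware would use tags and sets; this flat model, an illustrative assumption rather than the paper's mechanism, just records which core caused each block's last eviction:

```python
class CIIClassifier:
    """Hypothetical classifier for misses in a shared CMP cache:
    compulsory (first reference), intra-processor (victim of the same core's
    own eviction) and inter-processor (evicted by a different core)."""

    def __init__(self):
        self.ever_seen = set()  # blocks referenced at least once
        self.evicted_by = {}    # block -> core whose activity evicted it
        self.cached = set()     # blocks currently resident

    def access(self, core_id, block):
        if block in self.cached:
            return "hit"
        if block not in self.ever_seen:
            kind = "compulsory"
        elif self.evicted_by.get(block) == core_id:
            kind = "intra-processor"
        else:
            kind = "inter-processor"
        self.ever_seen.add(block)
        self.cached.add(block)
        return kind

    def evict(self, core_id, block):
        self.cached.discard(block)
        self.evicted_by[block] = core_id
```

Under this view, block pinning targets exactly the inter-processor category: a pinned block cannot be evicted by another core, so those misses disappear.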
international conference on computational intelligence and communication networks | 2013
Nitin Chaturvedi; S Gurunarayanan
With the advent of new technologies, exponentially increasing multi-core processor (CMP) cache sizes, accompanied by growing on-chip wire delays, make it difficult to implement traditional caches with a single, uniform access latency. Non-Uniform Cache Architecture (NUCA) designs have been proposed to address this issue. A NUCA partitions the complete cache memory into multiple smaller banks and allows banks near the processor cores to have lower access latencies than those further away, thus reducing the effect of the cache's internal wire delays. Traditionally, NUCA organizations have been classified as static (S-NUCA) and dynamic (D-NUCA). While S-NUCA maps a data block to a unique bank in the NUCA cache, D-NUCA allows a data block to be mapped to multiple banks. In D-NUCA designs, data blocks can migrate towards the processor cores that access them most frequently, and this migration increases network traffic. The short lifetime of data blocks and the low spatial locality of many applications result in the eviction of blocks with many unused words. This effectively increases the miss rate and wastes on-chip network bandwidth; unused word transfers also waste a large fraction of on-chip energy. In this paper, we present an efficient and implementable cache design that eliminates unnecessary coherence traffic and matches data movement to an application's spatial locality. It also presents one way to scale on-chip coherence with cost-effective techniques such as shared caches augmented to track cached copies, explicit eviction notification and hierarchical design. Based on our scalability analysis of this cache design, we predict that it consistently reduces miss rate and improves the fraction of transmitted data that is actually utilized by the application.
Procedia Computer Science | 2015
Nitin Chaturvedi; Arun Subramaniyan; S Gurunarayanan
International Journal of Distributed and Parallel systems | 2013
Nitin Chaturvedi; S Gurunarayanan