Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Sourav Roy is active.

Publication


Featured research published by Sourav Roy.


high performance computing and communications | 2009

Dynamically Filtering Thread-Local Variables in Lazy-Lazy Hardware Transactional Memory

Sutirtha Sanyal; Sourav Roy; Adrian Cristal; Osman S. Unsal; Mateo Valero

Transactional Memory (TM) is an emerging technology which promises to make parallel programming easier. However, to be efficient, the underlying TM system should protect only truly shared data and leave thread-local data out of the transaction. This speeds up the commit phase of the transaction, which is a bottleneck for a lazily versioned HTM. This paper proposes a scheme, in the context of a lazy-lazy (lazy conflict detection and lazy data versioning) Hardware Transactional Memory (HTM) system, to dynamically identify variables which are local to a thread and exclude them from the commit set of the transaction. Our proposal covers sharing of both stack and heap, and filters out local accesses to both of them. We also propose, in the same scheme, to identify local variables for which versioning need not be maintained. For evaluation, we have implemented a lazy-lazy model of HTM in line with the conventional and the scalable versions of TCC in a full-system simulator. For the operating system, we have modified the Linux kernel. We obtained an average speed-up of 31% for the conventional TCC on applications from the STAMP benchmark suite. For the scalable TCC we obtained an average speed-up of 16%. We also found that, on average, 99% of the local variables can be safely omitted when recording their old values to handle aborts.
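The local-variable filtering idea above can be illustrated with a small software sketch. This is an illustrative model only, not the paper's hardware mechanism: the ownership-tracking scheme below is an assumption. An address is treated as thread-local until a second thread touches it, and only truly shared writes survive into the commit set.

```python
# Illustrative sketch (not the paper's hardware scheme): exclude thread-local
# addresses from a transaction's commit set. An address counts as thread-local
# while only its first-touching thread has ever accessed it.

class CommitSetFilter:
    def __init__(self):
        self.owner = {}  # address -> first thread to touch it, or "SHARED"

    def record_access(self, addr, thread_id):
        prev = self.owner.get(addr)
        if prev is None:
            self.owner[addr] = thread_id          # first touch: tentatively local
        elif prev != thread_id:
            self.owner[addr] = "SHARED"           # second thread: truly shared

    def commit_set(self, write_set, thread_id):
        # Only truly shared writes must be arbitrated/broadcast at commit time;
        # thread-local writes can be committed without entering the commit set.
        return [a for a in write_set if self.owner.get(a) == "SHARED"]

f = CommitSetFilter()
f.record_access(0x100, 0)      # stack slot touched only by thread 0
f.record_access(0x200, 0)      # heap word later touched by thread 1
f.record_access(0x200, 1)
print(f.commit_set([0x100, 0x200], 0))  # -> [512] (0x200 is the only shared address)
```

Shrinking the commit set this way is what shortens the commit phase in a lazily versioned HTM, since only the shared subset needs to be published.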


international parallel and distributed processing symposium | 2009

Clock gate on abort: Towards energy-efficient hardware Transactional Memory

Sutirtha Sanyal; Sourav Roy; Adrián Cristal; Osman S. Unsal; Mateo Valero

Transactional Memory (TM) is an emerging technology which promises to make parallel programming easier compared to earlier lock-based approaches. However, as with any form of speculation, Transactional Memory too wastes a considerable amount of energy when the speculation goes wrong and the transaction aborts. For Transactional Memory this wastage will typically be quite high, because programmers will often mark a large portion of the code to be executed transactionally [4].


international conference on vlsi design | 2009

H-NMRU: A Low Area, High Performance Cache Replacement Policy for Embedded Processors

Sourav Roy

We propose a low-area, high-performance cache replacement policy for embedded processors called Hierarchical Non-Most-Recently-Used (H-NMRU). H-NMRU is a parameterizable policy with which we can trade off performance against area. We extended the Dinero cache simulator [1] with the H-NMRU policy and performed architectural exploration with a set of cellular and multimedia benchmarks. On a 16-way cache, a two-level H-NMRU policy, where the first and second levels have 8 and 2 branches respectively, performs as well as the Pseudo-LRU (PLRU) policy with a storage area saving of 27%. Compared to true LRU, H-NMRU on a 16-way cache saves a huge amount of area (82%) with a marginal increase in cache misses (3%). Similar results were also observed on other cache-like structures such as branch target buffers. Therefore, the two-level H-NMRU cache replacement policy (with associativity/2 and 2 branches on the two levels) is a very attractive option for caches on embedded processors with associativities greater than 4.
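The two-level victim selection described above can be sketched in software. This is a toy model; the exact hardware encoding is assumed, not taken from the paper. Each set tracks the most-recently-used group and, per group, the most-recently-used way; a victim is any way outside that MRU path.

```python
import random

# Toy two-level H-NMRU victim picker for a 16-way set (8 groups x 2 ways),
# sketched from the abstract; the state encoding is an assumption.

class HNMRUSet:
    def __init__(self, groups=8, ways_per_group=2):
        self.groups = groups
        self.ways_per_group = ways_per_group
        self.mru_group = 0                      # level-1 NMRU state
        self.mru_way = [0] * groups             # level-2 NMRU state, one per group

    def touch(self, way):
        g, w = divmod(way, self.ways_per_group)
        self.mru_group, self.mru_way[g] = g, w  # mark the accessed path as MRU

    def victim(self):
        # Pick any group except the MRU group, then any way except that
        # group's MRU way (with 2 ways per group this is deterministic).
        g = random.choice([i for i in range(self.groups) if i != self.mru_group])
        w = (self.mru_way[g] + 1) % self.ways_per_group
        return g * self.ways_per_group + w

s = HNMRUSet()
s.touch(5)                       # way 5 lives in group 2, way 1
assert s.victim() not in (4, 5)  # the victim never comes from the MRU group
```

With these parameters a set needs a 3-bit MRU-group pointer plus one MRU bit per group, i.e. 11 bits of state, versus 15 bits for a 16-way PLRU tree, which lines up with the roughly 27% storage saving quoted above.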


architectures for networking and communications systems | 2013

Asymmetric scaling on network packet processors in the dark silicon era

Sourav Roy; Xiaomin Lu; Edmund Gieske; Peng Yang; Jim Holt

This paper introduces a new architectural technique called asymmetric scaling on heterogeneous multi-core network processor architectures to mitigate the problem of dark silicon in future process technologies. In asymmetric scaling, the number of low power cores is increased at a higher rate than the number of high performance cores over process generations. Using an analytical model we show that coupled with fixed voltage-frequency scaling, asymmetric scaling can maintain the power density of the chip at the same level for several process generations, while increasing computational capabilities according to Dennardian scaling. Asymmetric scaling aligns nicely with the application characteristics on a network packet processor. To illustrate the concept, we discuss the Layerscape network processor architecture that incorporates a general purpose layer of high performance cores with an accelerated packet processing layer of low power cores. We discuss several techniques that can be applied to reduce the power density of low power cores. Using a representative packet forwarding workload, we show that shallow-pipeline, dual-issue, in-order cores with appropriate hardware acceleration and limited on-chip memory are a good choice for the low power processor layer.
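The asymmetric-scaling argument can be made concrete with a toy model. All numbers below are invented for illustration and are not taken from the paper's analytical model: with fixed voltage-frequency scaling, per-core power falls only with capacitance, roughly 1/s per generation for a linear shrink s, so a fixed chip power budget can fund a low-power-core count that grows faster than a high-performance-core count held constant here.

```python
# Toy model of asymmetric scaling (assumed numbers, not the paper's model).
S = 1.4                         # linear shrink per process generation
P_BIG, P_LITTLE = 4.0, 1.0      # per-core power at generation 0, in watts (assumed)
BUDGET = 32.0                   # fixed chip power budget in watts (assumed)

def generation(n):
    n_big = 4                                         # big-core count held constant
    p_big, p_little = P_BIG / S**n, P_LITTLE / S**n   # fixed V-f: power ~ C ~ 1/s
    # Spend the remaining power budget on little cores: their count grows
    # generation over generation while total chip power stays roughly flat.
    n_little = int((BUDGET - n_big * p_big) / p_little)
    power = n_big * p_big + n_little * p_little
    return n_big, n_little, round(power, 1)

for n in range(4):
    big, little, power = generation(n)
    print(f"gen {n}: {big} big + {little} little cores, chip power {power} W")
```

In this sketch the little-core count climbs every generation (16, 28, 46, 71 with these numbers) while the big-core count and the chip power stay flat, which is the qualitative behavior the paper attributes to asymmetric scaling.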


international symposium on vlsi design, automation and test | 2008

Estimation of energy consumed by software in processor caches

Lokesh Chandra; Sourav Roy

We present a comprehensive high-level estimation framework for the power consumed by software in processor caches. We demonstrate the framework on two types of caches commonly used in a modern-day processor, the ARM1136, viz., the L1 Data Cache and the L2 Unified Cache. The major contribution of this paper is a linear energy model for the caches. The energy characterization starts with the recognition of the different types of operations in the caches. Further, the energy of each operation is divided into a sequential and a non-sequential part, depending on whether the operation is stand-alone or happens in a burst with other operations. There is also an idle energy component of the cache, since the cache may be inactive for a considerable amount of time. The average error magnitude of the energy model, when applied to the ARM1136 L1 Data Cache and L2 Unified Cache with a large suite of benchmarks, is 1.7%, whereas the maximum error is less than 4.0%.
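The linear energy model described above can be sketched as a sum of per-operation sequential and non-sequential terms plus an idle term. The coefficients below are invented placeholders, not the paper's characterized values.

```python
# Sketch of the linear cache energy model from the abstract: every operation
# type has a sequential and a non-sequential energy cost, plus an idle term.
# All coefficients are invented for illustration.

COEFF = {                      # nJ per operation (assumed values)
    "read":  {"seq": 0.20, "nonseq": 0.35},
    "write": {"seq": 0.25, "nonseq": 0.40},
}
E_IDLE = 0.05                  # nJ per idle cycle (assumed)

def cache_energy(counts, idle_cycles):
    # counts: {op: {"seq": n, "nonseq": n}}, gathered from an execution trace
    e = sum(COEFF[op][kind] * n
            for op, kinds in counts.items()
            for kind, n in kinds.items())
    return e + E_IDLE * idle_cycles

trace = {"read": {"seq": 100, "nonseq": 40}, "write": {"seq": 30, "nonseq": 10}}
print(round(cache_energy(trace, idle_cycles=500), 2))  # -> 70.5 (nJ)
```

Because the model is linear in the operation counts, calibrating it reduces to fitting one coefficient pair per operation type, which is what makes this kind of high-level estimation fast enough for software-level exploration.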


international conference on vlsi design | 2007

JouleQuest: An Accurate Power Model for the StarCore DSP Platform

Ashish Mathur; Sourav Roy; Rajat Bhatia; Arup Chakraborty; Vijay Bhargava; Jatin Bhartia

This paper describes the design, validation and integration of JouleQuest, a comprehensive power estimation framework for the StarCore DSP platform. The goal of this work is to provide a power model, coupled to a fast platform simulator, that can accurately predict the power variability on the platform. The power consumption model for the DSP core is an instruction-level model, while the power models for the peripheral components are based on the functional operations executed by the blocks. The DSP core instruction-level power model is a generic model applicable to any VLIW DSP core, and improves on existing VLIW models by reducing the model complexity to O(n), where n is the number of processor instructions. The peripheral power models are based on a novel technique of characterizing the power of operations and operation sequences executed on the peripheral block. This allows accurate power modeling for complex peripheral blocks, such as cache controllers and arbiters, that have significant parallel operation execution. The platform power model has been verified to give an average error of approximately 5% across a large suite of DSP and control benchmarks with high power variability. The paper concludes with a case study demonstrating the use of the power simulator for energy optimization of the ITU-T G.729 speech-codec software.
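Classic instruction-level power models charge a base cost per instruction plus a pairwise inter-instruction overhead, which needs an O(n^2) table for n opcodes. The abstract does not detail how JouleQuest reaches O(n), so the sketch below substitutes one plausible simplification as an assumption: a single overhead charged whenever consecutive instructions differ, with invented per-instruction costs.

```python
# Instruction-level power model sketch (assumed simplification, not the
# paper's actual method): one base energy cost per opcode, plus a single
# switching overhead whenever consecutive instructions differ, so the model
# needs only O(n) characterized values instead of an O(n^2) pairwise table.

BASE = {"mac": 1.8, "alu": 1.0, "load": 1.5, "store": 1.4}   # nJ, invented
SWITCH = 0.3      # extra nJ whenever consecutive opcodes differ (invented)

def program_energy(trace):
    e, prev = 0.0, None
    for op in trace:
        e += BASE[op]
        if prev is not None and op != prev:
            e += SWITCH                       # inter-instruction overhead
        prev = op
    return e

print(round(program_energy(["alu", "alu", "mac", "load"]), 2))  # -> 5.9 (nJ)
```

A model like this is characterized by measuring each opcode in a tight loop (the base cost) plus one alternating-pair measurement (the switch cost), which keeps the characterization effort linear in the instruction set size.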


International Journal of Parallel Programming | 2010

H-NMRU: An Efficient Cache Replacement Policy with Low Area

Sourav Roy

In present-day multi-core devices, the individual processors do not need to operate at the highest possible frequencies. Instead, there is a need to reduce the power, complexity and area of individual processor components such as caches. In this paper we propose a low-area, high-performance cache replacement policy for embedded processors called Hierarchical Non-Most-Recently-Used (H-NMRU). H-NMRU is a parameterizable policy with which we can trade off performance against area. We extended the Dinero cache simulator with the H-NMRU policy and performed architectural exploration with a set of cellular and multimedia benchmarks. On a 16-way cache, a two-level H-NMRU policy, where the first and second levels have 8 and 2 branches respectively, performs as well as the Pseudo-LRU policy with a storage area saving of 27%. Compared to true LRU, H-NMRU on a 16-way cache saves a huge amount of area (82%) with a marginal increase in cache misses (3%). Similar results were also observed on other cache-like structures such as branch target buffers. Therefore, the two-level H-NMRU cache replacement policy (with associativity/2 and 2 branches on the two levels) is a very attractive option for caches on embedded processors with associativities greater than 4. We present a case study where it is used on the L2 cache, with substantial gains in performance and area for single- and dual-core platforms.


high performance computing and communications | 2010

QuickTM: A Hardware Solution to a High Performance Unbounded Transactional Memory

Sutirtha Sanyal; Sourav Roy

Transactional Memory (TM) is an emerging technology which simplifies concurrency control in a parallel program. In this paper we propose QuickTM, a new hardware transactional memory (HTM) architecture. It incorporates three features to address known bottlenecks in existing HTM architectures. First, we propose hardware-only dynamic detection of truly shared variables. Our results show that truly shared variables account for only about 20% of the commit set of any transaction; the rest can be completely disregarded during the commit phase. This shortens every commit phase drastically, resulting in a significant overall speed-up. Second, we keep both the speculative and the last committed version local to each processor. This is beneficial when a transaction is repeated in a loop: the processor's request is satisfied from the L1 data cache (L1D) itself. Furthermore, since both versions are locally maintained, the commit action involves only a broadcast of addresses. Third, we propose a mechanism to handle overflow in transactions. In our proposal, each processor continues to run transactions even if one processor has overflowed its L1D. Our technique eliminates the stall of a thread even if it conflicts with the overflowed transaction. The overflowed transaction commits in place and periodically broadcasts its write-set addresses, termed a “partial commit”. This gradually reduces conflicts and allows other threads to progress towards commit. Moreover, the technique does not require any additional hardware at any memory hierarchy level beyond L1. QuickTM outperforms the state-of-the-art scalable HTM architecture, Scalable-TCC, on average by 20% on the latest TM benchmark suite, STAMP. It outperforms the original TCC proposal with serialized commit by 28% on average. The maximum speed-ups achieved in these two cases are 43% and 67%, respectively. Our proposal handles transaction overflow gracefully and outperforms the current overflow-aware HTM proposal, OneTM-concurrent, by 12% on average.


microprocessor test and verification | 2015

Leveraging Virtual Prototype Models for Hardware Verification of an Accelerated Network Packet Processing Engine

Sourav Roy; Nikhil Jain; Sandeep Jain; Robert Page

This paper describes the co-simulation methodology adopted for hardware verification of a next-generation network packet processing engine (Advanced I/O Processor, or AIOP), utilizing virtual prototype models originally developed for software verification. Though co-simulation strategies are common in the verification of stand-alone processors, they have seldom been used for mega-modules and SoCs such as the AIOP, which consist of a large number of cores and accelerators. The co-simulation platform containing the AIOP functional model is used as a dynamic scoreboard in the top-level Universal Verification Methodology (UVM) test-bench. Since functional models are untimed or loosely timed, the primary challenge is to maintain synchronization between the design-under-test (DUT) and the functional model. This paper describes in detail the synchronization challenges encountered while running multicore software and how they were solved with minimal sacrifice of verification quality. Using this methodology, we unearthed more than 15 critical bugs in the DUT, as well as a large number of issues in the software libraries and functional models.


great lakes symposium on vlsi | 2014

High level energy modeling of controller logic in data caches

Preeti Ranjan Panda; Sourav Roy; Srikanth Chandrasekaran; Namita Sharma; Jasleen Kaur; Sarath Kumar Kandalam; N Nagaraj

In modern embedded processor caches, a significant amount of energy dissipation occurs in the controller logic part of the cache, whereas previous power/energy modeling tools have focused on the core memory part. We propose energy models for two of these controller modules -- the Write Buffer and the Replacement logic. Since this hardware is generally synthesized by designers, our power models are likewise based on empirical data. We found a linear dependence of the per-access write buffer energy on the write buffer depth and write width. We validated our models on several different benchmark examples, using different technology nodes. Our models generate energy estimates that are within 4.2% of those measured by detailed power simulations, making them a valuable mechanism for rapid energy estimates during architecture exploration.
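The reported linear dependence of per-access write-buffer energy on buffer depth and write width can be written as E = a + b*depth + c*width. A minimal sketch with invented coefficients follows; none of these values come from the paper.

```python
# Sketch of the linear per-access write-buffer energy model: E = a + b*depth
# + c*width. Coefficients are invented for illustration only.

A, B_DEPTH, C_WIDTH = 0.10, 0.015, 0.008   # pJ per access (assumed)

def wb_energy_per_access(depth, width_bits):
    return A + B_DEPTH * depth + C_WIDTH * width_bits

# Signature of a linear model: equal depth steps give equal energy increments,
# independent of the starting depth and of the write width.
step1 = wb_energy_per_access(8, 32) - wb_energy_per_access(4, 32)
step2 = wb_energy_per_access(12, 64) - wb_energy_per_access(8, 64)
assert abs(step1 - step2) < 1e-12
```

In practice the three coefficients of such a model would be obtained by least-squares fitting against detailed power simulations over a grid of depths and widths, which is presumably why the validation error stays small across technology nodes.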

Collaboration


Dive into Sourav Roy's collaborations.

Top Co-Authors

Sutirtha Sanyal (Barcelona Supercomputing Center)
Rajat Bhatia (Freescale Semiconductor)
Mateo Valero (Polytechnic University of Catalonia)
Osman S. Unsal (Barcelona Supercomputing Center)
Jim Holt (Freescale Semiconductor)
Nikhil Jain (Freescale Semiconductor)