Mainak Chaudhuri
Indian Institute of Technology Kanpur
Publications
Featured research published by Mainak Chaudhuri.
International Symposium on Microarchitecture | 2009
Mainak Chaudhuri
Cache blocks often exhibit a small number of uses during their lifetime in the last-level cache. Past research has exploited this property in two different ways. First, replacement policies have been designed to evict dead blocks early and retain the potentially live blocks. Second, dynamic insertion policies attempt to victimize single-use blocks (dead on fill) as early as possible, thereby leaving most of the working set undisturbed in the cache. However, we observe that as the last-level cache grows in capacity and associativity, the traditional dead block prediction-based replacement policy loses effectiveness because often the LRU block itself is dead, leading to an LRU replacement decision. The benefit of dynamic insertion policies is also small in a large class of applications in which a significant number of cache blocks see a small number of uses, yet more than one. To address these drawbacks, we introduce pseudo-last-in-first-out (pseudo-LIFO), a fundamentally new family of replacement heuristics that manages each cache set as a fill stack (as opposed to the traditional access recency stack). We specify three members of this family, namely, dead block prediction LIFO, probabilistic escape LIFO, and probabilistic counter LIFO. The probabilistic escape LIFO (peLIFO) policy is the central contribution of this paper. It dynamically learns the use probabilities of cache blocks beyond each fill stack position to implement a new replacement policy. Our detailed simulation results show that peLIFO, while having less than 1% storage overhead, reduces the execution time by 10% on average compared to a baseline LRU replacement policy for a set of fourteen single-threaded applications on a 2 MB 16-way set associative L2 cache. It reduces the average CPI by 19% for a set of twelve multiprogrammed workloads while satisfying a strong fairness requirement on a four-core chip-multiprocessor with an 8 MB 16-way set associative shared L2 cache. Further, it reduces the parallel execution time by 17% on average for a set of six multi-threaded programs on an eight-core chip-multiprocessor with a 4 MB 16-way set associative shared L2 cache. For the architectures considered in this paper, the storage overhead of the peLIFO policy is one-fifth to half of that of a state-of-the-art dead block prediction-based replacement policy. However, the peLIFO policy delivers better average performance for the selected single-threaded and multiprogrammed workloads, and similar average performance for the multi-threaded workloads, compared to the dead block prediction-based replacement policy.
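The fill-stack idea lends itself to a compact illustration. The C++ sketch below shows victim selection driven by a learned escape depth in a fill stack; the structure names, the single-threshold learning shortcut, and the counter layout are hypothetical simplifications, not the paper's exact peLIFO algorithm.

```cpp
// Hypothetical sketch: victim selection on a fill stack with a learned
// escape depth (a crude stand-in for peLIFO's dynamically learned use
// probabilities; names and the threshold rule are illustrative only).
#include <array>
#include <cstdint>

constexpr int kWays = 16;

struct CacheSet {
    // fillPos[w]: current fill-stack depth of the block in way w
    // (0 = most recently filled, kWays-1 = oldest fill).
    std::array<uint8_t, kWays> fillPos{};
    // useBeyond[d]: observed hits to blocks that had already sunk past
    // fill-stack depth d; a proxy for the use probability beyond d.
    std::array<uint32_t, kWays> useBeyond{};
};

int selectVictim(const CacheSet& s, uint32_t threshold) {
    // Escape depth: shallowest depth beyond which further use is rare.
    int escapeDepth = 0;
    while (escapeDepth < kWays - 1 && s.useBeyond[escapeDepth] >= threshold)
        ++escapeDepth;
    // Among blocks at or past the escape depth, evict the one nearest
    // the top of the fill stack, leaving the deep, long-resident part
    // of the stack (the established working set) undisturbed.
    int victim = 0, shallowest = kWays;   // falls back to way 0
    for (int w = 0; w < kWays; ++w) {
        if (s.fillPos[w] >= escapeDepth && s.fillPos[w] < shallowest) {
            shallowest = s.fillPos[w];
            victim = w;
        }
    }
    return victim;
}
```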
High-Performance Computer Architecture | 2009
Mainak Chaudhuri
As the last-level on-chip caches in chip-multiprocessors increase in size, the physical locality of on-chip data becomes important for delivering high performance. The non-uniform access latency seen by a core to different independent banks of a large cache spread over the chip necessitates active mechanisms for improving data locality. The central proposal of this paper is a fully hardwired coarse-grain data migration mechanism that dynamically monitors the access patterns of the cores at the granularity of a page to reduce the book-keeping overhead, and decides when and where to migrate an entire page of data to amortize the performance overhead. The page-grain migration mechanism is compared against two variants of previously proposed cache block-grain dynamic migration mechanisms and two OS-assisted static locality management mechanisms. Our detailed execution-driven simulation of an eight-core chip-multiprocessor with a shared 16 MB L2 cache employing a bidirectional ring to connect the cores and the L2 cache banks shows that hardwired dynamic page migration, while using only 4.8% of extra storage out of the total L2 cache and book-keeping budget, delivers the best performance and energy-efficiency across a set of shared memory parallel applications selected from the SPLASH-2, SPEC OMP, DARPA DIS, and FFTW suites and multiprogrammed workloads prepared out of the SPEC 2000 and BioBench suites. It reduces execution time by 18.7% and 12.6% on average (geometric mean) for the shared memory applications and the multiprogrammed workloads, respectively, compared to a baseline architecture that distributes the pages round-robin across the L2 cache banks.
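As a concrete illustration of the monitoring side, here is a minimal C++ sketch of page-grain bookkeeping: per-page, per-core access counters that trigger a migration toward the dominant accessor's bank. The table structure, the single threshold, and the core-to-bank mapping are hypothetical; the paper's hardwired mechanism is considerably more refined.

```cpp
// Hypothetical sketch of page-grain migration bookkeeping.
#include <array>
#include <cstdint>
#include <unordered_map>

constexpr int kCores = 8;
constexpr uint32_t kMigrateThreshold = 64;

struct PageStats {
    std::array<uint32_t, kCores> accesses{};  // per-core access counts
    int homeBank = 0;                         // L2 bank currently holding the page
};

std::unordered_map<uint64_t, PageStats> pageTable;  // indexed by page number

// Called on every L2 access; returns the bank to migrate to, or -1.
int recordAccess(uint64_t pageNum, int core) {
    PageStats& ps = pageTable[pageNum];
    if (++ps.accesses[core] < kMigrateThreshold) return -1;
    // Find the dominant accessor; migrate if the page lives elsewhere.
    int best = 0;
    for (int c = 1; c < kCores; ++c)
        if (ps.accesses[c] > ps.accesses[best]) best = c;
    int targetBank = best;            // assume bank c is nearest to core c
    if (targetBank == ps.homeBank) return -1;
    ps.homeBank = targetBank;
    ps.accesses.fill(0);              // restart monitoring after migration
    return targetBank;
}
```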
International Symposium on Computer Architecture | 2011
Jayesh Gaur; Mainak Chaudhuri; Sreenivas Subramoney
Inclusive last-level caches (LLCs) waste precious silicon real estate due to cross-level replication of cache blocks. As the industry moves toward cache hierarchies with larger inner levels, this wasted cache space leads to bigger performance losses compared to exclusive LLCs. However, exclusive LLCs make the design of replacement policies more challenging. While in an inclusive LLC a block can gather a filtered access history, this is not possible in an exclusive design because the block is de-allocated from the LLC on a hit. As a result, the popular least-recently-used replacement policy and its approximations are rendered ineffective and proper choice of insertion ages of cache blocks becomes even more important in exclusive designs. On the other hand, it is not necessary to fill every block into an exclusive LLC. This is known as selective cache bypassing and is not possible to implement in an inclusive LLC because that would violate inclusion. This paper explores insertion and bypass algorithms for exclusive LLCs. Our detailed execution-driven simulation results show that a combination of our best insertion and bypass policies delivers an improvement of up to 61.2% and on average (geometric mean) 3.4% in terms of instructions retired per cycle (IPC) for 97 single-threaded dynamic instruction traces spanning selected SPEC 2006 and server applications, running on a 2 MB 16-way exclusive LLC compared to a baseline exclusive design in the presence of well-tuned multi-stream hardware prefetchers. The corresponding improvements in throughput for 35 4-way multi-programmed workloads running with an 8 MB 16-way shared exclusive LLC are 20.6% (maximum) and 2.5% (geometric mean).
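To make the bypass idea concrete, the following C++ sketch decides, for a block being evicted from the L2 cache, whether the exclusive LLC should allocate it. The two counters carried per block echo the trip-count/use-count state the paper tracks, but the fixed decision rule here is a hypothetical simplification of its dueling-based policy selection.

```cpp
// Hypothetical sketch of a bypass decision for an exclusive LLC.
#include <cstdint>

struct EvictedBlock {
    uint8_t tripCount;  // completed L2 <-> LLC round trips
    uint8_t useCount;   // hits observed while resident in the L2 cache
};

// Decide, for a block evicted from the L2, whether to allocate it in
// the exclusive LLC or bypass it straight to memory (if clean).
bool shouldBypass(const EvictedBlock& b, bool dirty) {
    if (dirty) return false;   // dirty data must be written somewhere
    // Blocks that already made a round trip without being reused in the
    // L2 are the best bypass candidates; reused blocks are kept.
    return b.tripCount >= 1 && b.useCount == 0;
}
```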
International Conference on Parallel Architectures and Compilation Techniques | 2012
Mainak Chaudhuri; Jayesh Gaur; Nithiyanandan Bashyam; Sreenivas Subramoney; Joseph Nuzman
The replacement policies for the last-level caches (LLCs) are usually designed based on the access information available locally at the LLC. These policies are inherently sub-optimal due to lack of information about the activities in the inner-levels of the hierarchy. This paper introduces cache hierarchy-aware replacement (CHAR) algorithms for inclusive LLCs (or L3 caches) and applies the same algorithms to implement efficient bypass techniques for exclusive LLCs in a three-level hierarchy. In a hierarchy with an inclusive LLC, these algorithms mine the L2 cache eviction stream and decide if a block evicted from the L2 cache should be made a victim candidate in the LLC based on the access pattern of the evicted block. Ours is the first proposal that explores the possibility of using a subset of L2 cache eviction hints to improve the replacement algorithms of an inclusive LLC. The CHAR algorithm classifies the blocks residing in the L2 cache based on their reuse patterns and dynamically estimates the reuse probability of each class of blocks to generate selective replacement hints to the LLC. Compared to the static re-reference interval prediction (SRRIP) policy, our proposal offers an average reduction of 10.9% in LLC misses and an average improvement of 3.8% in instructions retired per cycle (IPC) for twelve single-threaded applications. The corresponding reduction in LLC misses for one hundred 4-way multi-programmed workloads is 6.8%, leading to an average improvement of 3.9% in throughput. Finally, our proposal achieves an 11.1% reduction in LLC misses and a 4.2% reduction in parallel execution cycles for six 8-way threaded shared memory applications compared to the SRRIP policy. In a cache hierarchy with an exclusive LLC, our CHAR proposal offers an effective algorithm for selecting the subset of blocks (clean or dirty) evicted from the L2 cache that need not be written to the LLC and can be bypassed. Compared to the TC-AGE policy (analogue of SRRIP for exclusive LLC), our best exclusive LLC proposal improves average throughput by 3.2% while saving an average of 66.6% of data transactions from the L2 cache to the on-die interconnect for one hundred 4-way multi-programmed workloads. Compared to an inclusive LLC design with an identical hierarchy, this corresponds to an average throughput improvement of 8.2% with only 17% more data write transactions originating from the L2 cache.
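A minimal sketch of the hint-generation step is shown below in C++: L2-evicted blocks are binned by observed reuse, and a per-class reuse-probability estimate decides whether to send a victimization hint to the LLC. The three-class split and the 1-in-8 threshold are hypothetical stand-ins for the paper's classification and estimation machinery.

```cpp
// Hypothetical sketch of CHAR-style hint generation.
#include <array>
#include <cstdint>

enum BlockClass { DEAD_ON_FILL = 0, SINGLE_REUSE, MULTI_REUSE, NUM_CLASSES };

struct ClassStats {
    uint32_t evicted = 0;   // blocks of this class evicted from the L2
    uint32_t recalled = 0;  // of those, blocks later recalled into the L2
};

std::array<ClassStats, NUM_CLASSES> stats;

BlockClass classify(uint32_t l2Hits) {
    return (l2Hits == 0) ? DEAD_ON_FILL
         : (l2Hits == 1) ? SINGLE_REUSE : MULTI_REUSE;
}

// On an L2 eviction: record the event and tell the inclusive LLC whether
// the block should be demoted to a victim candidate.
bool victimizeHint(uint32_t l2Hits) {
    ClassStats& s = stats[classify(l2Hits)];
    ++s.evicted;
    // Hint victimization when fewer than 1 in 8 blocks of this class
    // ever return to the L2 (low estimated reuse probability).
    return s.evicted >= 32 && (s.recalled * 8) < s.evicted;
}

// On an LLC hit that recalls a block into the L2, credit its class
// (the class would be remembered alongside the block's LLC state).
void recordRecall(BlockClass c) { ++stats[c].recalled; }
```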
International Symposium on Computer Architecture | 2004
Mainak Chaudhuri; Mark Heinrich
We introduce the SMTp architecture: an SMT processor augmented with a coherence protocol thread context that, together with a standard integrated memory controller, can enable the design of (among other possibilities) scalable cache-coherent hardware distributed shared memory (DSM) machines from commodity nodes. We describe the minor changes needed to a conventional out-of-order multi-threaded core to realize SMTp, discussing issues related to both deadlock avoidance and performance. We then compare SMTp performance to that of various conventional DSM machines with normal SMT processors both with and without integrated memory controllers. On configurations from 1 to 32 nodes, with 1 to 4 application threads per node, we find that SMTp delivers performance comparable to, and sometimes better than, machines with more complex integrated DSM-specific memory controllers. Our results also show that the protocol thread has extremely low pipeline overhead. Given the simplicity and the flexibility of the SMTp mechanism, we argue that next-generation multi-threaded processors with integrated memory controllers should adopt this mechanism as a way of building less complex high-performance DSM multiprocessors.
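The flavor of the mechanism can be conveyed with a small C++ sketch: a spare thread context runs a handler loop that services coherence messages arriving at the integrated memory controller. The message format, directory actions, and software queue below are hypothetical stand-ins for SMTp's actual protocol handlers.

```cpp
// Hypothetical sketch of a protocol-thread handler loop.
#include <cstdint>
#include <queue>

enum MsgType { READ_REQ, WRITE_REQ, WRITEBACK };
struct ProtocolMsg { MsgType type; uint64_t addr; int requester; };

std::queue<ProtocolMsg> inbound;   // stands in for the controller's queue

void handle(const ProtocolMsg& m) {
    switch (m.type) {
        case READ_REQ:
            // Look up directory state, add requester as sharer, send data.
            break;
        case WRITE_REQ:
            // Invalidate sharers, grant exclusive ownership, send data.
            break;
        case WRITEBACK:
            // Write data to memory, clear the owner in the directory.
            break;
    }
}

// Runs forever on the spare SMT thread context, leaving the remaining
// contexts free for application threads.
void protocolThread() {
    for (;;) {
        if (inbound.empty()) continue;   // spin on the inbound queue
        handle(inbound.front());
        inbound.pop();
    }
}
```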
International Conference on Supercomputing | 2002
Daehyun Kim; Mainak Chaudhuri; Mark Heinrich
Active memory systems help processors overcome the memory wall when applications exhibit poor cache behavior. They consist of either active memory elements that perform data parallel computations in the memory system itself, or an active memory controller that supports address re-mapping techniques that improve data locality. Both active memory approaches create coherence problems, even on uniprocessor systems, since there are either additional processors operating on the data directly, or the processor is allowed to refer to the same data via more than one address. While most active memory implementations require cache flushes, we propose a new technique to solve the coherence problem by extending the coherence protocol. Our active memory controller leverages and extends the coherence mechanism, so that re-mapping techniques work transparently on both uniprocessor and multiprocessor systems. We present a microarchitecture for an active memory controller with a programmable core and specialized hardware that accelerates cache line assembly and disassembly. We present detailed simulation results that show uniprocessor speedup from 1.3 to 7.6 on a range of applications and microbenchmarks. In addition to uniprocessor speedup, we show single-node multiprocessor speedup for parallel active memory applications and discuss how the same controller architecture supports coherent multi-node systems called active memory clusters.
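As an example of what address re-mapping buys, the C++ sketch below assembles one cache line of a hypothetical "shadow" transposed view of a matrix: the processor issues unit-stride reads to shadow addresses while the gather happens at the controller. The layout and function names are illustrative; the paper implements line assembly in specialized controller hardware, not software.

```cpp
// Hypothetical sketch of shadow-space line assembly for matrix transpose.
#include <cstdint>

constexpr int N = 1024;         // square matrix dimension
constexpr int kLineWords = 8;   // 64-byte line of 8-byte words

// Physical matrix, laid out row-major.
double A[N][N];

// Assemble one "shadow" cache line: line l of the transposed view
// gathers a strided column of A, so the processor sees unit-stride
// accesses while the controller performs the gather.
void assembleShadowLine(uint64_t lineIndex, double* lineBuf) {
    uint64_t elem = lineIndex * kLineWords;   // first element of the line
    uint64_t row = elem / N, colBase = elem % N;
    for (int w = 0; w < kLineWords; ++w)
        lineBuf[w] = A[colBase + w][row];     // gather down a column
}
```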
IEEE Transactions on Computers | 2004
Daehyun Kim; Mainak Chaudhuri; Mark Heinrich; Evan Speight
We introduce an architectural approach to improve memory system performance in both uniprocessor and multiprocessor systems. The architectural innovation is a flexible active memory controller backed by specialized cache coherence protocols that permit the transparent use of address remapping techniques. The resulting system shows significant performance improvement across a spectrum of machine configurations, from uniprocessors through single-node multiprocessors (SMPs) to distributed shared memory clusters (DSMs). Address remapping techniques exploit the data access patterns of applications to enhance their cache performance. However, they create coherence problems since the processor is allowed to refer to the same data via more than one address. While most active memory implementations require cache flushes, we present a new approach to solve the coherence problem. We leverage and extend the cache coherence protocol so that our techniques work transparently to efficiently support uniprocessor, SMP and DSM active memory systems. We detail the coherence protocol extensions to support our active memory techniques and present simulation results that show uniprocessor speedup from 1.3 to 7.6 on a range of applications and microbenchmarks. We also show remarkable performance improvement on small to medium-scale SMP and DSM multiprocessors, allowing some parallel applications to continue to scale long after their performance levels off on normal systems.
International Conference on Parallel and Distributed Systems | 2012
Prabhakar Misra; Mainak Chaudhuri
Graphics processing units (GPUs) have emerged as a strong candidate for high-performance computing. While regular data-parallel computations with little or no synchronization are easy to map on the GPU architectures, it is a challenge to scale up computations on dynamically changing pointer-linked data structures. The traditional lock-based implementations are known to offer poor scalability due to high lock contention in the presence of thousands of active threads, which is common in GPU architectures. In this paper, we present a performance evaluation of concurrent lock-free implementations of four popular data structures on GPUs. We implement a set using lock-free linked list, hash table, skip list, and priority queue. On the first three data structures, we evaluate the performance of different mixes of addition, deletion, and search operations. The priority queue is designed to support retrieval and deletion of the minimum element and addition operations to the set. We evaluate the performance of these lock-free data structures on a Tesla C2070 Fermi GPU and compare it with the performance of multi-threaded lock-free implementations for CPU running on a 24-core Intel Xeon server. The linked list, hash table, skip list, and priority queue implementations achieve speedups of up to 7.4, 11.3, 30.7, and 30.8, respectively, on the GPU compared to the Xeon server.
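The core technique behind all four structures is CAS-based publication of updates. A minimal CPU analogue in C++ (using std::atomic) is sketched below for sorted-list insertion; the paper's implementations run as CUDA kernels and also support deletion, which requires Harris-style pointer marking omitted here.

```cpp
// CPU analogue of CAS-based lock-free sorted-list insertion
// (insertion-only; deletion would need Harris-style marking).
#include <atomic>

struct Node {
    int key;
    std::atomic<Node*> next;
    Node(int k, Node* n) : key(k), next(n) {}
};

std::atomic<Node*> head{nullptr};

// Insert key into a sorted list; returns false if already present.
bool insert(int key) {
    for (;;) {
        std::atomic<Node*>* prev = &head;
        Node* curr = prev->load();
        while (curr && curr->key < key) {   // find the insertion window
            prev = &curr->next;
            curr = prev->load();
        }
        if (curr && curr->key == key) return false;   // duplicate
        Node* node = new Node(key, curr);
        // Atomically splice in; retry from scratch if a racing insert
        // changed the window between our read and this CAS.
        if (prev->compare_exchange_weak(curr, node)) return true;
        delete node;
    }
}
```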
IEEE International Symposium on Workload Characterization | 2013
Ragavendra Natarajan; Mainak Chaudhuri
Recent years have seen a large volume of proposals on managing the shared last-level cache (LLC) of chip-multiprocessors (CMPs). However, most of these proposals primarily focus on reducing the amount of destructive interference between competing independent threads of multi-programmed workloads. While very few of these studies evaluate the proposed policies on shared memory multi-threaded applications, they do not improve constructive cross-thread sharing of data in the LLC. In this paper, we characterize a set of multi-threaded applications drawn from the PARSEC, SPEC OMP, and SPLASH-2 suites with the goal of introducing sharing-awareness in LLC replacement policies. We motivate our characterization study by quantifying the potential contributions of the shared and the private blocks toward the overall volume of the LLC hits in these applications and show that the shared blocks are more important than the private blocks. Next, we characterize the amount of sharing-awareness enjoyed by recent proposals compared to the optimal policy. We design and evaluate a generic oracle that can be used in conjunction with any existing policy to quantify the potential improvement that can come from introducing sharing-awareness. The oracle analysis shows that introducing sharing-awareness reduces the number of LLC misses incurred by the least-recently-used (LRU) policy by 6% and 10% on average for a 4 MB and an 8 MB LLC, respectively. A realistic implementation of this oracle requires the LLC controller to have the capability to accurately predict, at the time a block is filled into the LLC, whether the block will be shared during its residency in the LLC. We explore the feasibility of designing such a predictor based on the address of the fill and the program counter of the instruction that triggers the fill. Our sharing behavior predictability study of two history-based fill-time predictors that use block addresses and program counters concludes that achieving acceptable levels of accuracy with such predictors will require other architectural and/or high-level program semantic features that have strong correlations with active sharing phases of the LLC blocks.
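A minimal version of the kind of fill-time predictor the study evaluates is easy to state; the C++ sketch below indexes a table of saturating counters by a hash of the fill PC and trains on the observed sharing outcome at eviction. The table size, hash, and counter width are hypothetical, and, as the characterization concludes, such a predictor alone does not reach acceptable accuracy.

```cpp
// Hypothetical PC-indexed fill-time sharing predictor.
#include <array>
#include <cstdint>

constexpr int kEntries = 4096;
std::array<uint8_t, kEntries> table{};   // 2-bit saturating counters

inline uint32_t index(uint64_t fillPC) {
    return (fillPC ^ (fillPC >> 12)) & (kEntries - 1);
}

// At fill time: predict whether the block will be shared in the LLC.
bool predictShared(uint64_t fillPC) { return table[index(fillPC)] >= 2; }

// At eviction time: train with the block's observed sharing outcome.
void train(uint64_t fillPC, bool wasShared) {
    uint8_t& c = table[index(fillPC)];
    if (wasShared) { if (c < 3) ++c; }
    else           { if (c > 0) --c; }
}
```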
IEEE Transactions on Computers | 2003
Mainak Chaudhuri; Mark Heinrich; Chris Holt; Jaswinder Pal Singh; Edward Rothberg; John L. Hennessy
While the desire to use commodity parts in the communication architecture of a DSM multiprocessor offers advantages in cost and design time, the impact on application performance is unclear. We study this performance impact through detailed simulation, analytical modeling, and experiments on a flexible DSM prototype, using a range of parallel applications. We adapt the logP model to characterize the communication architectures of DSM machines. The l (network latency) and o (controller occupancy) parameters are the keys to performance in these machines, with the g (node-to-network bandwidth) parameter becoming important only for the fastest controllers. We show that, of all the logP parameters, controller occupancy has the greatest impact on application performance. Of the two contributions of occupancy to performance degradation (the latency it adds and the contention it induces), it is the contention component that governs performance regardless of network latency, showing a quadratic dependence on o. As expected, techniques to reduce the impact of latency make controller occupancy a greater bottleneck. Surprisingly, the performance impact of occupancy is substantial, even for highly tuned applications and even in the absence of latency hiding techniques. Scaling the problem size is often used as a technique to overcome limitations in communication latency and bandwidth. Through experiments on a DSM prototype, we show that there are important classes of applications for which the performance lost by using higher occupancy controllers cannot be regained easily, if at all, by scaling the problem size.
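The quadratic dependence on occupancy can be motivated with a textbook queueing sketch (not the paper's analytic model): treat the controller as an M/D/1 server that every protocol action holds for o cycles.

```latex
% Hedged sketch, not the paper's model: controller as an M/D/1 queue.
With request arrival rate $\lambda$ and deterministic service time $o$,
the mean queueing (contention) delay is
\[
  W = \frac{\lambda o^{2}}{2\,(1 - \lambda o)},
\]
so the total controller cost per request,
\[
  T = o + W = o + \frac{\lambda o^{2}}{2\,(1 - \lambda o)},
\]
carries a contention term quadratic in $o$, consistent with the quadratic
dependence on occupancy reported above.
```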