Emre Kultursay | Researchain

Publication

Featured research published by Emre Kultursay.

International Symposium on Performance Analysis of Systems and Software | 2013

Evaluating STT-RAM as an energy-efficient main memory alternative

Emre Kultursay; Mahmut T. Kandemir; Anand Sivasubramaniam; Onur Mutlu

In this paper, we explore the possibility of using STT-RAM technology to completely replace DRAM in main memory. Our goal is to make STT-RAM performance comparable to DRAM while providing substantial power savings. Towards this goal, we first analyze the performance and energy of STT-RAM, and then identify key optimizations that can be employed to improve its characteristics. Specifically, using partial write and row buffer write bypass, we show that STT-RAM main memory performance and energy can be significantly improved. Our experiments indicate that an optimized, equal capacity STT-RAM main memory can provide performance comparable to DRAM main memory, with an average 60% reduction in main memory energy.
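The abstract names partial write as one of its two optimizations but does not describe the mechanism. As a rough illustration, assuming partial write means writing back only the blocks of a row buffer that were modified while the row was open, the Python sketch below compares the number of blocks written with and without it. The `RowBuffer` class, its fields, and the block granularity are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class RowBuffer:
    """Toy row buffer that remembers which blocks were modified (hypothetical model)."""
    blocks_per_row: int
    dirty: set = field(default_factory=set)

    def write(self, block_index: int) -> None:
        """Record a write to one block of the currently open row."""
        if not 0 <= block_index < self.blocks_per_row:
            raise IndexError("block index outside the row")
        self.dirty.add(block_index)

    def writeback_cost(self, partial: bool) -> int:
        """Blocks written to the memory array when the row is closed.

        A clean row needs no write-back. Otherwise a full write-back stores
        every block, while a partial write stores only the dirty ones.
        """
        if not self.dirty:
            return 0
        return len(self.dirty) if partial else self.blocks_per_row
```

Under this model, a row with two modified blocks out of sixteen costs two block writes with partial write and sixteen without it.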

IEEE Micro | 2013

Steep-Slope Devices: From Dark to Dim Silicon

Karthik Swaminathan; Emre Kultursay; Vinay Saripalli; Vijaykrishnan Narayanan; Mahmut T. Kandemir; Suman Datta

Although the superior subthreshold characteristics of steep-slope devices can help power up more cores, researchers still need CMOS technology to accelerate sequential applications, because it can reach higher frequencies. Device-level heterogeneous multicores can give the best of both worlds, but they need smart resource management to realize this promise. In this article, the authors discuss device-level heterogeneous multicores and various resource-management schemes for reaching higher energy efficiency.

High-Performance Computer Architecture | 2011

MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy

Shekhar Srikantaiah; Emre Kultursay; Tao Zhang; Mahmut T. Kandemir; Mary Jane Irwin; Yuan Xie

Given the diverse range of application characteristics that chip multiprocessors (CMPs) need to cater to, a “one-cache-topology-fits-all” design philosophy will clearly be inadequate. In this paper, we propose MorphCache, a Reconfigurable Adaptive Multi-level Cache hierarchy. MorphCache dynamically tunes a multi-level cache topology in a CMP to allow significantly different cache topologies to exist on the same architecture. Starting from per-core L2 and L3 cache slices as the basic design point, MorphCache alters the cache topology dynamically by merging or splitting cache slices and modifying the accessibility of different cache slice groups to different cores in a CMP. We evaluated MorphCache on a 16-core CMP on a full system simulator and found that it significantly improves both average throughput and harmonic mean of speedups of diverse multithreaded and multiprogrammed workloads. Specifically, our results show that MorphCache improves throughput of the multiprogrammed mixes by 29.9% over a topology with all-shared L2 and L3 caches and 27.9% over a topology with per-core private L2 cache and shared L3 cache. In addition, we also compared MorphCache to partitioning a single shared cache at each level using promotion/insertion pseudo-partitioning (PIPP) [28] and managing per-core private cache at each level using dynamic spill receive caches (DSR) [18]. We found that MorphCache improves average throughput by 6.6% over PIPP and by 5.7% over DSR when applied to both L2 and L3 caches.
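The two metrics quoted above can be computed as in the following Python sketch. It assumes that throughput is the sum of per-application IPC values and that a speedup is the ratio of an application's IPC to its baseline IPC; these are common definitions, but the abstract does not state which ones the paper uses. The function names are illustrative.

```python
def throughput(ipcs):
    """Sum of per-application IPC values (an assumed throughput definition)."""
    return sum(ipcs)


def harmonic_mean_of_speedups(baseline_ipcs, new_ipcs):
    """Harmonic mean of per-application speedups (new IPC / baseline IPC).

    All IPC values are assumed to be positive. The harmonic mean penalizes
    configurations that speed up some applications while slowing others.
    """
    if not baseline_ipcs or len(baseline_ipcs) != len(new_ipcs):
        raise ValueError("need two non-empty lists of equal length")
    speedups = [new / base for base, new in zip(baseline_ipcs, new_ipcs)]
    return len(speedups) / sum(1.0 / s for s in speedups)
```

For example, speedups of 2.0 and 1.0 give a harmonic mean of 4/3, lower than their arithmetic mean of 1.5.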

International Conference on Hardware/Software Codesign and System Synthesis | 2012

Performance enhancement under power constraints using heterogeneous CMOS-TFET multicores

Emre Kultursay; Karthik Swaminathan; Vinay Saripalli; Vijaykrishnan Narayanan; Mahmut T. Kandemir; Suman Datta

Device level heterogeneity promises high energy efficiency over a larger range of voltages than a single device technology alone can provide. In this paper, starting from device models, we first present ground-up modeling of CMOS and TFET cores, and verify this model against existing processors. Using our core models, we construct a 32-core TFET-CMOS heterogeneous multicore. We then show that it is a very challenging task to identify the ideal runtime configuration to use in such a heterogeneous multicore, which includes finding the best number/type of cores to activate and the corresponding voltages/frequencies to select for these cores. In order to effectively utilize this heterogeneous processor, we propose a novel automated runtime scheme. Our scheme is designed to automatically improve the performance of applications running on heterogeneous CMOS-TFET multicores operating under a fixed power budget, without requiring any effort from the application programmer or the user. Our scheme combines heterogeneous thread-to-core mapping, dynamic work partitioning, and dynamic power partitioning to identify energy efficient operating points. With simulations we show that our runtime scheme can enable a CMOS-TFET multicore to serve a diversity of workloads with high energy efficiency and achieve 21% average speedup over the best performing equivalent homogeneous multicore.

International Symposium on Low Power Electronics and Design | 2011

Improving energy efficiency of multi-threaded applications using heterogeneous CMOS-TFET multicores

Karthik Swaminathan; Emre Kultursay; Vinay Saripalli; Vijaykrishnan Narayanan; Mahmut T. Kandemir; Suman Datta

Energy-Delay-Product-aware DVFS is a widely-used technique that improves energy efficiency by dynamically adjusting the frequencies of cores. Further, for multithreaded applications, barrier-aware DVFS is a method that can dynamically tune the frequencies of cores to reduce barrier stall times and achieve higher energy efficiency. In both forms of DVFS, frequencies of cores are reduced from the maximum value to achieve better energy efficiency. TFET devices operate at energy efficiencies that cannot be achieved by CMOS devices. This advantage of TFET devices can be exploited in the context of multicore processors by replacing some of the CMOS cores with energy efficient TFET alternatives. However, the energy benefits of TFET devices are observed at relatively lower voltages, which results in a degradation in performance due to executing at lower frequencies. Although applications cannot be limited to run always at such lower frequencies, it can be significantly beneficial from an energy efficiency perspective to make use of energy efficient TFET cores during the times applications spend at these frequencies. In this paper, we show that due to EDP-aware DVFS and barrier-aware DVFS, multithreaded applications run for a significant portion of their execution time at frequencies at which TFET cores are more energy efficient. We further show that, at those frequencies, dynamically migrating threads to TFET cores can achieve average leakage and dynamic energy savings of 30% and 17%, respectively, with a performance degradation of less than 1%.
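As a small illustration of what "EDP-aware" means, the Python sketch below computes the energy-delay product for a fixed amount of work at several operating points and picks the frequency that minimizes it. The operating points in the usage note are made-up numbers, not device data from the paper, and the function names are hypothetical.

```python
def energy_delay_product(power_watts, freq_ghz, work_gcycles):
    """EDP for a fixed amount of work at one operating point."""
    delay = work_gcycles / freq_ghz   # seconds
    energy = power_watts * delay      # joules
    return energy * delay             # joule-seconds


def best_edp_frequency(operating_points, work_gcycles):
    """Return the frequency (GHz) with the lowest EDP.

    operating_points maps frequency in GHz to average power in watts.
    """
    return min(
        operating_points,
        key=lambda f: energy_delay_product(operating_points[f], f, work_gcycles),
    )
```

With the hypothetical points `{1.0: 1.0, 2.0: 3.0, 3.0: 12.0}` and 6 Gcycles of work, the EDP values are 36, 27, and 48, so the middle frequency is selected rather than the maximum.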

International Symposium on Microarchitecture | 2012

Addressing End-to-End Memory Access Latency in NoC-Based Multicores

Akbar Sharifi; Emre Kultursay; Mahmut T. Kandemir; Chita R. Das

To achieve high performance in emerging multicores, it is crucial to reduce the number of memory accesses that suffer from very high latencies. However, this should be done with care, as improving the latency of one access can worsen the latency of another as a result of resource sharing. Therefore, the goal should be to balance latencies of memory accesses issued by an application in an execution phase, while ensuring a low average latency value. Targeting Network-on-Chip (NoC) based multicores, we propose two network prioritization schemes that can cooperatively improve performance by reducing end-to-end memory access latencies. Our first scheme prioritizes memory response messages such that, in a given period of time, messages of an application that experience higher latencies than the average message latency for that application are expedited and a more uniform memory latency pattern is achieved. Our second scheme prioritizes the request messages that are destined for idle memory banks over others, with the goal of improving bank utilization and preventing long queues from being built in front of the memory banks. These two network prioritization-based optimizations together lead to uniform memory access latencies with a low average value. Our experiments with a 4x8 mesh network-based multicore show that, when applied together, our schemes can achieve 15%, 10% and 13% performance improvement on memory intensive, memory non-intensive, and mixed multiprogrammed workloads, respectively.
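The two prioritization rules described above can be sketched in Python as follows. This is a simplified software model over a snapshot of in-flight messages, not the hardware mechanism from the paper; the tuple layouts and function names are assumptions made for illustration.

```python
from collections import defaultdict


def mark_late_responses(messages):
    """First rule: flag responses slower than their application's average.

    messages is a list of (app_id, latency) pairs. The result is a list of
    booleans, True where the message should be expedited.
    """
    totals = defaultdict(lambda: [0.0, 0])  # app_id -> [latency sum, count]
    for app, latency in messages:
        totals[app][0] += latency
        totals[app][1] += 1
    return [latency > totals[app][0] / totals[app][1] for app, latency in messages]


def order_requests(requests, idle_banks):
    """Second rule: serve requests headed for idle banks first.

    requests is a list of (request_id, bank) pairs. sorted() is stable, so
    the original order is preserved within each priority class.
    """
    return sorted(requests, key=lambda req: req[1] not in idle_banks)
```

For instance, with latencies of 10 and 30 for one application, only the second message exceeds that application's average of 20 and is expedited.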

International Conference on Parallel Architectures and Compilation Techniques | 2013

Meeting midway: improving CMP performance with memory-side prefetching

Praveen Yedlapalli; Jagadish Kotra; Emre Kultursay; Mahmut T. Kandemir; Chita R. Das; Anand Sivasubramaniam

Both on-chip resource contention and off-chip latencies have a significant impact on memory requests in large-scale chip multiprocessors. We propose a memory-side prefetcher, which brings data on-chip from DRAM, but does not proactively push this data further to the cores/caches. Sitting close to memory, it uses its detailed knowledge of DRAM state and memory channels to exploit DRAM row buffer locality and bring data (from the current row buffer) on-chip ahead of need. This not only reduces the number of off-chip accesses for demand requests, but also reduces row buffer conflicts, effectively improving DRAM access times. At the same time, our prefetcher maintains this data in a small buffer at each memory controller instead of pushing it into the caches to avoid on-chip resource contention. We show that the proposed memory-side prefetcher outperforms a state-of-the-art core-side prefetcher and an existing memory-side prefetcher. More importantly, our prefetcher can also work in tandem with the core-side prefetcher to amplify the benefits. Using a wide range of multiprogrammed and multi-threaded workloads, we show that this memory-side prefetcher provides IPC improvements of 6.2% (maximum of 33.6%), and 10% (maximum of 49.6%), on average when running alone and when combined with a core-side prefetcher, respectively. By meeting requests midway, our solution reduces the off-chip latencies while avoiding the on-chip resource contention caused by inaccurate and ill-timed prefetches.
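The idea of holding prefetched data in a small buffer at the memory controller can be illustrated with the Python sketch below. It assumes a simple next-block policy limited to the currently open row and LRU replacement in the buffer; the abstract does not specify either, so the class, its parameters, and the policy are hypothetical.

```python
from collections import OrderedDict


class MemorySidePrefetchBuffer:
    """Toy prefetch buffer located at a memory controller (hypothetical model)."""

    def __init__(self, capacity, blocks_per_row, degree):
        self.capacity = capacity              # blocks the buffer can hold
        self.blocks_per_row = blocks_per_row  # blocks in one DRAM row
        self.degree = degree                  # blocks prefetched per access
        self.buffer = OrderedDict()           # block -> None, oldest first

    def _insert(self, block):
        """Add a block, evicting the least recently used one if full."""
        self.buffer[block] = None
        self.buffer.move_to_end(block)
        if len(self.buffer) > self.capacity:
            self.buffer.popitem(last=False)

    def access(self, block):
        """Handle a demand request; return True if the buffer served it."""
        hit = block in self.buffer
        if hit:
            self.buffer.move_to_end(block)
        # Prefetch the following blocks, but only from the same (open) row,
        # so that prefetching never forces a new row activation.
        row = block // self.blocks_per_row
        for nxt in range(block + 1, block + 1 + self.degree):
            if nxt // self.blocks_per_row != row:
                break
            self._insert(nxt)
        return hit
```

In this model, a sequential stream misses on its first block and then hits in the buffer until it crosses a row boundary, and nothing is pushed into the caches.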

Design Automation Conference | 2011

A helper thread based dynamic cache partitioning scheme for multithreaded applications

Mahmut T. Kandemir; Taylan Yemliha; Emre Kultursay

Focusing on the problem of how to partition the cache space given to a multithreaded application across its threads, we show that different threads of a multithreaded application can have different cache space requirements, propose a fully automated, dynamic, intra-application cache partitioning scheme targeting emerging multicores with multilayer cache hierarchies, present a comprehensive experimental analysis of the proposed scheme, and show average improvements of 17.1% and 18.6% in SPECOMP and PARSEC suites.

European Conference on Parallel Processing | 2010

Scalable parallelization strategies to accelerate NuFFT data translation on multicores

Yuanrui Zhang; Jun Liu; Emre Kultursay; Mahmut T. Kandemir; Nikos P. Pitsianis; Xiaobai Sun

The non-uniform FFT (NuFFT) has been widely used in many applications. In this paper, we propose two new scalable parallelization strategies to accelerate the data translation step of the NuFFT on multicore machines. Both schemes employ geometric tiling and binning to exploit data locality, and use recursive partitioning and scheduling with dynamic task allocation to achieve load balancing. The experimental results collected from a commercial multicore machine show that, with the help of our parallelization strategies, the data translation step is no longer the bottleneck in the NuFFT computation, even for large data set sizes, with any input sample distribution.
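The binning step mentioned above can be illustrated with a short Python sketch that groups non-uniform 2D sample coordinates into square tiles, so that samples touching the same region of the grid are processed together. The function name, the 2D setting, and the fixed tile size are assumptions; the paper's recursive partitioning and dynamic scheduling are not modeled here.

```python
from collections import defaultdict


def bin_samples(coords, tile_size):
    """Group sample indices by the tile that contains each sample.

    coords is a list of (x, y) positions. The result maps a tile index
    (tx, ty) to the list of sample indices that fall inside that tile.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    bins = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        tile = (int(x // tile_size), int(y // tile_size))
        bins[tile].append(i)
    return dict(bins)
```

Each resulting bin could then be handed to a worker as one task. With a skewed sample distribution the bins differ greatly in size, which is the load imbalance that the dynamic task allocation described in the abstract is meant to address.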

Programming Language Design and Implementation | 2015

Optimizing off-chip accesses in multicores

Wei Ding; Xulong Tang; Mahmut T. Kandemir; Yuanrui Zhang; Emre Kultursay

In a network-on-chip (NoC) based manycore architecture, an off-chip data access (main memory access) needs to travel through the on-chip network, spending a considerable amount of time within the chip (in addition to the memory access latency). In addition, it contends with on-chip (cache) accesses as both use the same NoC resources. In this paper, focusing on data-parallel, multithreaded applications, we propose a compiler-based off-chip data access localization strategy, which places data elements in the memory space such that an off-chip access traverses a minimum number of links (hops) to reach the memory controller that handles this access. This brings three main benefits. First, the network latency of off-chip accesses is reduced; second, the network latency of on-chip accesses is reduced; and finally, the memory latency of off-chip accesses improves, due to reduced queue latencies. We present an experimental evaluation of our optimization strategy using a set of 13 multithreaded application programs under both private and shared last-level caches. The results collected emphasize the importance of optimizing the off-chip data accesses.
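The hop-count objective can be made concrete with the Python sketch below. It assumes a 2D mesh with dimension-ordered routing, so that the number of links between two nodes is their Manhattan distance; the abstract does not state the topology or routing, and the controller positions in the usage note are hypothetical. The sketch shows only the distance metric, not the compiler's data layout transformation.

```python
def hops(src, dst):
    """Links traversed between two mesh nodes given as (row, col) pairs."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def nearest_controller(core, controllers):
    """Pick the memory controller that is the fewest hops from a core.

    Ties are broken in favor of the controller listed first.
    """
    return min(controllers, key=lambda mc: hops(core, mc))
```

For a 4x8 mesh with controllers at the four corners, a core at (1, 5) is 3 hops from the controller at (0, 7) and 6 hops from the one at (0, 0), so data accessed by that core would ideally be placed in memory served by the former.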
