
Publication


Featured research published by Taylan Yemliha.


Programming Language Design and Implementation | 2010

Cache topology aware computation mapping for multicores

Mahmut T. Kandemir; Taylan Yemliha; SaiPrashanth Muralidhara; Shekhar Srikantaiah; Mary Jane Irwin; Yuanrui Zhang

The main contribution of this paper is a compiler-based, cache topology aware code optimization scheme for emerging multicore systems. This scheme distributes the iterations of a loop to be executed in parallel across the cores of a target multicore machine and schedules the iterations assigned to each core. Our goal is to improve the utilization of the on-chip multi-layer cache hierarchy and to maximize overall application performance. We evaluate our cache topology aware approach using a set of twelve applications and three different commercial multicore machines. In addition, to study some of our experimental parameters in detail and to explore future multicore machines (with higher core counts and deeper on-chip cache hierarchies), we also conduct a simulation-based study. The results collected from our experiments with three Intel multicore machines show that the proposed compiler-based approach is very effective in enhancing performance. In addition, our simulation results indicate that optimizing for the on-chip cache hierarchy will be even more important in future multicores with increasing numbers of cores and cache levels.
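
As a flavor of the mapping step, here is a minimal Python sketch that hands adjacent iteration blocks to cores sharing a cache; the topology encoding, the function name topology_aware_map, and the even-chunk policy are illustrative assumptions, not the paper's actual algorithm.

    def topology_aware_map(num_iters, cache_groups):
        # cache_groups: list of core-id lists, one list per shared cache
        # (hypothetical input format; the paper works inside a compiler)
        cores = [c for group in cache_groups for c in group]
        chunk = num_iters // len(cores)
        mapping, start = {}, 0
        for c in cores:
            mapping[c] = range(start, start + chunk)
            start += chunk
        # give any leftover iterations to the last core
        last = cores[-1]
        mapping[last] = range(mapping[last].start, num_iters)
        return mapping

    # Example: 4 cores, cores 0-1 share one L2, cores 2-3 share another
    print(topology_aware_map(10, [[0, 1], [2, 3]]))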


Measurement and Modeling of Computer Systems | 2011

Studying inter-core data reuse in multicores

Yuanrui Zhang; Mahmut T. Kandemir; Taylan Yemliha

Most existing research on emerging multicore machines focuses on parallelism extraction and architectural-level optimizations. While these optimizations are critical, complementary approaches such as data locality enhancement can also bring significant benefits. Most previous data locality optimization techniques have been proposed and evaluated in the context of single-core architectures. While one can expect these optimizations to be useful for multicore machines as well, multicores present further opportunities due to the shared on-chip caches most of them accommodate. In order to optimize data locality for multicore machines, however, the first step is to understand the data reuse characteristics of multithreaded applications and the potential benefits shared caches can bring. Motivated by these observations, we make the following contributions in this paper. First, we give a definition for inter-core data reuse and quantify it on multicores using a set of ten multithreaded application programs. Second, we show that neither the on-chip cache hierarchies of current multicore architectures nor state-of-the-art (single-core centric) code/data optimizations exploit the available inter-core data reuse in multithreaded applications. Third, we demonstrate that exploiting all available inter-core reuse could boost overall application performance by around 21.3% on average, indicating that there is significant scope for optimization. However, we also show that optimizing for inter-core reuse aggressively, without considering the impact of doing so on intra-core reuse, can actually perform worse than optimizing for intra-core reuse alone. Finally, we present a novel, compiler-based data locality optimization strategy for multicores that carefully balances inter-core and intra-core reuse optimizations to maximize the benefits that can be extracted from shared caches. Our experiments with this strategy reveal that it is very effective in optimizing data locality in multicores.
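
A crude way to quantify inter-core reuse, in the spirit of the paper's first contribution, is to count data blocks touched by more than one core. The sketch below assumes per-core block traces are available; the function name and the metric itself are simplifications, not the paper's definition.

    from collections import Counter

    def inter_core_reuse_fraction(traces):
        # traces: {core_id: list of data-block ids it accessed}
        # Returns the fraction of distinct blocks touched by 2+ cores,
        # a rough stand-in for inter-core data reuse.
        owners = Counter()
        for blocks in traces.values():
            for b in set(blocks):
                owners[b] += 1
        return sum(n > 1 for n in owners.values()) / len(owners)

    # blocks 2 and 3 are shared, blocks 1 and 4 are private -> 0.5
    print(inter_core_reuse_fraction({0: [1, 2, 3], 1: [3, 4], 2: [3, 2]}))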


Design Automation Conference | 2011

A helper thread based dynamic cache partitioning scheme for multithreaded applications

Mahmut T. Kandemir; Taylan Yemliha; Emre Kultursay

Focusing on the problem of how to partition the cache space given to a multithreaded application across its threads, we show that different threads of a multithreaded application can have different cache space requirements. We propose a fully automated, dynamic, intra-application cache partitioning scheme targeting emerging multicores with multilayer cache hierarchies, present a comprehensive experimental analysis of the proposed scheme, and show average improvements of 17.1% and 18.6% on the SPECOMP and PARSEC suites, respectively.
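
One way to picture such a dynamic partitioner is a helper loop that periodically redistributes cache ways in proportion to observed per-thread miss rates. The policy below is an assumed illustration of that idea, not the paper's scheme.

    def repartition(ways, miss_rates):
        # Give each thread at least one way; split the rest roughly in
        # proportion to measured miss rates (illustrative policy only).
        n = len(miss_rates)
        alloc = [1] * n
        spare = ways - n
        total = sum(miss_rates) or 1
        for i, m in enumerate(miss_rates):
            alloc[i] += int(spare * m / total)
        # hand out rounding leftovers to the hungriest threads
        leftover = ways - sum(alloc)
        for i in sorted(range(n), key=lambda i: -miss_rates[i])[:leftover]:
            alloc[i] += 1
        return alloc

    # e.g. [7, 1, 5, 3] ways for a 16-way cache and 4 threads
    print(repartition(16, [0.30, 0.05, 0.20, 0.10]))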


International Conference on Parallel and Distributed Systems | 2006

SPM conscious loop scheduling for embedded chip multiprocessors

Liping Xue; Mahmut T. Kandemir; Guangyu Chen; Taylan Yemliha

One of the major factors that can potentially slow down widespread use of embedded chip multiprocessors is the lack of efficient software support. In particular, automated code parallelizers are badly needed, since it is not realistic to expect an average programmer to parallelize a large, complex embedded application over multiple processors while taking into account several factors at the same time, such as code density, data locality, performance, power, and code resilience. Notably, the increasing use of software-managed SPM (scratch-pad memory) components in embedded systems requires SPM conscious code parallelization. Motivated by this observation, this paper proposes a novel compiler-based SPM conscious loop scheduling strategy for array/loop based embedded applications. This strategy tries to achieve two objectives. First, the sets of loop iterations assigned to different processors should take approximately the same amount of time to finish. Second, the set of iterations assigned to a processor should exhibit high data reuse. Satisfying these two objectives helps us minimize the parallel execution time of the application at hand. The specific method adopted by our scheduling strategy to achieve these objectives is to distribute loop iterations across parallel processors in an SPM conscious manner. In this strategy, the compiler analyzes the loop, identifies the potential SPM hits and misses, and distributes loop iterations over processors such that the processors have more or less the same execution time. Our experimental results so far indicate that the proposed approach generates much better results than existing loop schedulers. Specifically, it brings 18.9%, 22.4%, and 11.1% improvements in parallel execution time (with a chip multiprocessor of 8 cores) over a previously proposed static scheduler, a dynamic scheduler, and an alternate locality-conscious scheduler, respectively.
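
The balance objective can be illustrated with a greedy longest-processing-time assignment over per-chunk cost estimates. The cost model (derived, say, from predicted SPM hits and misses) and the interface below are assumptions for the sketch, not the paper's strategy.

    import heapq

    def spm_conscious_schedule(iter_costs, nprocs):
        # iter_costs: estimated cost per iteration chunk, e.g. miss-heavy
        # chunks cost more than SPM-hit chunks (assumed cost model).
        # Greedy LPT keeps per-processor finish times close together.
        heap = [(0.0, p, []) for p in range(nprocs)]
        heapq.heapify(heap)
        for it, cost in sorted(enumerate(iter_costs), key=lambda x: -x[1]):
            load, p, assigned = heapq.heappop(heap)  # least-loaded proc
            assigned.append(it)
            heapq.heappush(heap, (load + cost, p, assigned))
        return {p: assigned for _, p, assigned in heap}

    # both processors finish with a load of 8
    print(spm_conscious_schedule([5, 1, 1, 4, 2, 3], nprocs=2))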


International Conference on Computer Aided Design | 2008

SPM management using Markov chain based data access prediction

Taylan Yemliha; Shekhar Srikantaiah; Mahmut T. Kandemir; Ozcan Ozturk

Leveraging the power of scratchpad memories (SPMs) available in most embedded systems today is crucial to extracting maximum performance from application programs. While regular accesses like scalar values and array expressions with affine subscript functions have been tractable for compiler analysis (to be prefetched into SPM), irregular accesses like pointer accesses and indexed array accesses have not been easily amenable to compiler analysis. This paper presents an SPM management technique using Markov chain based data access prediction for such irregular accesses. Our approach takes advantage of inherent, but hidden, reuse in data accesses made by irregular references. We have implemented our proposed approach using an optimizing compiler. In this paper, we also present a thorough comparison of our different dynamic prediction schemes with other SPM management schemes. SPM management using our approaches produces 12.7% to 28.5% improvements in performance across a range of applications with both regular and irregular access patterns, with an average improvement of 20.8%.
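
The core idea, a first-order Markov model over data block accesses, can be sketched as follows; the class name and interface are illustrative, and the paper's compiler integration and SPM prefetch machinery are omitted.

    from collections import defaultdict, Counter

    class MarkovPredictor:
        # First-order Markov model over block ids: predict the most
        # likely next block after the current one (minimal sketch).
        def __init__(self):
            self.next_counts = defaultdict(Counter)
            self.prev = None

        def observe(self, block):
            if self.prev is not None:
                self.next_counts[self.prev][block] += 1
            self.prev = block

        def predict(self, block):
            counts = self.next_counts.get(block)
            return counts.most_common(1)[0][0] if counts else None

    m = MarkovPredictor()
    for b in [1, 2, 3, 1, 2, 4, 1, 2, 3]:
        m.observe(b)
    print(m.predict(2))  # -> 3 (seen twice after 2, vs 4 once)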


Symposium on Code Generation and Optimization | 2011

Neighborhood-aware data locality optimization for NoC-based multicores

Mahmut T. Kandemir; Yuanrui Zhang; Jun Liu; Taylan Yemliha

Data locality optimization is a critical issue for NoC (network-on-chip) based multicore systems. In this paper, focusing on a two-dimensional NoC-based multicore and data-intensive multithreaded applications, we first discuss a data locality aware scheduling algorithm for any given computation-to-core mapping, and then propose an integrated mapping+scheduling algorithm that performs both tasks together. Both of our algorithms consider temporal (time-wise) and spatial (neighborhood-aware) data reuse, and try to minimize the distance-to-data in on-chip cache accesses. We test the effectiveness of our compiler algorithms using a set of twelve application programs. Our experiments indicate that the proposed algorithms achieve significant improvements in data access latencies (42.7% on average) and overall execution times (24.1% on average). We also conduct a sensitivity analysis where we change the number of cores, on-chip cache capacities, and data movement (migration) strategies. These experiments show that our proposed algorithms generate consistently good results.
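
A toy version of neighborhood-aware placement: put each thread on the free core with the fewest mesh hops to the tile holding most of its data. The interface (a precomputed "home tile" per thread) and the greedy order below are assumptions, not the paper's integrated algorithm.

    def manhattan(a, b):
        # hop distance on a 2D mesh NoC
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def place_threads(thread_home_tiles, mesh_w, mesh_h):
        # Greedy: each thread goes to the free core closest to the tile
        # caching most of its data (illustrative policy only).
        free = {(x, y) for x in range(mesh_w) for y in range(mesh_h)}
        placement = {}
        for t, home in thread_home_tiles.items():
            best = min(free, key=lambda c: manhattan(c, home))
            placement[t] = best
            free.remove(best)
        return placement

    print(place_threads({"t0": (0, 0), "t1": (3, 3), "t2": (0, 1)}, 4, 4))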


Design, Automation, and Test in Europe | 2007

Memory bank aware dynamic loop scheduling

Mahmut T. Kandemir; Taylan Yemliha; Seung Woo Son; Ozcan Ozturk

In a parallel system with multiple CPUs, one of the key problems is how to assign loop iterations to processors. This problem, known as the loop scheduling problem, has been studied in the past, and several schemes, both static and dynamic, have been proposed. One of the attractive features of dynamic schemes, as compared to their static counterparts, is their ability to exploit the latency variations across the execution times of different loop iterations. In all the dynamic loop scheduling techniques proposed in the literature so far, performance has been the primary metric of interest. In a battery-operated embedded execution environment, however, power consumption is another metric to consider during iteration-to-processor assignment. In particular, in a banked memory system, this assignment can have an important impact on memory power consumption, which can be a significant portion of the overall energy consumption, especially for data-intensive embedded applications such as those from the domain of image data processing. This paper presents a bank aware dynamic loop scheduling scheme for array-intensive embedded media applications. The goal behind this new scheduling scheme is to minimize the number of memory banks that need to be used for executing the current working set (group of loop iterations) when all processors are considered together. That is, during loop iteration-to-processor assignment, our approach considers the bank access patterns of loop iterations and carefully selects the set of iterations to assign to an idle processor so that, if possible, the number of memory banks in use is not increased. Our experimental results show that the proposed scheduling scheme leads to much better energy results than prior loop scheduling techniques, and it is also competitive with the scheduler that generates the best performance. To our knowledge, this is the first dynamic loop scheduling scheme that is memory bank aware.
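
The bank-conscious selection step might look like the following sketch, which picks the pending iteration chunk that powers on the fewest new banks; the data structures are hypothetical stand-ins for the compiler's bank access analysis.

    def pick_chunk(pending, active_banks):
        # pending: {chunk_id: banks that chunk touches} (assumed format).
        # Prefer the chunk that activates the fewest new banks, so banks
        # in low-power mode can stay there; ties fall to dict order.
        return min(pending, key=lambda c: len(set(pending[c]) - active_banks))

    pending = {"c0": [0, 1], "c1": [2, 3], "c2": [1, 4]}
    print(pick_chunk(pending, active_banks={0, 1}))  # -> "c0" (0 new banks)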


International Conference on Computer Aided Design | 2008

Integrated code and data placement in two-dimensional mesh based chip multiprocessors

Taylan Yemliha; Shekhar Srikantaiah; Mahmut T. Kandemir; Mustafa Karaköy; Mary Jane Irwin

As transistor sizes continue to shrink and the number of transistors per chip keeps increasing, chip multiprocessors (CMPs) are becoming a promising alternative for remaining on the current performance trajectory for both high-end and embedded systems. Since future technologies offer the promise of integrating billions of transistors on a chip, the prospect of having hundreds to thousands of processors on a single chip, along with an underlying memory hierarchy and an interconnection system, is entirely feasible. This paper proposes a compiler directed integrated code and data placement scheme for two-dimensional mesh based CMP architectures. The proposed approach uses a Code-Data Affinity Graph (CDAG) to represent the relationship between loop iterations and array data, and then assigns sets of loop iterations to processing cores and sets of data blocks to on-chip memories. During the mapping process, the on-chip memory capacity, the load imbalance across different cores, and the topology of the NoC are taken into account. In this paper, we present two variants of our approach, depth-first placement (DFP) and breadth-first placement (BFP), and compare them to three alternate code/data mapping schemes. The experimental evaluation shows that our CDAG based placement schemes are very successful in practice, achieving average performance improvements of 19.9% (DFP) and 16.8% (BFP), and average energy improvements of 29.7% (DFP) and 27.8% (BFP).
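
To illustrate how affinity information could drive data placement, the sketch below puts each data block in the memory of the core with the highest access count for it. The flattened affinity map is an assumed stand-in for the paper's CDAG, and the real DFP/BFP schemes also weigh memory capacity, load balance, and NoC topology, which this toy ignores.

    def place_data(affinity):
        # affinity: {(core, block): access count} (hypothetical format).
        # Place each block in the on-chip memory of its heaviest user.
        best = {}
        for (core, block), w in affinity.items():
            if block not in best or w > best[block][0]:
                best[block] = (w, core)
        return {block: core for block, (w, core) in best.items()}

    print(place_data({(0, "A"): 5, (1, "A"): 2, (1, "B"): 7}))  # {'A': 0, 'B': 1}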


International Conference on VLSI Design | 2007

Compiler-Directed Code Restructuring for Operating with Compressed Arrays

Taylan Yemliha; Guangyu Chen; Ozcan Ozturk; Mahmut T. Kandemir; Vijay Degalahal

Memory system utilization is an important issue for many embedded systems that operate under tight memory limitations. This is a strong motivation for recent research on reducing the number of banks required during execution of a given application. Reducing the memory space requirements of an application can bring three potential benefits. First, if we are to design a customized memory system for a given embedded application, reducing its memory requirements can cut the overall cost. Second, if we are to execute our application in a multi-programmed environment, the saved memory space can be used by other applications, thereby increasing the degree of multiprogramming. Third, it is also possible to reduce the energy consumption in a banked memory system by reducing the amount of memory space occupied by application data and placing the unused banks into low-power operating modes. This paper proposes a novel memory saving strategy for array-dominated embedded applications. The idea is to exploit the value locality in array data (e.g., from the multimedia domain) by storing arrays in a compressed form to save memory space. Based on the compressed forms of the input arrays, our compiler-based approach automatically determines the compressed forms of the intermediate and output arrays, and also automatically restructures the application code so that we can reduce execution time as well (by exploiting value locality). Our experimental results show that this scheme is very effective in practice and reduces the memory space requirements of the tested applications by 19% on average. The experimental results also show that the proposed approach reduces the execution cycles of the original codes by 14% on average.
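
Value locality can be exploited with something as simple as run-length encoding, and some operations can then run on the compressed form directly. This is only an analogy to the paper's approach, whose exact compression format is not described here.

    def rle_encode(arr):
        # Run-length encode an array, exploiting value locality.
        runs = []
        for v in arr:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    def rle_scale(runs, k):
        # Operate directly on the compressed form: one multiply per run
        # instead of one per element.
        return [[v * k, n] for v, n in runs]

    print(rle_scale(rle_encode([0, 0, 0, 7, 7, 1]), 2))  # [[0, 3], [14, 2], [2, 1]]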


Cluster Computing and the Grid | 2012

On Urgency of I/O Operations

Mahmut T. Kandemir; Taylan Yemliha; Ramya Prabhakar; Myoungsoo Jung

Many high-performance parallel file systems and storage hierarchies employ multilayer storage caches in an attempt to reduce data access latencies. In current storage cache hierarchies, all data requests are treated uniformly, and hit/miss characteristics are dictated only by the degree of reuse exhibited by data blocks. In reality, however, different I/O operations may have different urgencies (criticalities); in particular, some I/O operations can be delayed without a major impact on overall application performance. Motivated by this observation, we define the concept of I/O operation urgency (criticality) and study the critical latencies of I/O operations for a set of seven high-performance applications that manipulate disk-resident data sets. We propose and experimentally evaluate three profile-based strategies for exploiting urgent I/O operations in managing storage caches. The results collected with these schemes on both two-tier and three-tier systems indicate that significant performance improvements are possible if one can exploit the urgencies of different I/O operations in managing storage caches.
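
A toy urgency-aware cache makes the idea concrete: on eviction, drop the block with the lowest urgency first. The two-field score and the class interface below are assumptions for illustration, not the paper's profile-based strategies.

    class UrgencyCache:
        # Toy storage cache: evict the block with the lowest urgency
        # score first, least-recently-used among ties (sketch only).
        def __init__(self, capacity):
            self.capacity = capacity
            self.blocks = {}   # block -> (urgency, last_access_time)
            self.clock = 0

        def access(self, block, urgency):
            self.clock += 1
            if block not in self.blocks and len(self.blocks) >= self.capacity:
                victim = min(self.blocks, key=lambda b: self.blocks[b])
                del self.blocks[victim]
            self.blocks[block] = (urgency, self.clock)

    c = UrgencyCache(2)
    c.access("a", urgency=2)
    c.access("b", urgency=0)
    c.access("c", urgency=1)   # evicts "b" (lowest urgency)
    print(sorted(c.blocks))    # ['a', 'c']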

Collaboration


Dive into Taylan Yemliha's collaborations.

Top Co-Authors

Mahmut T. Kandemir (Pennsylvania State University)
Shekhar Srikantaiah (Pennsylvania State University)
Emre Kultursay (Pennsylvania State University)
Guangyu Chen (Pennsylvania State University)
Mary Jane Irwin (Pennsylvania State University)
Yuanrui Zhang (Pennsylvania State University)
Jun Liu (Pennsylvania State University)
Liping Xue (Pennsylvania State University)