Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Suleyman Sair is active.

Publication


Featured research published by Suleyman Sair.


International Symposium on Computer Architecture | 2003

Phase tracking and prediction

Timothy Sherwood; Suleyman Sair; Brad Calder

In a single second a modern processor can execute billions of instructions. Obtaining a bird's-eye view of the behavior of a program at these speeds can be a difficult task when all that is available is cycle-by-cycle examination. In many programs, behavior is anything but steady state, and understanding the patterns of behavior, at run-time, can unlock a multitude of optimization opportunities. In this paper, we present a unified profiling architecture that can efficiently capture, classify, and predict phase-based program behavior on the largest of time scales. By examining the proportion of instructions that were executed from different sections of code, we can find generic phases that correspond to changes in behavior across many metrics. By classifying phases generically, we avoid the need to identify phases for each optimization, and enable a unified prediction scheme that can forecast future behavior. Our analysis shows that our design can capture phases that account for over 80% of execution using less than 500 bytes of on-chip memory.
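As a rough illustration of the interval-based phase classification described above, the following Python sketch hashes executed code-region IDs into a small signature vector per interval and opens a new phase whenever no past signature is within a Manhattan-distance threshold. The 32-bucket vector, the 0.25 threshold, and the toy trace are invented for the example; they stand in for the paper's hardware accumulator table and its tuned parameters.

```python
# Minimal sketch of interval-based phase classification.
# Bucket count, threshold, and the trace are illustrative assumptions.

def signature(region_ids, buckets=32):
    """Hash each executed code region into a small vector of counts."""
    vec = [0] * buckets
    for r in region_ids:
        vec[hash(r) % buckets] += 1
    total = sum(vec) or 1
    return [c / total for c in vec]  # normalize per interval

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def classify(intervals, threshold=0.25):
    """Assign each interval a phase ID; open a new phase if nothing is close."""
    phases, ids = [], []
    for regions in intervals:
        sig = signature(regions)
        best = min(range(len(phases)),
                   key=lambda i: manhattan(sig, phases[i]),
                   default=None)
        if best is not None and manhattan(sig, phases[best]) < threshold:
            ids.append(best)
        else:
            phases.append(sig)
            ids.append(len(phases) - 1)
    return ids

# Toy trace: two alternating behaviors collapse into two phases.
trace = [["A", "B"] * 50, ["C", "D"] * 50, ["A", "B"] * 50]
print(classify(trace))  # e.g. [0, 1, 0]
```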


International Symposium on Microarchitecture | 2003

Discovering and exploiting program phases

Timothy Sherwood; Erez Perelman; Greg Hamerly; Suleyman Sair; Brad Calder

Understanding program behavior is at the foundation of computer architecture and program optimization. Many programs have wildly different behavior on even the largest of scales (that is, over the program's complete execution). During one part of the execution, a program can be completely memory bound; in another, it can repeatedly stall on branch mispredicts. Average statistics gathered about a program might not accurately show where the real problems lie. This realization has ramifications for many architecture and compiler techniques, from how best to schedule threads on a multithreaded machine, to feedback-directed optimizations, power management, and the simulation and test of architectures. Taking advantage of time-varying behavior requires a set of automated analytic tools and hardware techniques that can discover similarities and changes in program behavior on the largest of time scales. The challenge in building such tools is that during a program's lifetime it can execute billions or trillions of instructions. How can high-level behavior be extracted from this sea of instructions? Some programs change behavior drastically, switching between periods of high and low performance, yet system design and optimization typically focus on average system behavior. It is argued that instead of assuming average behavior, it is now time to model and optimize phase-based program behavior.
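The discovery side of this work groups intervals whose behavior signatures look alike. The toy k-means below shows the flavor of such offline clustering; the two-element signatures, k=2, and the fixed iteration count are fabricated for the demo and are not the authors' actual tooling.

```python
# Toy k-means over per-interval behavior signatures; all data invented.
import random

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def kmeans(points, k=2, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist(p, centers[c]))].append(p)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]

# Hypothetical signatures: memory-bound vs. compute-bound intervals.
sigs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(kmeans(sigs))  # e.g. [0, 0, 1, 1]
```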


International Symposium on Microarchitecture | 2002

Pointer cache assisted prefetching

Jamison D. Collins; Suleyman Sair; Brad Calder; Dean M. Tullsen

Data prefetching effectively reduces the negative effects of long load latencies on the performance of modern processors. Hardware prefetchers employ hardware structures to predict future memory addresses based on previous patterns. Thread-based prefetchers use portions of the actual program code to determine future load addresses for prefetching. This paper proposes the use of a pointer cache, which tracks pointer transitions, to aid prefetching. The pointer cache provides, for a given pointer's effective address, the base address of the object pointed to by the pointer. We examine using the pointer cache in a wide issue superscalar processor as a value predictor and to aid prefetching when a chain of pointers is being traversed. When a load misses in the L1 cache, but hits in the pointer cache, the first two cache blocks of the pointed-to object are prefetched. In addition, the load's dependencies are broken by using the pointer cache hit as a value prediction. We also examine using the pointer cache to allow speculative precomputation to run farther ahead of the main thread of execution than in prior studies. Previously proposed thread-based prefetchers are limited in how far they can run ahead of the main thread when traversing a chain of recurrent dependent loads. When combined with the pointer cache, a speculative thread can make better progress ahead of the main thread, rapidly traversing data structures in the face of cache misses caused by pointer transitions.
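To make the pointer-cache mechanism concrete, here is a minimal Python sketch of the lookup path: a table maps a pointer's effective address to the base of the object it points to, and an L1 miss that hits in the pointer cache triggers prefetches of the first two blocks plus a value prediction. The direct-mapped organization, entry count, and 64-byte block size are assumptions for illustration, not the paper's exact hardware.

```python
# Hedged sketch of a pointer cache; organization and sizes are assumed.

BLOCK = 64  # cache block size in bytes (assumption)

class PointerCache:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = {}  # index -> (tag, target object base address)

    def _index_tag(self, addr):
        line = addr // BLOCK
        return line % self.entries, line // self.entries

    def update(self, ptr_addr, target_base):
        """Record a pointer transition when a pointer load commits."""
        idx, tag = self._index_tag(ptr_addr)
        self.table[idx] = (tag, target_base)

    def lookup(self, ptr_addr):
        idx, tag = self._index_tag(ptr_addr)
        entry = self.table.get(idx)
        return entry[1] if entry and entry[0] == tag else None

def on_l1_miss(pc, ptr_addr, prefetch, predict):
    """L1 miss + pointer-cache hit: prefetch the first two blocks of
    the pointed-to object and supply its base as a value prediction."""
    base = pc.lookup(ptr_addr)
    if base is not None:
        prefetch(base)
        prefetch(base + BLOCK)
        predict(base)  # breaks the load's dependence chain speculatively

pc = PointerCache()
pc.update(0x1000, 0x8000)             # observed *0x1000 -> object at 0x8000
on_l1_miss(pc, 0x1000, print, print)  # prefetches 0x8000, 0x8040; predicts 0x8000
```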


IEEE Transactions on Computers | 2003

A decoupled predictor-directed stream prefetching architecture

Suleyman Sair; Timothy Sherwood; Brad Calder

An effective method for reducing the effect of load latency in modern processors is data prefetching. One form of hardware-based data prefetching, stream buffers, has been shown to be particularly effective due to its ability to detect data streams and run ahead of them, prefetching as it goes. Unfortunately, in the past, the applicability of streaming was limited to stride intensive code. In this paper, we propose predictor-directed stream buffers (PSB), which allow a stream buffer to follow a general address prediction stream instead of a fixed stride. A general address prediction stream complicates the allocation of both stream buffer and memory resources because the predictions generated will not be as reliable as prior sequential next-line and stride-based stream buffer implementations. To address this, we examine using confidence-based techniques to guide the allocation and prioritization of stream buffers and their prefetch requests. Our results show, when using PSB on a benchmark suite heavy in pointer-based applications, that PSB provides a 23 percent speedup on average over the best previous stream buffer implementation, and an improvement of 75 percent over using no prefetching at all.
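The following sketch shows the core PSB idea in Python: a stream buffer follows whatever address an external predictor emits, and a small saturating confidence counter decides how far ahead it may run. The four-entry depth, the 2-bit counter, and the toy pointer chain are illustrative assumptions.

```python
# Minimal predictor-directed stream buffer; sizes and thresholds assumed.

class StreamBuffer:
    def __init__(self, depth=4):
        self.depth = depth
        self.queue = []      # prefetched addresses awaiting use
        self.confidence = 0  # 2-bit saturating counter (0..3)

    def allocate(self, miss_addr, predictor):
        self.queue.clear()
        addr = miss_addr
        # Run far ahead only once the predictor has earned confidence.
        ahead = 1 if self.confidence < 2 else self.depth
        for _ in range(ahead):
            addr = predictor(addr)  # next predicted address in the stream
            self.queue.append(addr)

    def on_load(self, addr, predictor):
        if addr in self.queue:  # the prediction proved correct
            self.queue.remove(addr)
            self.confidence = min(3, self.confidence + 1)
            self.allocate(addr, predictor)
        else:
            self.confidence = max(0, self.confidence - 1)

# Toy pointer-chasing stream: each address "stores" the next one.
chain = {100: 340, 340: 720, 720: 1160}
predictor = lambda a: chain.get(a, a + 64)
sb = StreamBuffer()
sb.allocate(100, predictor)
sb.on_load(340, predictor)      # hit: confidence rises, stream refills
print(sb.confidence, sb.queue)  # 1 [720]
```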


High-Performance Computer Architecture | 2003

Catching accurate profiles in hardware

Satish Narayanasamy; Timothy Sherwood; Suleyman Sair; Brad Calder; George Varghese

Run-time optimization is one of the most important ways of getting performance out of modern processors. Techniques such as prefetching, trace caching, and memory disambiguation are all based upon the principle of observation followed by adaptation, and all make use of some sort of profile information gathered at run-time. Programs are very complex, and the real trick in generating useful run-time profiles is sifting through all the unimportant and infrequently occurring events to find those that are important enough to warrant optimization. In this paper, we present the multi-hash architecture to catch important events even in the presence of extensive noise. Multi-hash uses a small amount of area, between 7 and 16 kilobytes, to accurately capture these important events in hardware, without requiring any software support. This is achieved using multiple hash tables for the filtering, and interval-based profiling to help identify how important an event is in relation to all the other events. We evaluate our design for value and edge profiling, and show that over a set of benchmarks, we get an average error of less than 1%.
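The multi-hash filter resembles what is now called a count-min sketch: several salted hash tables of counters, where the minimum count across tables upper-bounds an event's true frequency, and interval resets keep the profile current. The table count, sizes, and the demo data below are assumptions, not the paper's configuration.

```python
# Sketch of multi-hash event filtering; sizes are illustrative.

class MultiHash:
    def __init__(self, tables=4, size=256):
        self.counts = [[0] * size for _ in range(tables)]
        self.size = size

    def _slots(self, event):
        # A different salt per table spreads collisions around.
        return [hash((t, event)) % self.size for t in range(len(self.counts))]

    def record(self, event):
        for t, s in enumerate(self._slots(event)):
            self.counts[t][s] += 1

    def estimate(self, event):
        # Minimum across tables bounds the true count from above.
        return min(self.counts[t][s] for t, s in enumerate(self._slots(event)))

    def end_interval(self):
        for table in self.counts:
            table[:] = [0] * self.size  # interval-based profiling reset

mh = MultiHash()
for v in [7] * 90 + list(range(30)):  # one hot value amid noise
    mh.record(v)
print(mh.estimate(7) >= 90)  # True: the important event survives filtering
```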


High-Performance Computer Architecture | 2002

Quantifying load stream behavior

Suleyman Sair; Timothy Sherwood; Brad Calder

The increasing performance gap between processors and memory will force future architectures to devote significant resources towards removing and hiding memory latency. The two major architectural features used to address this growing gap are caches and prefetching. In this paper we perform a detailed quantification of the cache miss patterns for the Olden benchmarks, SPEC 2000 benchmarks, and a collection of pointer-based applications. We classify misses into one of four categories corresponding to the type of access pattern. These are next-line, stride, same-object (additional misses that occur to a recently accessed object), or pointer-based transitions. We then propose and evaluate a hardware profiling architecture to correctly identify which type of access pattern is being seen. This access pattern identification could be used to help guide and allocate prefetching resources, and provide information to feedback-directed optimizations. A second goal of this paper is to identify a suite of challenging pointer-based benchmarks that can be used to focus the development of new software and hardware prefetching algorithms, and to identify the challenges in performing prefetching for these applications using new metrics.
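A toy version of the four-way miss classification might look like the sketch below, which labels each miss as next-line, stride, same-object, or pointer-based from the previous miss address, a set of observed strides, and a set of addresses produced by pointer loads. The 64-byte block, 256-byte object window, and the trace are invented simplifications.

```python
# Illustrative miss-pattern classifier; window sizes are assumptions.

BLOCK, OBJ_WINDOW = 64, 256  # assumed block size and object span

def classify_miss(addr, last_miss, stride_hist, heap_targets):
    if last_miss is not None:
        delta = addr - last_miss
        if 0 < delta <= BLOCK:
            return "next-line"
        if delta != 0 and delta in stride_hist:
            return "stride"       # repeats a recently observed stride
        if abs(delta) <= OBJ_WINDOW:
            return "same-object"  # another miss to a recent object
    if addr in heap_targets:
        return "pointer"          # address came from a pointer load
    return "other"

# Toy trace: a strided stream followed by a pointer transition.
misses = [0, 256, 512, 0x9000]
heap_targets = {0x9000}  # addresses that were loaded from memory
last, strides = None, set()
for a in misses:
    print(hex(a), classify_miss(a, last, strides, heap_targets))
    if last is not None:
        strides.add(a - last)
    last = a
```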


International Conference on Hardware/Software Codesign and System Synthesis | 2005

Designing real-time H.264 decoders with dataflow architectures

Suleyman Sair; Youngsoo Richard Kim

High performance microprocessors are designed with general-purpose applications in mind. When it comes to embedded applications, these architectures typically perform control-intensive tasks in a System-on-Chip (SoC) design. But they are significantly inefficient for data-intensive tasks such as video encoding/decoding. Although configurable processors fill this gap by complementing the existing functional units with instruction extensions, their performance lags behind the needs of real-time embedded tasks. In this paper, we evaluate the performance potential of a dataflow processor for H.264 video decoding. We first profile the H.264 application to capture the amount of data traffic among modules. We use this information to guide the placement of H.264 modules in the WaveScalar dataflow architecture. A simulated annealing based placement algorithm produces the final placement, aiming to optimize the communication costs between the modules in the dataflow architecture. In addition to outperforming contemporary embedded and customized processors, our simulated annealing guided design shows a speedup of 13% in execution time over the original WaveScalar architecture. With our dataflow design methodology, emerging embedded applications requiring several GOPS to meet real-time constraints can be designed within a reasonable amount of time.
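To show the flavor of the placement step, here is a small simulated-annealing sketch that assigns modules to grid tiles so that heavily communicating pairs land close together. The module names, traffic weights, 2x2 grid, and cooling schedule are all invented; the paper's cost model over the WaveScalar fabric is more detailed.

```python
# Toy simulated-annealing placement; traffic and schedule are invented.
import math, random

def cost(place, traffic):
    # Traffic-weighted Manhattan distance between communicating modules.
    return sum(t * (abs(place[a][0] - place[b][0]) +
                    abs(place[a][1] - place[b][1]))
               for (a, b), t in traffic.items())

def anneal(modules, tiles, traffic, steps=5000, t0=10.0, seed=0):
    rng = random.Random(seed)
    place = dict(zip(modules, rng.sample(tiles, len(modules))))
    cur = cost(place, traffic)
    best, best_cost = dict(place), cur
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-6    # linear cooling
        a, b = rng.sample(modules, 2)
        place[a], place[b] = place[b], place[a]  # propose a swap
        new = cost(place, traffic)
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            cur = new  # accept (always downhill, sometimes uphill)
            if cur < best_cost:
                best, best_cost = dict(place), cur
        else:
            place[a], place[b] = place[b], place[a]  # reject, undo
    return best, best_cost

mods = ["entropy", "idct", "mc", "deblock"]  # hypothetical decoder modules
tiles = [(x, y) for x in range(2) for y in range(2)]
traffic = {("entropy", "idct"): 9, ("idct", "mc"): 7,
           ("mc", "deblock"): 8, ("entropy", "deblock"): 1}
print(anneal(mods, tiles, traffic))
```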


International Conference on Computer Design | 2005

Reducing the latency and area cost of core swapping through shared helper engines

Anahita Shayesteh; Eren Kursun; Timothy Sherwood; Suleyman Sair; Glenn Reinman

Technology scaling trends and the limitations of packaging and cooling have intensified the need for thermally efficient architectures and architecture-level temperature management techniques. To combat these trends, we explore the use of core swapping on the microcore architecture, a deeply decoupled processor core with larger structures factored out as helper engines. The microcore architecture presents an ideal platform for core swapping thanks to helper engines that maintain the state of each process in a shared fabric surrounding the cores, reducing the impact of core swapping by 43% on average while showing promising thermal reduction. It also has favorable performance when compared to other thermal management techniques. Furthermore, we evaluate alternative ways to spend the area overhead of the additional microcore, including larger microcores, CMP cores, and SMT cores with different thermal management techniques.


PACS'04: Proceedings of the 4th International Conference on Power-Aware Computer Systems | 2004

Low-overhead core swapping for thermal management

Eren Kursun; Glenn Reinman; Suleyman Sair; Anahita Shayesteh; Timothy Sherwood

Technology scaling trends and the limitations of packaging and cooling have intensified the need for thermally efficient architectures and architecture-level temperature management techniques. To combat these trends, we evaluate the thermal efficiency of the microcore architecture, a deeply decoupled processor core with larger structures factored out as helper engines. We further investigate activity migration (core swapping) as a means of controlling the thermal profile of the chip in this study. Specifically, the microcore architecture presents an ideal platform for core swapping thanks to helper engines that maintain the state of each process in a shared fabric surrounding the cores. This results in significantly reduced migration overhead, enabling seamless swapping of cores. Our results show that our thermal mechanisms outperform traditional Dynamic Thermal Management (DTM) techniques by reducing the performance hit caused by slowing/swapping of cores. Our experimental results show that the microcore architecture has 86% fewer thermally critical cycles compared to a conventional monolithic core.
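A toy model of temperature-triggered core swapping is sketched below: execution migrates to the cooler core whenever the active one crosses a trip point, which is cheap here because, as in the microcore design, architectural state is assumed to live in the shared helper fabric. The thresholds and linear heating/cooling rates are invented, not measured.

```python
# Toy core-swapping thermal controller; all constants are assumptions.

SWAP_TEMP, SAFE_TEMP, AMBIENT = 85.0, 70.0, 45.0  # degrees C (invented)

def run(intervals, heat=2.0, cool=3.0):
    temps = [AMBIENT, AMBIENT]  # per-core temperature estimates
    active, swaps = 0, 0
    for _ in range(intervals):
        temps[active] += heat                           # active core heats
        idle = 1 - active
        temps[idle] = max(AMBIENT, temps[idle] - cool)  # idle core cools
        if temps[active] >= SWAP_TEMP and temps[idle] <= SAFE_TEMP:
            active, swaps = idle, swaps + 1  # migrate execution
    return swaps, temps

print(run(200))  # e.g. (9, [...]): periodic swaps keep both cores below trip
```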


ACM SIGARCH Computer Architecture News | 2005

Dynamically configurable shared CMP helper engines for improved performance

Anahita Shayesteh; Glenn Reinman; Norman P. Jouppi; Suleyman Sair; Timothy Sherwood

Technology scaling trends have forced designers to consider alternatives to deeply pipelining aggressive cores with large amounts of performance-accelerating hardware. One alternative is a small, simple core that can be augmented with latency-tolerant helper engines. As the demands placed on the processor core vary between applications, and even between phases of an application, the benefit seen from any set of helper engines will vary tremendously. If there is a single core, these auxiliary structures can be turned on and off dynamically to tune the energy/performance of the machine to the needs of the running application. As more of the processor is broken down into helper engines, and as we add more and more cores onto a single chip which can potentially share helpers, the decisions that are made about these structures become increasingly important. In this paper we describe the need for methods that effectively manage these helper engines. Our counter-based approach can dynamically turn off 3 helpers on average, while staying within 2% of the performance when running with all helpers. In a multicore environment, our intelligent and flexible sharing of helper engines provides an average 24% speedup over static sharing in conjoined cores. Furthermore, we show benefit from constructively sharing helper engines among multiple cores running the same application.
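As a rough sketch of a counter-based management policy, the Python below credits each helper engine when its work actually helps the core (for example, a useful prefetch) and power-gates helpers that fall below a per-interval usefulness threshold. The helper names and the threshold are hypothetical; the paper's counters and policies are hardware mechanisms.

```python
# Sketch of counter-based helper-engine gating; names/threshold assumed.

THRESHOLD = 100  # useful events per interval needed to stay on (assumed)

class HelperManager:
    def __init__(self, helpers):
        self.useful = {h: 0 for h in helpers}
        self.enabled = {h: True for h in helpers}

    def credit(self, helper):
        """Called when a helper's work actually benefited the core."""
        if self.enabled[helper]:
            self.useful[helper] += 1

    def end_interval(self):
        for h, n in self.useful.items():
            self.enabled[h] = n >= THRESHOLD  # gate unhelpful helpers off
            self.useful[h] = 0
        # A real policy would periodically re-enable all helpers
        # to re-sample demand as the application changes phase.

mgr = HelperManager(["prefetcher", "value-pred", "branch-helper"])
for _ in range(250):
    mgr.credit("prefetcher")  # only the prefetcher earns credit
mgr.end_interval()
print(mgr.enabled)  # prefetcher stays on, the others are gated off
```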

Collaboration


Dive into Suleyman Sair's collaborations.

Top Co-Authors

Brad Calder (University of California)

Glenn Reinman (University of California)

Eren Kursun (University of California)

Erez Perelman (University of California)