M. Aater Suleman
University of Texas at Austin
Publications
Featured research published by M. Aater Suleman.
architectural support for programming languages and operating systems | 2009
M. Aater Suleman; Onur Mutlu; Moinuddin K. Qureshi; Yale N. Patt
Contention for critical sections can reduce performance and scalability by causing thread serialization. The proposed Accelerated Critical Sections (ACS) mechanism mitigates this limitation: ACS executes critical sections on the high-performance core of an asymmetric chip multiprocessor (ACMP), which can execute them faster than the smaller cores can.
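ACS itself is a hardware mechanism, but its core idea has a well-known software analogy: instead of every thread acquiring a lock, critical sections are shipped to one dedicated server thread (standing in for the large core), which executes them serially. The sketch below is that analogy only, with hypothetical names; it is not the paper's hardware design.

```python
import threading, queue

# Software analogy of ACS: critical sections are shipped to a single
# dedicated server thread (the "big core"), which runs them one at a
# time, so the shared data stays in one place and no lock is needed.
class CriticalSectionServer:
    def __init__(self):
        self.requests = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            fn, done, result = self.requests.get()
            if fn is None:          # shutdown sentinel
                break
            result.append(fn())     # execute the critical section
            done.set()

    def execute(self, fn):
        done, result = threading.Event(), []
        self.requests.put((fn, done, result))
        done.wait()                 # requester stalls until the section retires
        return result[0]

    def shutdown(self):
        self.requests.put((None, None, None))
        self.worker.join()

# Usage: four threads funnel increments of a shared counter through the server.
counter = [0]
server = CriticalSectionServer()

def increment():
    counter[0] += 1
    return counter[0]

threads = [threading.Thread(
               target=lambda: [server.execute(increment) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
server.shutdown()
print(counter[0])  # 4000: all increments were serialized at one thread
```

Because the data touched by the critical section stays resident near the serving thread, this also captures ACS's secondary benefit of reduced shared-data ping-ponging.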
architectural support for programming languages and operating systems | 2008
M. Aater Suleman; Moinuddin K. Qureshi; Yale N. Patt
Extracting high performance from the emerging Chip Multiprocessors (CMPs) requires that the application be divided into multiple threads. Each thread executes on a separate core, thereby increasing concurrency and improving performance. As the number of cores on a CMP continues to increase, the performance of some multi-threaded applications will benefit from the increased number of threads, whereas the performance of other multi-threaded applications will become limited by data-synchronization and off-chip bandwidth. For applications that are limited by data-synchronization, increasing the number of threads significantly degrades performance and increases on-chip power. Similarly, for applications that are limited by off-chip bandwidth, increasing the number of threads increases on-chip power without providing any performance improvement. Furthermore, whether an application is limited by data-synchronization, by bandwidth, or by neither depends not only on the application but also on the input set and the machine configuration. Therefore, controlling the number of threads based on the run-time behavior of the application can significantly improve performance and reduce power. This paper proposes Feedback-Driven Threading (FDT), a framework to dynamically control the number of threads using run-time information. FDT can be used to implement Synchronization-Aware Threading (SAT), which predicts the optimal number of threads depending on the amount of data-synchronization. Our evaluation shows that SAT can reduce both execution time and power by up to 66% and 78%, respectively. Similarly, FDT can be used to implement Bandwidth-Aware Threading (BAT), which predicts the minimum number of threads required to saturate the off-chip bus. Our evaluation shows that BAT reduces on-chip power by up to 78%. When SAT and BAT are combined, the average execution time reduces by 17% and power reduces by 59%.
The proposed techniques leverage existing performance counters and require minimal support from the threading library.
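The two predictors above can be sketched with a toy performance model; this is an assumption-laden simplification for illustration, not the paper's exact estimators. It assumes a run with a few threads reveals per-iteration parallel work P, per-iteration critical-section time C, and per-thread bandwidth demand, all in hypothetical units.

```python
import math

# Toy SAT sketch: if total time with N threads behaves roughly like
# P/N (the parallel part shrinks) + C*N (serialized critical sections
# queue up), the minimum is at N = sqrt(P/C).
def predict_threads_sat(parallel_cycles, critical_cycles, max_threads):
    if critical_cycles == 0:
        return max_threads
    best = int(round(math.sqrt(parallel_cycles / critical_cycles)))
    return max(1, min(best, max_threads))

# Toy BAT sketch: stop adding threads once the off-chip bus saturates.
# Per-thread demand and bus capacity are hypothetical counter readings.
def predict_threads_bat(bw_per_thread, bus_bandwidth, max_threads):
    if bw_per_thread == 0:
        return max_threads
    return max(1, min(bus_bandwidth // bw_per_thread, max_threads))

print(predict_threads_sat(10000, 100, 32))  # sync-limited: 10 threads
print(predict_threads_bat(2, 16, 32))       # bandwidth-limited: 8 threads
```

Both predictions come from a short profiled "train" phase of the same run, which is what lets FDT adapt to the input set and machine without offline profiling.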
international conference on parallel architectures and compilation techniques | 2010
M. Aater Suleman; Moinuddin K. Qureshi; Khubaib; Yale N. Patt
Extracting high performance from Chip Multiprocessors requires that the application be parallelized. A common software technique to parallelize loops is pipeline parallelism, in which the programmer/compiler splits each loop iteration into stages and each stage runs on a certain number of cores. It is important to choose the number of cores for each stage carefully because the core-to-stage allocation determines performance and power consumption. Finding the best core-to-stage allocation for an application is challenging because the number of possible allocations is large, and the best allocation depends on the input set and machine configuration. This paper proposes Feedback-Directed Pipelining (FDP), a software framework that chooses the core-to-stage allocation at run-time. FDP first maximizes the performance of the workload and then saves power by reducing the number of active cores, without impacting performance. Our evaluation on a real SMP system with two Core2Quad processors (8 cores) shows that FDP provides an average speedup of 4.2x, which is significantly higher than the 2.3x speedup obtained with a practical profile-based allocation. We also show that FDP is robust to changes in machine configuration and input set.
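The performance-maximizing half of this idea can be sketched as a greedy loop: repeatedly measure which stage limits throughput and give it another core. The stage weights and core count below are made-up numbers, and this is a simplification of FDP's feedback loop, not the paper's exact algorithm.

```python
# Greedy core-to-stage allocation: a pipeline runs at the pace of its
# slowest stage (the limiter), so each spare core goes to whichever
# stage currently has the largest effective per-item time.
def allocate_cores(stage_work, total_cores):
    """stage_work[i] = time one core needs per item for stage i."""
    alloc = [1] * len(stage_work)          # every stage needs >= 1 core
    spare = total_cores - len(stage_work)
    for _ in range(spare):
        times = [w / c for w, c in zip(stage_work, alloc)]
        alloc[times.index(max(times))] += 1
    return alloc

# A 3-stage loop where the middle stage is 4x heavier, on 8 cores:
print(allocate_cores([1.0, 4.0, 1.0], 8))  # [2, 5, 1]
```

In FDP proper, the "work per stage" numbers are not known up front; they come from run-time measurements, which is what makes the allocation robust to input-set and machine changes.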
international symposium on computer architecture | 2013
José A. Joao; M. Aater Suleman; Onur Mutlu; Yale N. Patt
Asymmetric Chip Multiprocessors (ACMPs) are becoming a reality. ACMPs can speed up parallel applications if they can identify and accelerate code segments that are critical for performance. Proposals already exist for using coarse-grained thread scheduling and fine-grained bottleneck acceleration. Unfortunately, there have been no proposals offered thus far to decide which code segments to accelerate in cases where both coarse-grained thread scheduling and fine-grained bottleneck acceleration could have value. This paper proposes Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs (UBA), a cooperative software/hardware mechanism for identifying and accelerating the most likely critical code segments from a set of multithreaded applications running on an ACMP. The key idea is a new Utility of Acceleration metric that quantifies the performance benefit of accelerating a bottleneck or a thread by taking into account both the criticality and the expected speedup. UBA outperforms the best of two state-of-the-art mechanisms by 11% for single application workloads and by 7% for two-application workloads on an ACMP with 52 small cores and 3 large cores.
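The heart of UBA is ranking acceleration candidates by a single utility score. The sketch below is a hypothetical illustration of that idea only; the actual Utility of Acceleration metric in the paper is more detailed, and the candidate names and numbers here are invented.

```python
# Hypothetical utility score: fold together criticality (the fraction of
# total execution the segment gates) and the expected speedup from
# running it on a large core, as critical-path time saved.
def utility(criticality, speedup):
    return criticality * (1.0 - 1.0 / speedup)

candidates = {
    "lock-protected queue": (0.30, 2.0),   # (criticality, expected speedup)
    "lagging thread":       (0.50, 1.3),
    "barrier straggler":    (0.20, 3.0),
}
best = max(candidates, key=lambda k: utility(*candidates[k]))
print(best)  # "lock-protected queue": 0.30 * 0.5 = 0.15 beats both others
```

The point of a unified score is visible even in this toy: the most critical segment (the lagging thread) is not the best candidate, because it barely speeds up on the large core.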
international symposium on computer architecture | 2010
M. Aater Suleman; Onur Mutlu; José A. Joao; Khubaib; Yale N. Patt
Previous research has shown that Staged Execution (SE), i.e., dividing a program into segments and executing each segment at the core that has the data and/or functionality to best run that segment, can improve performance and save power. However, SE's benefit is limited because most segments access inter-segment data, i.e., data generated by the previous segment. When consecutive segments run on different cores, accesses to inter-segment data incur cache misses, thereby reducing performance. This paper proposes Data Marshaling (DM), a new technique to eliminate cache misses to inter-segment data. DM uses profiling to identify instructions that generate inter-segment data, and adds only 96 bytes/core of storage overhead. We show that DM significantly improves the performance of two promising Staged Execution models, Accelerated Critical Sections and producer-consumer pipeline parallelism, on both homogeneous and heterogeneous multi-core systems. In both models, DM can achieve almost all of the potential of ideally eliminating cache misses to inter-segment data. DM's performance benefit increases with the number of cores.
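The effect DM targets can be shown with a toy two-core cache model (hypothetical line addresses and a trivially small "cache"; this models only the miss-counting idea, not the hardware): profiling marks the stores that generate inter-segment data, and when a segment ends, those lines are pushed to the core that runs the next segment.

```python
# Toy model of inter-segment misses: each core's cache is a set of line
# addresses; a read misses if the line is not already resident.
def run_segment(writes, reads, cache):
    misses = sum(1 for addr in reads if addr not in cache)
    cache.update(reads)
    cache.update(writes)
    return misses

def pipeline(marshal):
    core0, core1 = set(), set()
    inter_segment = {0x100, 0x140, 0x180}   # lines segment A produces
    run_segment(writes=inter_segment, reads=set(), cache=core0)
    if marshal:
        core1.update(inter_segment)         # DM pushes generated lines ahead
    return run_segment(writes=set(), reads=inter_segment, cache=core1)

print(pipeline(marshal=False))  # 3 misses on inter-segment data
print(pipeline(marshal=True))   # 0 misses: consumer hits on pushed lines
```

In the real mechanism the "push" happens in hardware when a segment finishes, guided by a small per-core table of generator instructions (the 96 bytes/core mentioned above).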
international symposium on microarchitecture | 2010
M. Aater Suleman; Onur Mutlu; Moinuddin K. Qureshi; Yale N. Patt
symposium on code generation and optimization | 2006
Hyesoon Kim; M. Aater Suleman; Onur Mutlu; Yale N. Patt
Static compilers use profiling to predict run-time program behavior. Generally, this requires multiple input sets to capture wide variations in run-time behavior. This is expensive in terms of resources and compilation time. We introduce a new mechanism, 2D-profiling, which profiles with only one input set and predicts whether the result of the profile would change significantly across multiple input sets. We use 2D-profiling to predict whether a branch's prediction accuracy varies across input sets. The key insight is that if the prediction accuracy of an individual branch varies significantly over a profiling run with one input set, then it is more likely that the prediction accuracy of that branch varies across input sets. We evaluate 2D-profiling with the SPEC CPU 2000 integer benchmarks and show that it can identify input-dependent branches accurately.
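The key insight lends itself to a short sketch: record each branch's prediction accuracy over time slices of a single profiling run, and flag branches whose accuracy swings between slices. The interval data and threshold below are made-up, and this is only the classification idea, not the paper's exact metrics.

```python
import statistics

# A branch whose accuracy varies a lot across time slices of ONE run is
# predicted to also vary across DIFFERENT input sets (input-dependent).
def is_input_dependent(per_interval_accuracy, stdev_threshold=0.05):
    return statistics.pstdev(per_interval_accuracy) > stdev_threshold

stable_branch = [0.97, 0.96, 0.97, 0.98, 0.97]   # steady within the run
phased_branch = [0.99, 0.60, 0.95, 0.55, 0.98]   # swings between phases

print(is_input_dependent(stable_branch))  # False
print(is_input_dependent(phased_branch))  # True
```

This is the "2D" in 2D-profiling: accuracy is examined along a time axis within the run, not just as one run-wide average, which would hide the variation entirely (the phased branch still averages about 0.81).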
international conference on ic design and technology | 2009
Baker Mohammad; Muhammad Tauseef Rab; Khadir Mohammad; M. Aater Suleman
Dynamic cache resizing coupled with Built-In Self Test (BIST) is proposed to enhance the yield of SRAM-based cache memory. BIST is run as part of the power-up sequence to identify faulty memory addresses. Logic is added to prevent access to the identified locations, effectively reducing the cache size. The cache-resizing approach can tolerate as many faulty locations as the end user would like, at the cost of performance. Long-term reliability effects on memory, such as pMOS NBTI, are also compensated for by re-running BIST under the cache-resizing architecture, thereby detecting faults introduced over time. Since memory soft failures are worst at lower-voltage operation, dynamic cache resizing can also be used to trade off power against performance. This approach supplements existing design-time optimizations and adaptive design techniques used to enhance memory yield. The performance loss incurred due to the cache reduction is determined to be within 1%.
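The access-blocking logic can be sketched with a toy direct-mapped cache (hypothetical geometry and addresses; the real mechanism operates on SRAM arrays, not a Python dict): power-up BIST yields a list of faulty set indices, and those sets are simply treated as always-miss, shrinking the effective cache instead of discarding the die.

```python
# Toy direct-mapped cache with BIST-driven resizing: accesses that index
# into a faulty set bypass the cache (forced miss) rather than risk
# reading a defective cell.
class ResizableCache:
    def __init__(self, num_sets, faulty_sets):
        self.num_sets = num_sets
        self.faulty = set(faulty_sets)      # from the power-up BIST pass
        self.tags = {}

    def access(self, addr):
        index = addr % self.num_sets
        if index in self.faulty:
            return False                    # disabled set: forced miss
        hit = self.tags.get(index) == addr
        self.tags[index] = addr             # install/refresh the line
        return hit

cache = ResizableCache(num_sets=8, faulty_sets=[3])
print(cache.access(3))   # miss: set 3 is faulty and disabled
print(cache.access(5))   # cold miss, line installed in a healthy set
print(cache.access(5))   # hit on the healthy set
```

Re-running BIST later and enlarging `faulty_sets` is the same idea the abstract uses to absorb faults that appear over the part's lifetime, such as NBTI-induced failures.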
architectural support for programming languages and operating systems | 2012
José A. Joao; M. Aater Suleman; Onur Mutlu; Yale N. Patt
international symposium on microarchitecture | 2012
Khubaib; M. Aater Suleman; Milad Hashemi; Chris Wilkerson; Yale N. Patt