Antonia Zhai | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Antonia Zhai is active.

Explore More

Publication

Featured researches published by Antonia Zhai.

international symposium on computer architecture | 2000

A scalable approach to thread-level speculation

J. Greggory Steffan; Christopher B. Colohan; Antonia Zhai; Todd C. Mowry

While architects understand how to build cost-effective parallel machines across a wide spectrum of machine sizes (ranging from within a single chip to large-scale servers), the real challenge is how to easily create parallel software to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this paper we propose and evaluate a design for supporting TLS that seamlessly scales to any machine size because it is a straightforward extension of writeback invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on both single-chip multiprocessors and on larger-scale machines where communication latencies are twenty times larger.

ACM Transactions on Computer Systems | 2005

The STAMPede approach to thread-level speculation

J. Gregory Steffan; Christopher B. Colohan; Antonia Zhai; Todd C. Mowry

Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multiprogrammed workload, the real challenge is how to easily create parallel software to allow single programs to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this article, we propose and evaluate a design for supporting TLS that seamlessly scales both within a chip and beyond because it is a straightforward extension of write-back invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on single-chip multiprocessors where the first level caches are either private or shared. For our private-cache design, the program performance of two of 13 general purpose applications studied improves by 86% and 56%, four others by more than 8%, and an average across all applications of 16%---confirming that TLS is a promising way to exploit the naturally-multithreaded processing resources of future computer systems.

architectural support for programming languages and operating systems | 2002

Compiler optimization of scalar value communication between speculative threads

Antonia Zhai; Christopher B. Colohan; J. Gregory Steffan; Todd C. Mowry

While there have been many recent proposals for hardware that supports Thread-Level Speculation (TLS), there has been relatively little work on compiler optimizations to fully exploit this potential for parallelizing programs optimistically. In this paper, we focus on one important limitation of program performance under TLS, which is stalls due to forwarding scalar values between threads that would otherwise cause frequent data dependences. We present and evaluate dataflow algorithms for three increasingly-aggressive instruction scheduling techniques that reduce the critical forwarding path introduced by the synchronization associated with this data forwarding. In addition, we contrast our compiler techniques with related hardware-only approaches. With our most aggressive compiler and hardware techniques, we improve performance under TLS by 6.2-28.5% for 6 of 14 applications, and by at least 2.7% for half of the other applications.

high-performance computer architecture | 2002

Improving value communication for thread-level speculation

J.G. Steffan; Christopher B. Colohan; Antonia Zhai; Todd C. Mowry

Thread-level speculation (TLS) allows us to automatically parallelize general-purpose programs by supporting parallel execution of threads that might not actually be independent. In this paper, we show that the key to good performance ties in the three different ways to communicate a value between speculative threads: speculation, synchronization and prediction. The difficult part is deciding how and when to apply each method. This paper shows how we can apply value prediction, dynamic synchronization and hardware instruction prioritization to improve value communication and hence performance in several SPECint benchmarks that have been automatically transformed by our compiler to exploit TLS. We find that value prediction can be effective when properly throttled to avoid the high costs of mis-prediction, while most of the gains of value prediction can be more easily achieved by exploiting silent stores. We also show that dynamic synchronization is quite effective for most benchmarks, while hardware instruction prioritization is not. Overall, we find that these techniques have great potential for improving the performance of TLS.

international symposium on computer architecture | 2013

Triggered instructions: a control paradigm for spatially-programmed architectures

Angshuman Parashar; Michael Pellauer; Michael Adler; Bushra Ahsan; Neal Clayton Crago; Daniel Lustig; Vladimir Pavlov; Antonia Zhai; Mohit Gambhir; Aamer Jaleel; Randy L. Allmon; Rachid Rayess; Stephen Maresh; Joel S. Emer

In this paper, we present triggered instructions, a novel control paradigm for arrays of processing elements (PEs) aimed at exploiting spatial parallelism. Triggered instructions completely eliminate the program counter and allow programs to transition concisely between states without explicit branch instructions. They also allow efficient reactivity to inter-PE communication traffic. The approach provides a unified mechanism to avoid over-serialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading, which each require distinct hardware mechanisms in a traditional sequential architecture. Our analysis shows that a triggered-instruction based spatial accelerator can achieve 8X greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64% respectively over a program-counter style spatial baseline, resulting in a speedup of 2.0X.

design, automation, and test in europe | 2011

Enabling improved power management in multicore processors through clustered DVFS

Tejaswini Kolpe; Antonia Zhai; Sachin S. Sapatnekar

In recent years, chip multiprocessors (CMP) have emerged as a solution for high-speed computing demands. However, power dissipation in CMPs can be high if numerous cores are simultaneously active. Dynamic voltage and frequency scaling (DVFS) is widely used to reduce the active power, but its effectiveness and cost depends on the granularity at which it is applied. Per-core DVFS allows the greatest flexibility in controlling power, but incurs the expense of an unrealistically large number of on-chip voltage regulators. Per-chip DVFS, where all cores are controlled by a single regulator overcomes this problem at the expense of greatly reduced flexibility. This work considers the problem of building an intermediate solution, clustering the cores of a multicore processor into DVFS domains and implementing DVFS on a per-cluster basis. Based on a typical workload, we propose a scheme to find similarity among the cores and cluster them based on this similarity. We also provide an algorithm to implement DVFS for the clusters, and evaluate the effectiveness of per-cluster DVFS in power reduction.

languages and compilers for parallel computing | 2005

Loop selection for thread-level speculation

Shengyue Wang; Xiaoru Dai; Kiran Yellajyosula; Antonia Zhai; Pen Chung Yew

Thread-level speculation (TLS) allows potentially dependent threads to speculatively execute in parallel, thus making it easier for the compiler to extract parallel threads. However, the high cost associated with unbalanced load, failed speculation, and inter-thread value communication makes it difficult to obtain the desired performance unless the speculative threads are carefully chosen. In this paper, we focus on extracting parallel threads from loops in general-purpose applications because loops, with their regular structures and significant coverage on execution time, are ideal candidates for extracting parallel threads. General-purpose applications, however, usually contain a large number of nested loops with unpredictable parallel performance and dynamic behavior, thus making it difficult to decide which set of loops should be parallelized to improve overall program performance. Our proposed loop selection algorithm addresses all these difficulties. We have found that (i) with the aid of profiling information, compiler analyses can achieve a reasonably accurate estimation of the performance of parallel execution, and that (ii) different invocations of a loop may behave differently, and exploiting this dynamic behavior can further improve performance. With a judicious choice of loops, we can improve the overall program performance of SPEC2000 integer benchmarks by as much as 20%.

symposium on code generation and optimization | 2004

Compiler optimization of memory-resident value communication between speculative threads

Antonia Zhai; Christopher B. Colohan; J. Gregory Steffan; Todd C. Mowry

Efficient inter-thread value communication is essential for improving performance in thread-level speculation (TLS). Although several mechanisms for improving value communication using hardware support have been proposed, there is relatively little work on exploiting the potential of compiler optimization. Building on recent research on compiler optimization of scalar value communication between speculative threads, we propose compiler techniques for the optimization of memory-resident values. In TLS, data dependences through memory-resident values are tracked by the underlying hardware and preserved by re-executing any speculative thread that violates a dependence; however, re-execution incurs a large performance penalty and should be used only to resolve data dependences that are infrequent. In contrast, value communication for frequently-occurring data dependences must be very efficient. We propose using the compiler to first identify frequently-occurring memory-resident data dependences, then insert synchronization for communicating values to preserve these dependences. We find that by synchronizing frequently-occurring data dependences we can significantly improve the efficiency of parallel execution. A comparison between compiler-inserted and hardware-inserted memory synchronization reveals that the two techniques are complementary, with each technique benefitting different benchmarks.

international symposium on performance analysis of systems and software | 2009

Exploring speculative parallelism in SPEC2006

Venkatesan Packirisamy; Antonia Zhai; Wei-Chung Hsu; Pen Chung Yew; Tin Fook Ngai

The computer industry has adopted multi-threaded and multi-core architectures as the clock rate increase stalled in early 2000s. It was hoped that the continuous improvement of single-program performance could be achieved through these architectures. However, traditional parallelizing compilers often fail to effectively parallelize general-purpose applications which typically have complex control flow and excessive pointer usage. Recently hardware techniques such as Transactional Memory (TM) and Thread-Level Speculation (TLS) have been proposed to simplify the task of parallelization by using speculative threads. Potential of speculative parallelism in general-purpose applications like SPEC CPU 2000 have been well studied and shown to be moderately successful. Preliminary work examining the potential parallelism in SPEC2006 deployed parallel threads with a restrictive TLS execution model and limited compiler support, and thus only showed limited performance potential. In this paper, we first analyze the cross-iteration dependence behavior of SPEC 2006 benchmarks and show that more parallelism potential is available in SPEC 2006 benchmarks, comparing to SPEC2000. We further use a state-of-the-art profile-driven TLS compiler to identify loops that can be speculatively parallelized. Overall, we found that with optimal loop selection we can potentially achieve an average speedup of 60% on four cores over what could be achieved by a traditional parallelizing compiler such as Intels ICC compiler.We also found that an additional 11% improvement can be potentially obtained on selected benchmarks using 8 cores when we extend TLS on multiple loop levels as opposed to restricting to a single loop level.

international symposium on computer architecture | 2009

Dynamic performance tuning for speculative threads

Yangchun Luo; Venkatesan Packirisamy; Wei-Chung Hsu; Antonia Zhai; Nikhil Mungre; Ankit Tarkas

In response to the emergence of multicore processors, various novel and sophisticated execution models have been introduced to fully utilize these processors. One such execution model is Thread-Level Speculation (TLS), which allows potentially dependent threads to execute speculatively in parallel. While TLS offers significant performance potential for applications that are otherwise non-parallel, extracting efficient speculative threads in the presence of complex control flow and ambiguous data dependences is a real challenge. This task is further complicated by the fact that the performance of speculative threads is often architecture-dependent, input-sensitive, and exhibits phase behaviors. Thus we propose dynamic performance tuning mechanisms that determine where and how to create speculative threads at runtime. This paper describes the design, implementation, and evaluation of hardware and software support that takes advantage of runtime performance profiles to extract efficient speculative threads. In our proposed framework, speculative threads are monitored by hardware-based performance counters and their performance impact is estimated. The creation of speculative threads is adjusted based on the estimation. This paper proposes speculative threads performance estimation techniques, that are capable of correctly determining whether speculation can improve performance for loops that corresponds to 83.8% of total loop execution time across all benchmarks. This paper also examines several dynamic performance tuning policies and finds that the best tuning policy achieves an overall speedup of 36.8%on a set of benchmarks from SPEC2000 suite, which outperforms static thread management by 9.5%.

Explore More