Alain Kagi | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Alain Kagi is active.

Explore More

Publication

Featured researches published by Alain Kagi.

IEEE Computer | 2005

Intel virtualization technology

Rich Uhlig; Gil Neiger; Dion Rodgers; Amy L. Santoni; Fernando C. M. Martins; Andrew V. Anderson; Steven M. Bennett; Alain Kagi; Felix Leung; Larry Smith

A virtualized system includes a new layer of software, the virtual machine monitor. The VMMs principal role is to arbitrate accesses to the underlying physical host platforms resources so that multiple operating systems (which are guests of the VMM) can share them. The VMM presents to each guest OS a set of virtual platform interfaces that constitute a virtual machine (VM). Once confined to specialized, proprietary, high-end server and mainframe systems, virtualization is now becoming more broadly available and is supported in off-the-shelf systems based on Intel architecture (IA) hardware. This development is due in part to the steady performance improvements of IA-based systems, which mitigates traditional virtualization performance overheads. Intel virtualization technology provides hardware support for processor virtualization, enabling simplifications of virtual machine monitor software. Resulting VMMs can support a wider range of legacy and future operating systems while maintaining high performance.

international symposium on computer architecture | 1996

Memory Bandwidth Limitations of Future Microprocessors

Alain Kagi; James R. Goodman; Doug Burger

This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches---implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.

international symposium on computer architecture | 1997

Efficient Synchronization: Let Them Eat QOLB /sup1/

Alain Kagi; Doug Burger; James R. Goodman

Efficient synchronization primitives are essential for achieving high performance in fine-grain, shared-memory parallel programs. One function of synchronization primitives is to enable exclusive access to shared data and critical sections of code. This paper makes three contributions. (1) We enumerate the five sources of overhead that locking synchronization primitives can incur. (2) We describe four mechanisms (local spinning, queue-based locking, collocation, and synchronized prefetch) that reduce these synchronization overheads. (3) With detailed simulations, we show the extent to which these four mechanisms can improve the performance of shared-memory programs. We evaluate the space of these mechanisms using seventeen synchronization constructs, which are formed from six base typed of locks (TEST&SET, TEST&TEST&SET, MCS, LH, M, and QOLB). We show that large performance gains (speedups of more than 1.5 for three of five benchmarks) can be achieved if at least three optimizing mechanisms are used simultaneously. We find that QOLB, which incorporates all four mechanisms, outperforms all other primitives (including reactive synchronization) in all cases. Finally, we demonstrate the superior performance of a low-cost implementation of QOLB, which runs on an unmodified cluster of commodity workstations.

international symposium on microarchitecture | 1997

Limited bandwidth to affect processor design

Doug Burger; James R. Goodman; Alain Kagi

This paper quantifies and compares the performance impacts of memory latencies and finite bandwidth. We show that the implementation of aggressive latency tolerance techniques aggravates stalls due to finite memory bandwidth, which actually become more significant than stalls resulting from uncongested memory latency alone. We expect that memory bandwidth limitations across the processor pins will drive significant architectural change. An execution-driven simulation measures the time that several SPEC95 benchmarks spend stalled for memory latency, limited-memory bandwidth and computing.

high performance computer architecture | 2000

Improving the throughput of synchronization by insertion of delays

Ravi Rajwar; Alain Kagi; James R. Goodman

Efficiency of synchronization mechanisms can limit the parallel performance of many shared-memory applications. In addition, the ever increasing performance gap between processor and interprocessor communication may further compromise the scalability of these primitives. Ideally, synchronization primitives should provide high performance under both high and low contention without requiring substantial programmer effort and software support. QOLR has been shown to offer substantial speedups and to outperform other synchronization primitives consistently, but at the cost of software support and protocol complexity. This paper proposes the use of speculation and delays to implement a purely hardware-based queueing mechanism called Implicit QOLB. Making use of the pervasiveness of the Load-Linked/Store-Conditional primitives, we present a series of hardware mechanisms to optimize performance for sharing patterns exhibited by locks and associated data. The mechanisms do not require any change to existing software or instruction sets. IQOLB sits alongside the cache-coherence protocol and guides the decisions the protocol makes with respect to lock (and associated data) transfers. Preliminary evaluations indicate that IQOLB may perform as well as, if not better than, QOLB without the additional software and protocol complexity.

international conference on supercomputing | 1995

Techniques for reducing overheads of shared-memory multiprocessing

Alain Kagi; Nagi M. Aboulenein; Douglas C. Burger; James R. Goodman

The fine-grain nature of shared-memory multiprocessor communication introduces overheads that can be substantial. Using the Scalable Coherent Interface (SCI) as a base hardware platform and the SPLASH benchmark suite for applications, we analyze three techniques to reduce this overhead: (i) efficient synchronization primitives, and in particular a hardware primitive called QOLB; (ii) weakened memory ordering constraints; and (iii) optimization of the cache-coherence protocol for two nodes sharing data. We perform simulations both for current technology and technology that we anticipate will be available five years hence. We find that QOLB (of which this study performs the first detailed simulations) shows a large and consistent improvement, much larger than that predicted by Mellor-Crummey and Scott [19]. The relaxation of memory ordering constraints also provides a consistent performance improvement. In accordance with prior results, we show that a more aggressive memory model produces more substantial performance improvements. The optimization for twonode sharing shows mixed results, correlating unsurprisingly with the presence of that sharing pattern in an application. Our most important results are (i) that the overheads eliminated with these optimizations are largely orthogonal—the performance gains from supporting multiple optimizations concurrently are for the most part additive—and (ii) that technological improvements increase both these overheads and the success of the optimizations at reducing them.

international conference on supercomputing | 2003

Inferential queueing and speculative push for reducing critical communication latencies

Ravi Rajwar; Alain Kagi; James R. Goodman

Communication latencies within critical sections constitute a major bottleneck in some classes of emerging parallel workloads. In this paper, we argue for the use of Inferentially Queued Locks (IQLs) [31], not just for efficient synchronization but also for reducing communication latencies, and we propose a novel mechanism, Speculative Push (SP), aimed at reducing these communication latencies. With IQLs, the processor infers the existence, and limits, of a critical section from the use of synchronization instructions and joins a queue of lock requestors. The SP mechanism extracts information about program structure by observing IQLs. SP allows the cache controller, responding to a request for a cache line that likely includes a lock variable, to predict the data sets the requestor will modify within the associated critical section. The controller then pushes these lines from its own cache to the target cache, as well as writing them to memory. Overlapping the protected data transfer with that of the lock can substantially reduce the communication latencies within critical sections. By pushing data in exclusive state, the mechanism can collapse a read-modify-write sequences within a critical section into a single local cache access. The write-back to memory allows the receiving cache to ignore the push. Neither mechanism requires any programmer or compiler support nor any instruction set changes. Our experiments demonstrate that IQLs and SP can improve performance of applications employing frequent synchronization.

international conference on supercomputing | 2004

Inferential queueing and speculative push

Ravi Rajwar; Alain Kagi; James R. Goodman

Communication latencies within critical sections constitute a major bottleneck in some classes of emerging parallel workloads. In this paper, we argue for the use of two mechanisms to reduce these communication latencies: Inferentially Queued locks (IQLs) and Speculative Push (SP). With IQLs, the processor infers the existence, and limits, of a critical section from the use of synchronization instructions and joins a queue of lock requestors, reducing synchronization delay. The SP mechanism extracts information about program structure by observing IQLs. SP allows the cache controller, responding to a request for a cache line that likely includes a lock variable, to predict the data sets the requestor will modify within the associated critical section. The controller then pushes these lines from its own cache to the target cache, as well as writing them to memory. Overlapping the protected data transfer with that of the lock can substantially reduce the communication latencies within critical sections. By pushing data in exclusive state, the mechanism can collapse a read-modify-write sequences within a critical section into a single local cache access. The write-back to memory allows the receiving cache to ignore the push. Neither mechanism requires any programmer or compiler support nor any instruction set changes. Our experiments demonstrate that IQLs and SP can improve performance of applications employing frequent synchronization.

Archive | 2001