Publication


Featured research published by Beng-Hong Lim.


International Symposium on Computer Architecture | 1990

APRIL: a processor architecture for multiprocessing

Anant Agarwal; Beng-Hong Lim; David A. Kranz; John Kubiatowicz

Processors in large-scale multiprocessors must be able to tolerate large communication latencies and synchronization delays. This paper describes the architecture of a rapid-context-switching processor called APRIL with support for fine-grain threads and synchronization. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial RISC-based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles are described. Measurements taken for several parallel applications on an APRIL simulator show that the overhead for supporting parallel tasks based on futures is reduced by a factor of two over a corresponding implementation on the Encore Multimax. The scalability of a multiprocessor based on APRIL is explored using a performance model. We show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.
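The futures model used in these measurements can be pictured with a small sketch. The fragment below is a hypothetical pthread-based emulation written only to illustrate the programming model; on APRIL, futures are implemented with lightweight, rapidly context-switched hardware threads rather than OS threads, and the names future_make and future_touch are invented for the example.

#include <pthread.h>

/* Illustrative future: start a computation in parallel, touch it later. */
typedef struct {
    pthread_t  thread;
    long     (*fn)(long);
    long       arg;
    long       result;
} future_t;

static void *future_run(void *p) {
    future_t *f = p;
    f->result = f->fn(f->arg);   /* compute the future's value */
    return NULL;
}

/* Spawn fn(arg) to run concurrently with the caller. */
static void future_make(future_t *f, long (*fn)(long), long arg) {
    f->fn = fn;
    f->arg = arg;
    pthread_create(&f->thread, NULL, future_run, f);
}

/* Touching the future blocks until its value has been produced. */
static long future_touch(future_t *f) {
    pthread_join(f->thread, NULL);
    return f->result;
}

A divide-and-conquer program would create a future for one recursive call, compute the other directly, and touch the future where the two results are combined.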


International Symposium on Microarchitecture | 1993

Sparcle: an evolutionary processor design for large-scale multiprocessors

Anant Agarwal; John Kubiatowicz; David A. Kranz; Beng-Hong Lim; Donald Yeung; Godfrey D'Souza; Mike Parkin

The design of the Sparcle chip, which incorporates mechanisms required for massively parallel systems in a SPARC RISC core, is described. Coupled with a communications and memory management unit (CMMU), Sparcle allows a fast, 14-cycle context switch, an 8-cycle user-level message send, and fine-grain full/empty-bit synchronization. Sparcle's fine-grain computation, memory latency tolerance, and efficient message interface are discussed. The implementation of Sparcle as a CPU for the Alewife machine is described.
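To make the full/empty-bit synchronization concrete, here is a minimal software emulation of a full/empty word. It is only an illustrative sketch: on Sparcle the full/empty bit is a hardware tag attached to each memory word, and a load from an empty word traps to a fast handler instead of spinning in software.

#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* A word with an explicit full/empty flag (software stand-in for the
 * per-word hardware tag). */
typedef struct {
    atomic_bool full;
    int64_t     value;
} fe_word_t;

/* Producer: store the value, then mark the word full. */
static void fe_write(fe_word_t *w, int64_t v) {
    w->value = v;
    atomic_store_explicit(&w->full, true, memory_order_release);
}

/* Consumer: wait until the word is full, then read it. */
static int64_t fe_read(fe_word_t *w) {
    while (!atomic_load_explicit(&w->full, memory_order_acquire))
        sched_yield();   /* a two-phase waiter would poll, then block */
    return w->value;
}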


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1993

Integrating message-passing and shared-memory: early experience

David A. Kranz; Beng-Hong Lim; Kirk L. Johnson; John Kubiatowicz; Anant Agarwal

This paper discusses some of the issues involved in implementing a shared-address space programming model on large-scale, distributed-memory multiprocessors. While such a programming model can be implemented on both shared-memory and message-passing architectures, we argue that the transparent, coherent caching of global data provided by many shared-memory architectures is of crucial importance. Because message-passing mechanisms are much more efficient than shared-memory loads and stores for certain types of interprocessor communication and synchronization operations, however, we argue for building multiprocessors that efficiently support both shared-memory and message-passing mechanisms. We describe an architecture, Alewife, that integrates support for shared-memory and message-passing through a simple interface; we expect the compiler and runtime system to cooperate in using appropriate hardware mechanisms that are most efficient for specific operations. We report on both integrated and exclusively shared-memory implementations of our runtime system and two applications. The integrated runtime system drastically cuts down the cost of communication incurred by the scheduling, load balancing, and certain synchronization operations. We also present preliminary performance results comparing the two systems.


Architectural Support for Programming Languages and Operating Systems | 1994

Reactive synchronization algorithms for multiprocessors

Beng-Hong Lim; Anant Agarwal

Synchronization algorithms that are efficient across a wide range of applications and operating conditions are hard to design because their performance depends on unpredictable run-time factors. The designer of a synchronization algorithm has a choice of protocols to use for implementing the synchronization operation. For example, candidate protocols for locks include test-and-set protocols and queueing protocols. Frequently, the best choice of protocols depends on the level of contention: previous research has shown that test-and-set protocols for locks outperform queueing protocols at low contention, while the opposite is true at high contention. This paper investigates reactive synchronization algorithms that dynamically choose protocols in response to the level of contention. We describe reactive algorithms for spin locks and fetch-and-op that choose among several shared-memory and message-passing protocols. Dynamically choosing protocols presents a challenge: a reactive algorithm needs to select and change protocols efficiently, and has to allow for the possibility that multiple processes may be executing different protocols at the same time. We describe the notion of consensus objects that the reactive algorithms use to preserve correctness in the face of dynamic protocol changes. Experimental measurements demonstrate that reactive algorithms perform close to the best static choice of protocols at all levels of contention. Furthermore, with mixed levels of contention, reactive algorithms outperform passive algorithms with fixed protocols, provided that contention levels do not change too frequently. Measurements of several parallel applications show that reactive algorithms result in modest performance gains for spin locks and significant gains for fetch-and-op.
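A deliberately simplified sketch of the reactive idea is shown below. It is not the paper's algorithm: here a single atomic flag always provides mutual exclusion, the "queueing" protocol merely orders which waiter gets to contend for that flag, and only the lock holder changes protocols, a crude stand-in for the consensus-object machinery. The names and threshold values are invented for illustration.

#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool held;          /* the actual mutual-exclusion flag         */
    atomic_int  mode;          /* 0 = test-and-set, 1 = queued attempts    */
    atomic_uint next_ticket;   /* ticket dispenser for queued mode         */
    atomic_uint now_serving;   /* whose turn it is to contend for the flag */
    atomic_int  failures;      /* crude, recent contention estimate        */
} reactive_lock_t;

#define SWITCH_TO_QUEUE 64     /* failed attempts that trigger queued mode    */
#define SWITCH_TO_TAS    1     /* threshold for falling back to test-and-set  */

static void reactive_acquire(reactive_lock_t *l) {
    if (atomic_load(&l->mode) == 1) {
        /* Queued mode: take a ticket so only one waiter at a time
         * hammers the flag (a ticket-style ordering discipline). */
        unsigned me = atomic_fetch_add(&l->next_ticket, 1);
        while ((int)(atomic_load(&l->now_serving) - me) < 0)
            sched_yield();
    }
    int tries = 0;
    /* Test-and-test-and-set on the flag; this is the single point of
     * mutual exclusion in both modes, so a mode change is always safe. */
    while (atomic_exchange(&l->held, true)) {
        while (atomic_load(&l->held))
            sched_yield();
        tries++;
    }
    atomic_fetch_add(&l->failures, tries);
}

static void reactive_release(reactive_lock_t *l) {
    /* Only the holder re-evaluates the protocol choice. */
    int f = atomic_exchange(&l->failures, 0);
    if (f >= SWITCH_TO_QUEUE)
        atomic_store(&l->mode, 1);
    else if (f <= SWITCH_TO_TAS)
        atomic_store(&l->mode, 0);
    atomic_store(&l->held, false);
    atomic_fetch_add(&l->now_serving, 1);  /* admit the next queued waiter */
}

Because the flag always guards the critical section, switching modes here only changes fairness and contention behavior; the real reactive algorithms must solve the harder problem of keeping processes correct while they execute different protocols concurrently.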


ACM Transactions on Computer Systems | 1993

Waiting algorithms for synchronization in large-scale multiprocessors

Beng-Hong Lim; Anant Agarwal

Through analysis and experiments, this paper investigates two-phase waiting algorithms to minimize the cost of waiting for synchronization in large-scale multiprocessors. In a two-phase algorithm, a thread first waits by polling a synchronization variable. If the cost of polling reaches a limit L_poll and further waiting is necessary, the thread is blocked, incurring an additional fixed cost, B. The choice of L_poll is a critical determinant of the performance of two-phase algorithms. We focus on methods for statically determining L_poll because the run-time overhead of dynamically determining L_poll can be comparable to the cost of blocking in large-scale multiprocessor systems with lightweight threads. Our experiments show that always-block (L_poll = 0) is a good waiting algorithm with performance that is usually close to the best of the algorithms compared. We show that even better performance can be achieved with a static choice of L_poll based on knowledge of likely wait-time distributions. Motivated by the observation that different synchronization types exhibit different wait-time distributions, we prove that a static choice of L_poll can yield close to optimal on-line performance against an adversary that is restricted to choosing wait times from a fixed family of probability distributions. This result allows us to make an optimal static choice of L_poll based on synchronization type. For exponentially distributed wait times, we prove that setting L_poll = ln(e - 1)B results in a waiting cost that is no more than e/(e - 1) times the cost of an optimal off-line algorithm. For uniformly distributed wait times, we prove that setting L_poll = ((sqrt(5) - 1)/2)B results in a waiting cost that is no more than (sqrt(5) + 1)/2 (the golden ratio) times the cost of an optimal off-line algorithm. Experimental measurements of several parallel applications on the Alewife multiprocessor simulator corroborate our theoretical findings.
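The two-phase structure and the static poll limits translate directly into code. The fragment below is a hypothetical pthread-based rendering for illustration only: in the paper the setting is lightweight threads on Alewife, L_poll and B are cycle costs rather than iteration counts, and event_t, event_signal, and two_phase_wait are names invented for the example.

#include <math.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    atomic_bool     ready;
} event_t;

/* Static poll limits from the paper, as a fraction of the blocking cost B:
 *   exponential wait times: L_poll = ln(e - 1) * B        (about 0.54 * B),
 *     waiting cost at most e/(e - 1) times an optimal off-line algorithm;
 *   uniform wait times:     L_poll = ((sqrt(5) - 1)/2) * B (about 0.62 * B),
 *     waiting cost at most (sqrt(5) + 1)/2, the golden ratio, times optimal. */
static double poll_limit(double B, bool exponential_wait) {
    return exponential_wait ? log(exp(1.0) - 1.0) * B
                            : 0.5 * (sqrt(5.0) - 1.0) * B;
}

static void two_phase_wait(event_t *e, long poll_iters) {
    /* Phase 1: poll. poll_iters is the budget L_poll divided by the
     * (machine-specific) cost of one polling iteration. */
    for (long i = 0; i < poll_iters; i++)
        if (atomic_load_explicit(&e->ready, memory_order_acquire))
            return;
    /* Phase 2: block, paying the fixed cost B once. */
    pthread_mutex_lock(&e->mu);
    while (!atomic_load_explicit(&e->ready, memory_order_relaxed))
        pthread_cond_wait(&e->cv, &e->mu);
    pthread_mutex_unlock(&e->mu);
}

static void event_signal(event_t *e) {
    pthread_mutex_lock(&e->mu);
    atomic_store_explicit(&e->ready, true, memory_order_release);
    pthread_cond_broadcast(&e->cv);
    pthread_mutex_unlock(&e->mu);
}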


Proceedings of the IEEE | 1999

The MIT Alewife Machine

Anant Agarwal; Ricardo Bianchini; David Chaiken; Frederic T. Chong; Kirk L. Johnson; David A. Kranz; John Kubiatowicz; Beng-Hong Lim; Kenneth Mackenzie; Donald Yeung

A variety of models for parallel architectures, such as shared memory, message passing, and data flow, have converged in the recent past to a hybrid architecture form called distributed shared memory (DSM). Alewife, an early prototype of such DSM architectures, uses hybrid software and hardware mechanisms to support coherent shared memory, efficient user-level messaging, fine-grain synchronization, and latency tolerance. Alewife supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. Four mechanisms combine to achieve Alewife's goals of scalability and programmability: software-extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine-grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms, including block multithreading and prefetching, mask unavoidable delays due to communication. Extensive results from microbenchmarks, together with over a dozen complete applications running on a 32-node prototype, demonstrate that integrating message passing with shared memory enables a cost-efficient solution to the cache coherence problem and provides a rich set of programming primitives. Our results further show that messaging and shared memory operations are both important because each helps the programmer to achieve the best performance for various machine configurations.


Measurement and Modeling of Computer Systems | 1996

Limits on the performance benefits of multithreading and prefetching

Beng-Hong Lim; Ricardo Bianchini

This paper presents new analytical models of the performance benefits of multithreading and prefetching, and experimental measurements of parallel applications on the MIT Alewife multiprocessor. For the first time, both techniques are evaluated on a real machine as opposed to simulations. The models determine the region in the parameter space where the techniques are most effective, while the measurements determine the region where the applications lie. We find that these regions do not always overlap significantly. The multithreading model shows that only 2-4 contexts are necessary to maximize this technique's potential benefit in current multiprocessors. Multithreading improves execution time by less than 10% for most of the applications that we examined. The model also shows that multithreading can significantly improve the performance of the same applications in multiprocessors with longer latencies. Reducing context-switch overhead is not crucial. The software prefetching model shows that allowing 4 outstanding prefetches is sufficient to achieve most of this technique's potential benefit on current multiprocessors. Prefetching improves performance over a wide range of parameters, and improves execution time by as much as 20-50% even on current multiprocessors. The two models show that prefetching has a significant advantage over multithreading for machines with low memory latencies and/or applications with high cache miss rates, because a prefetch instruction consumes less time than a context switch.
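A back-of-envelope calculation makes the "2-4 contexts" conclusion plausible. The sketch below uses a standard saturation approximation, not the paper's actual analytical model: a thread runs R cycles between long-latency misses of latency L, a context switch costs C cycles, and utilization is capped at R/(R + C) once enough contexts exist to cover the latency. The numeric parameters are purely illustrative.

#include <stdio.h>

/* Rough block-multithreading utilization model (textbook-style
 * approximation): with n contexts, either the latency is not yet hidden
 * (n * R out of every R + L cycles are useful) or switch overhead is the
 * only remaining cost (R out of every R + C cycles are useful). */
static double utilization(int n, double R, double L, double C) {
    double unsaturated = n * R / (R + L);
    double saturated   = R / (R + C);
    return unsaturated < saturated ? unsaturated : saturated;
}

int main(void) {
    /* Illustrative numbers only: 50-cycle run length, 100-cycle remote
     * latency, Sparcle-like 14-cycle context switch. Utilization levels
     * off after about three contexts. */
    for (int n = 1; n <= 6; n++)
        printf("contexts=%d  utilization=%.2f\n",
               n, utilization(n, 50.0, 100.0, 14.0));
    return 0;
}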


Journal of Parallel and Distributed Computing | 1996

Evaluating the Performance of Multithreading and Prefetching in Multiprocessors

Ricardo Bianchini; Beng-Hong Lim

This paper presents new analytical models of the performance benefits of multithreading and prefetching, and experimental measurements of parallel applications on the MIT Alewife multiprocessor. For the first time, both techniques are evaluated on a real machine as opposed to simulations. The models determine the region in the parameter space where the techniques are most effective, while the measurements determine the region where the applications lie. We find that these regions do not always overlap significantly. The multithreading model shows that only 2-4 contexts are necessary to maximize this technique's potential benefit in current multiprocessors. For these multiprocessors, multithreading improves execution time by less than 10% for most of the applications that we examined. The model also shows that multithreading can significantly improve the performance of the same applications in multiprocessors with longer latencies. Reducing context-switch overhead is not crucial. The software prefetching model shows that allowing 4 outstanding prefetches is sufficient to achieve most of this technique's potential benefit on current multiprocessors. Prefetching improves performance over a wide range of parameters, and improves execution time by as much as 20-50% even on current multiprocessors. A comparison between the two models shows that prefetching has a significant advantage over multithreading for machines with low memory latencies and/or applications with high cache miss rates, because a prefetch instruction consumes less time than a context switch.


IEEE Computer | 1996

Application performance on the MIT Alewife machine

Frederic T. Chong; Beng-Hong Lim; Ricardo Bianchini; John Kubiatowicz; Anant Agarwal

The architecture of parallel machines influences the structure of parallel programs, and vice versa. An important result of research on shared memory applications is two sets of benchmarks, Splash and NAS. We explore the performance of 14 applications on the Alewife machine, introducing two new performance metrics: weighted cache-hit ratio and weighted computation granularity.


Proceedings of the US/Japan Workshop on Parallel Symbolic Computing: Languages, Systems, and Applications | 1992

Sparcle: A Multithreaded VLSI Processor for Parallel Computing

Anant Agarwal; Jonathan Babb; David Chaiken; Godfrey D'Souza; Kirk L. Johnson; David A. Kranz; John Kubiatowicz; Beng-Hong Lim; Gino K. Maa; Kenneth Mackenzie; Daniel Nussbaum; Mike Parkin; Donald Yeung

The Sparcle chip will clock at no more than 50 MHz. It has no more than 200K transistors. It does not use the latest technologies and dissipates a paltry 2 watts. It has no on-chip cache, no fancy pads, and only 207 pins. It does not even support multiple-instruction issue. Then why do we think this chip is interesting? Sparcle is a processor chip designed for large-scale multiprocessing. Processors suitable for multiprocessing environments must meet several requirements:

Collaboration


Dive into Beng-Hong Lim's collaborations.

Top Co-Authors

Anant Agarwal
Massachusetts Institute of Technology

David A. Kranz
Massachusetts Institute of Technology

David Chaiken
Massachusetts Institute of Technology

Kirk L. Johnson
Massachusetts Institute of Technology

Gino K. Maa
Massachusetts Institute of Technology

Kenneth Mackenzie
Massachusetts Institute of Technology