Whay Sing Lee
Sun Microsystems
Publications
Featured research published by Whay Sing Lee.
International Symposium on Computer Architecture | 1998
Stephen W. Keckler; William J. Dally; Daniel Maskit; Nicholas P. Carter; Andrew Chang; Whay Sing Lee
Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been exploited either at the instruction level with a grain-size of a single instruction or by partitioning applications into coarse threads with grain-sizes of thousands of instructions. Fine-grain threads fill the parallelism gap between these extremes by enabling tasks with run lengths as small as 20 cycles. As this fine-grain parallelism is orthogonal to ILP and coarse threads, it complements both methods and provides an opportunity for greater speedup. This paper describes the efficient communication and synchronization mechanisms implemented in the Multi-ALU Processor (MAP) chip, including a thread creation instruction, register communication, and a hardware barrier. These register-based mechanisms provide 10 times faster communication and 60 times faster synchronization than mechanisms that operate via a shared on-chip cache. With a three-processor implementation of the MAP, fine-grain speedups of 1.2-2.1 are demonstrated on a suite of applications.
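The programming model sketched in this abstract (a fork-style thread creation instruction, register-based result communication, and a hardware barrier) can be illustrated with a software analogy. This is a minimal sketch, not the MAP ISA: `fork_fine_grain`, the per-task "register" slots, and the use of Python threads are all illustrative stand-ins for hardware mechanisms.

```python
import threading

def fork_fine_grain(tasks):
    """Software analogy of MAP-style fine-grain threading (illustrative only):
    each short task is 'forked', writes its result to a dedicated slot
    (standing in for register communication), and a barrier stands in for
    the hardware barrier instruction."""
    regs = [None] * len(tasks)                   # per-task "register" slots
    barrier = threading.Barrier(len(tasks) + 1)  # tasks + the parent

    def worker(i, fn):
        regs[i] = fn()       # "register write" visible to the parent
        barrier.wait()       # synchronize with siblings and the parent

    for i, fn in enumerate(tasks):
        threading.Thread(target=worker, args=(i, fn)).start()
    barrier.wait()           # parent passes only after every task has finished
    return regs
```

In the MAP these steps are single instructions with cycle-scale cost; here each fork costs microseconds, which is exactly the overhead gap the hardware mechanisms are designed to close.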
IEEE Computer | 1998
Whay Sing Lee; William J. Dally; Stephen W. Keckler; Nicholas P. Carter; Andrew Chang
With increasing demand for computing power, multiprocessing computers will become more common in the future. In these systems, the growing discrepancy between processor and memory technologies will cause tightly integrated message interfaces to be essential for achieving the necessary efficiency, which is especially important in light of the growing interest in software-distributed, shared memory systems. The authors conduct a performance evaluation of several primitive messaging mechanisms: dispatch mechanisms (how the processor reacts to message arrivals), memory-mapped versus register-mapped interfaces, and streaming versus buffered interfaces, baselining these results against the MIT M-Machine and its tightly integrated message interfaces. They find that a message can be dispatched up to 18 times faster by reserving a hardware thread context for message reception instead of an interrupt-driven interface. They also find that the mapping decision is important, with integrated register-mapped interfaces as much as 3.5 times more efficient than conventional systems. To meet the challenges and exploit the opportunities presented by emerging multithreaded processor architectures, low-overhead mechanisms for protection against message corruption, interception, and starvation must be integral to the message system design. The authors hope that the simple messaging mechanisms presented can help provide a solution to these challenges.
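The dispatch mechanism the paper favors, a hardware thread context reserved for message reception, can be sketched in software. This is a toy model under stated assumptions: `dedicated_receiver`, the queue, and the sentinel protocol are illustrative; the point is that the receiving context blocks directly on arrivals and jumps straight into the handler, with none of the state save/restore an interrupt-driven path pays per message.

```python
import queue
import threading

def dedicated_receiver(msgs, handler):
    """Toy model of a reserved reception context (illustrative, not the MAP
    interface): a dedicated thread blocks on the message queue and dispatches
    each arrival directly into the handler."""
    out = []
    q = queue.Queue()
    for m in msgs:
        q.put(m)
    q.put(None)              # sentinel: no more messages

    def recv_loop():
        while True:
            m = q.get()      # the reserved context blocks here, ready to run
            if m is None:
                break
            out.append(handler(m))   # dispatch straight into the handler

    t = threading.Thread(target=recv_loop)
    t.start()
    t.join()
    return out
```

An interrupt-driven interface would interpose a vector, state save, and restore around every `handler` call; reserving the context amortizes that cost to zero, which is the source of the 18x dispatch advantage the paper reports.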
International Conference on Computer Design | 1998
Andrew Chang; William J. Dally; Stephen W. Keckler; Nicholas P. Carter; Whay Sing Lee
Continuing reductions in on-chip geometries yield increasing numbers of transistors per chip and fundamentally faster devices but also result in effectively slower wires. This combination presents significant challenges for new microprocessor architectures. The disparity in performance between on-chip arithmetic units and memory creates effectively longer latencies. The changing balance between gate delay and wire delay penalizes global interactions. The MIT Multi-ALU Processor (MAP) architecture incorporates three explicitly parallel mechanisms to address these challenges. Efficient intercluster interactions enable instruction scheduling across clustered arithmetic units. Deferred exceptions based on ERRVALs facilitate aggressive instruction reordering and speculation. Zero-cycle multithreading provides latency tolerance without sacrificing single-threaded performance. In this paper, we describe each of these mechanisms and quantify their impact on the area and routing of the cluster pipeline in the 5-million-transistor MAP chip. Zero-cycle multithreading accounts for over 44% of the total cluster area. Support for ERRVALs requires very little area (less than 4%). The intercluster interaction mechanisms require minimal cluster area and less than 5% of the available global routing resources, but enable fully general access across clusters and between all arithmetic units.
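The ERRVAL deferred-exception mechanism the abstract mentions resembles poison-value propagation: a faulting instruction produces a tagged error value instead of trapping, the value flows through later speculative instructions, and the trap fires only if the value is actually consumed. A minimal sketch, assuming an `ErrVal` class and a `commit` consumer that are hypothetical Python stand-ins, not the MAP's tagged representation:

```python
class ErrVal:
    """Poison value recording the original fault; any arithmetic use of a
    poisoned input yields the same poison (illustrative stand-in for the
    MAP's ERRVAL tag)."""
    def __init__(self, cause):
        self.cause = cause
    def __add__(self, other):
        return self          # poison propagates through speculative work
    __radd__ = __mul__ = __rmul__ = __add__

def checked_div(a, b):
    """Produce an ErrVal instead of trapping on divide-by-zero."""
    return ErrVal("divide by zero") if b == 0 else a / b

def commit(v):
    """The consuming 'instruction': the deferred exception is raised here,
    only if the poisoned value is actually observed."""
    if isinstance(v, ErrVal):
        raise ZeroDivisionError(v.cause)
    return v
```

This is what lets the hardware reorder and speculate past a potentially faulting instruction: `checked_div(1, 0) + 5` executes without trapping, and the exception surfaces only at `commit` if that result turns out to be needed.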
IEEE International Conference on High Performance Computing, Data, and Analytics | 2000
Nicholas P. Carter; William J. Dally; Whay Sing Lee; Stephen W. Keckler; Andrew Chang
The M-Machine's combined hardware/software shared-memory system provides significantly lower remote memory latencies than software DSM systems while retaining the flexibility of software DSM. This system is based around four hardware mechanisms for shared memory: status bits on individual memory blocks, hardware translation of memory addresses to home processors, fast detection of remote accesses, and dedicated thread slots for shared-memory handlers. These mechanisms have been implemented on the MAP processor, and allow remote memory references to be completed in as little as 336 cycles at low hardware cost.
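The first two mechanisms listed above, per-block status bits and translation to a home processor, can be sketched as a toy model. Everything here is an assumption for illustration: the block size, the `SharedMemory` class, and the dict-based "home node" stand in for hardware status bits, address translation, and the software handler running in a dedicated thread slot.

```python
BLOCK = 8  # assumed block size in words (illustrative, not the MAP's)

class SharedMemory:
    """Toy model of block-granularity shared memory: a load on a
    non-resident block 'faults' into a software handler that fetches the
    block from its home node."""
    def __init__(self, home):
        self.home = home            # home-node storage, keyed by block number
        self.local = {}             # blocks currently resident locally
        self.remote_fetches = 0     # counts handler invocations

    def load(self, addr):
        blk = addr // BLOCK         # hardware translation: address -> block
        if blk not in self.local:   # "status bits" say the block is remote
            self.remote_fetches += 1
            self.local[blk] = list(self.home[blk])  # handler: remote fetch
        return self.local[blk][addr % BLOCK]
```

The hardware's contribution is making the residency check and the fault into the handler nearly free, so only genuinely remote references pay the fetch cost (the 336-cycle figure the abstract reports).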
Archive | 2001
Nisha D. Talagala; Whay Sing Lee; Chia Y. Wu
Archive | 2000
Whay Sing Lee; Randall D. Rettberg
Archive | 2000
Whay Sing Lee; Nisha D. Talagala
Archive | 2001
Whay Sing Lee; Randall D. Rettberg
Archive | 2001
Whay Sing Lee; Randall D. Rettberg
Archive | 2001
Whay Sing Lee