Stephen W. Melvin
University of California, Berkeley
Publication
Featured research published by Stephen W. Melvin.
international symposium on microarchitecture | 1988
Stephen W. Melvin; Michael C. Shebanow; Yale N. Patt
Microarchitectures that implement conventional instruction set architectures are usually limited in that they are only able to execute a small number of microoperations concurrently. This limitation is due in part to the fact that the units of work that the hardware treats as indivisible are small. While this limitation is not important for microarchitectures with a low level of functionality, it can be significant if the goal is to build hardware that can support a large number of microoperations executing concurrently. In this paper we address the tradeoffs associated with the sizes of the various units of work that a processor considers indivisible, or atomic. We argue that by allowing larger units of work to be atomic, restrictions on concurrent operation are reduced and performance is increased. We outline the implementation of a front end for a dynamically scheduled processor with hardware support for large atomic units. We discuss tradeoffs in the design and show that with a modest investment in hardware, the run-time advantages of large atomic units can be realized without the need to alter the instruction set architecture.
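The trade-off the abstract describes can be illustrated with a toy calculation. In this sketch, "checkpoint" is an assumed stand-in for the precise-state bookkeeping associated with each atomic-unit boundary; the function and its parameters are hypothetical, not from the paper.

```python
def checkpoints_needed(num_microops, atomic_unit_size):
    """Number of precise-state checkpoints needed to retire num_microops
    when the machine only guarantees precise state at atomic-unit
    boundaries (ceiling division)."""
    return -(-num_microops // atomic_unit_size)

# Retiring 120 microoperations:
small = checkpoints_needed(120, 3)   # small atomic units: 40 checkpoints
large = checkpoints_needed(120, 24)  # large atomic units: 5 checkpoints
```

The point of the sketch is only that larger atomic units mean fewer points at which the hardware must be able to materialize a consistent architectural state, which is one source of the run-time advantage the paper argues for.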
international symposium on microarchitecture | 1985
Yale N. Patt; Stephen W. Melvin; Wen-mei W. Hwu; Michael C. Shebanow
HPS is a new model for a high performance microarchitecture which is targeted for implementing very dissimilar ISP architectures. It derives its performance from executing the operations within a restricted window of a program out-of-order, asynchronously, and concurrently whenever possible. Before the model can be reduced to an effective working implementation of a particular target architecture, several issues need to be resolved. This paper discusses these issues, both in general and in the context of architectures with specific characteristics.
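The execution model described above, out-of-order issue within a restricted window, can be sketched as a toy scheduler. This is an illustrative simplification under assumed semantics (unit-latency operations, a sliding window over program order), not the HPS mechanism itself.

```python
def dataflow_schedule(ops, window=4):
    """Toy restricted-window dataflow scheduler (illustrative only).
    ops: dict mapping op name -> list of dependency names, in program
    order. Returns a dict mapping op name -> cycle issued. Each cycle,
    every op inside the window whose dependencies have completed is
    issued concurrently."""
    order = list(ops)  # program order (dicts preserve insertion order)
    done, cycle_of, cycle, head = set(), {}, 0, 0
    while head < len(order):
        window_ops = [n for n in order[head:head + window] if n not in done]
        ready = [n for n in window_ops if all(d in done for d in ops[n])]
        for n in ready:
            cycle_of[n] = cycle
        done.update(ready)
        # slide the window past completed ops at its head
        while head < len(order) and order[head] in done:
            head += 1
        cycle += 1
    return cycle_of
```

With independent operations "a" and "b" feeding "c", which feeds "d", the scheduler issues "a" and "b" together in cycle 0 even though they are distinct instructions, which is the concurrency the window buys.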
International Journal of Parallel Programming | 1995
Stephen W. Melvin; Yale N. Patt
It is now generally recognized that not enough parallelism exists within the small basic blocks of most general purpose programs to satisfy high performance processors. Thus, a wide variety of techniques have been developed to exploit instruction level parallelism across basic block boundaries. In this paper we discuss some previous techniques along with their hardware and software requirements. Then we propose a new paradigm for an instruction set architecture (ISA): block-structuring. This new paradigm is presented, its hardware and software requirements are discussed and the results from a simulation study are presented. We show that a block-structured ISA utilizes both dynamic and compile-time mechanisms for exploiting instruction level parallelism and has significant performance advantages over a conventional ISA.
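One compile-time mechanism in the spirit of the abstract is chaining basic blocks along a likely control path into larger units. The sketch below is a hypothetical greedy heuristic, not the paper's algorithm; the profile input and size limit are assumptions.

```python
def enlarge_blocks(blocks, hot_successor, max_ops):
    """Toy sketch of forming enlarged blocks along the hot path.
    blocks: dict label -> op count, in layout order.
    hot_successor: dict label -> most likely next label (may be absent).
    Greedily chains each block with its hot successors while the
    combined block stays within max_ops."""
    placed, enlarged = set(), []
    for label in blocks:
        if label in placed:
            continue
        chain, size = [label], blocks[label]
        placed.add(label)
        nxt = hot_successor.get(label)
        while nxt and nxt not in placed and size + blocks[nxt] <= max_ops:
            chain.append(nxt)
            size += blocks[nxt]
            placed.add(nxt)
            nxt = hot_successor.get(nxt)
        enlarged.append(chain)
    return enlarged
```

For example, blocks A(4 ops), B(3), C(6), D(2) with hot path A→B→C→D and a 10-op limit merge into [A, B] and [C, D], giving the dynamic scheduler larger windows of straight-line work.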
international symposium on computer architecture | 1991
Stephen W. Melvin; Yale N. Patt
It has been suggested that non-scientific code has very little parallelism not already exploited by existing processors. In this paper we show that, contrary to this notion, there is actually a significant amount of unexploited parallelism in typical general purpose code. In order to exploit this parallelism, a combination of hardware and software techniques must be applied. We analyze three techniques: dynamic scheduling, speculative execution and basic block enlargement. We will show that indeed for narrow instruction words little is to be gained by applying these techniques. However, as the number of simultaneous operations increases, it becomes possible to achieve speedups of three to six on realistic processors.
compilers, architecture, and synthesis for embedded systems | 2002
Stephen W. Melvin; Yale N. Patt
Network processors are being asked to perform increasingly complex operations on packets of information at faster and faster rates. Because processor performance and memory cycle times are not keeping up with this demand, there is a fundamental need for simultaneous processing of multiple packets, and the degree of this parallelism is increasing. Sometimes a dependency exists between two packets currently being operated on, and as the ratio of packet processing time to packet transmission time increases, these dependencies are more likely to impact performance. Thus, the way packet dependencies are handled will become critical. In this paper we show that there is potentially a dramatic difference in performance between optimal and non-optimal solutions. We argue that this is the key challenge that must be addressed in highly parallel network processors. We discuss how work in thread level speculation relates to this problem and describe a practical hardware implementation that requires little or no changes to software and with near optimal performance.
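The cost of inter-packet dependencies can be illustrated with a toy round-based model of speculation: every unfinished packet starts optimistically, and a packet that conflicts with an earlier in-flight packet on the same flow is squashed and retried. This model and its all-or-nothing conflict rule are assumptions for illustration, not the paper's hardware scheme.

```python
def speculative_rounds(packets):
    """Toy model of thread-level speculation over packets.
    packets: list of flow ids in arrival order. Each round, all
    unfinished packets execute speculatively; a packet whose flow
    matches an earlier unfinished packet this round conflicts and
    retries. Returns the number of rounds needed."""
    remaining = list(enumerate(packets))
    rounds = 0
    while remaining:
        rounds += 1
        seen, survivors = set(), []
        for idx, flow in remaining:
            if flow in seen:
                survivors.append((idx, flow))  # conflict: squash, retry
            else:
                seen.add(flow)                 # commits this round
        remaining = survivors
    return rounds
```

Four packets on independent flows finish in one round, while four packets on the same flow serialize into four rounds; real traffic sits between these extremes, which is why the abstract's gap between optimal and non-optimal handling can be dramatic.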
international symposium on microarchitecture | 1986
Yale N. Patt; Stephen W. Melvin; Wen-mei W. Hwu; Michael C. Shebanow; Chien Chen
The VAX architecture is a popular ISP architecture that has been implemented in several different technologies targeted to a wide range of performance specifications. However, it has been argued that the VAX has specific characteristics which preclude a very high performance implementation. We have developed a microarchitecture (HPS) which is specifically intended for implementing very high performance computing engines. Our model of execution is a restriction on fine granularity data flow. In this paper, we concentrate on one particular aspect of an HPS implementation of the VAX architecture: the generation of HPS microinstructions (i.e. data flow nodes) from a VAX instruction stream.
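The node-generation step the abstract describes, turning one complex instruction into a small dataflow graph, can be sketched for a single register-memory add. The decomposition, node format, and names below are illustrative assumptions, not the actual HPS microinstruction format.

```python
def decode_to_nodes(src_reg, addr_reg):
    """Hypothetical decomposition of one register-memory add (in the
    spirit of a VAX ADDL2 with a register-deferred operand) into
    dataflow nodes. Each node is (op, inputs, output); t1/t2 are
    temporaries that carry the data dependencies between nodes."""
    return [
        ("load",  [addr_reg],      "t1"),  # fetch the memory operand
        ("add",   [src_reg, "t1"], "t2"),  # the ALU operation
        ("store", [addr_reg, "t2"], None), # write the result back
    ]
```

Once the instruction is expressed this way, the load, add, and store are scheduled individually by their data dependencies rather than as one monolithic CISC instruction, which is what lets the fine-granularity dataflow engine overlap them with neighboring instructions.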
international conference on supercomputing | 1989
Stephen W. Melvin; Yale N. Patt
In this paper we identify three types of atomic units, or indivisible units of work: architectural atomic units (defined by architecture level interrupts and exceptions), compiler atomic units (defined by compiler code generation) and execution atomic units (defined by run-time interruptibility). We discuss trade-offs for these units and show that size has different performance implications depending on the atomic unit. We simulate a number of different implementations of the VAX architecture, focusing on different execution atomic unit sizes. We show that significant performance benefits can be achieved by having large execution atomic units in dynamically scheduled machines.
international symposium on microarchitecture | 1987
James E. Wilson; Stephen W. Melvin; Michael C. Shebanow; Wen-mei W. Hwu; Yale N. Patt
The HPS Microarchitecture has been developed as an execution model for implementing various architectures at very high performance. A considerable amount of effort has gone into the use of HPS as a microarchitecture for the VAX. In this paper, we describe our first full simulation of the microVAX subset, and report the results of varying (i.e. tuning) certain important parameters.
international symposium on microarchitecture | 1986
Jeffrey D. Gee; Stephen W. Melvin; Yale N. Patt
We have implemented a high performance Prolog engine by directly executing in microcode the constructs of Warren's Abstract Machine. The implementation vehicle is the VAX 8600 computer. The VAX 8600 is a general purpose processor containing 8K words of writable control store. In our system, each of the Warren Abstract Machine instructions is implemented as a VAX 8600 machine level instruction. Other Prolog built-ins are either implemented directly in microcode or executed by the general VAX instruction set. Initial results indicate that our system is the fastest implementation of Prolog on a commercially available general purpose processor.
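By way of analogy with mapping each Warren Abstract Machine instruction onto one machine-level instruction, the sketch below dispatches a tiny invented WAM-like subset through one handler per opcode. The two instructions and their semantics are simplified assumptions for illustration only.

```python
def run_wam(program):
    """Minimal sketch of a WAM-style dispatch loop (invented subset).
    program: list of (opcode, constant, register) tuples.
    put_constant loads a register; unify_constant succeeds only if the
    register already holds that constant, failing the query otherwise."""
    regs = {}
    for op, const, reg in program:
        if op == "put_constant":
            regs[reg] = const
        elif op == "unify_constant":
            if regs.get(reg) != const:
                return False  # unification failure
        else:
            raise ValueError(f"unknown instruction: {op}")
    return True
```

In the actual system each such opcode is a single 8600 instruction backed by microcode, eliminating the interpretive dispatch loop that this Python sketch still pays for.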
measurement and modeling of computer systems | 1988
Stephen W. Melvin; Yale N. Patt
We have developed a tool based on microcode modifications to a VAX 8600 which allows a wide variety of operating system measurements to be taken with minimal perturbation and without the need to modify any operating system software. A trace of interrupts, exceptions, system calls and context switches is generated as a side-effect to normal execution. In this paper we describe the tool we have developed and present some results we have gathered under both UNIX 4.3 BSD and VAX/VMS V4.5. We compare the process fork behavior of two different command shells under UNIX, look at context switch rates for interactive and batch workloads and generate a histogram for network interrupt service time.
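The post-processing step for one of the measurements above, the histogram of network interrupt service time, can be sketched as follows. The trace record layout (event kind plus start and end timestamps in microseconds) is an assumption; the paper's tool emits its trace from microcode, and this only shows the downstream bucketing.

```python
from collections import Counter

def service_time_histogram(events, bucket_us):
    """Bucket network-interrupt service times from a trace.
    events: iterable of (kind, start_us, end_us) records (assumed
    format). Returns dict: bucket index -> count, where bucket i
    covers [i * bucket_us, (i + 1) * bucket_us)."""
    hist = Counter()
    for kind, start, end in events:
        if kind == "network_interrupt":
            hist[(end - start) // bucket_us] += 1
    return dict(hist)
```

Because the trace is generated as a side effect of normal execution, analyses like this run entirely offline and add no perturbation beyond the microcode hooks themselves.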