Network


Latest external collaboration on country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Paul Chow is active.

Publication


Featured researches published by Paul Chow.


international symposium on microarchitecture | 1997

The multicluster architecture: reducing cycle time through partitioning

Keith I. Farkas; Paul Chow; Norman P. Jouppi; Zvonko G. Vranesic

The multicluster architecture that we introduce offers a decentralized, dynamically scheduled architecture, in which the register files, dispatch queue, and functional units of the architecture are distributed across multiple clusters, and each cluster is assigned a subset of the architectural registers. The motivation for the multicluster architecture is to reduce the clock cycle time, relative to a single-cluster architecture with the same number of hardware resources, by reducing the size and complexity of components on critical timing paths. Resource partitioning, however, introduces instruction-execution overhead and may reduce the number of concurrently executing instructions. To counter these two negative by-products of partitioning, we developed a static instruction scheduling algorithm. We describe this algorithm, and using trace-driven simulations of SPEC92 benchmarks, evaluate its effectiveness. This evaluation indicates that for the configurations considered the multicluster architecture may have significant performance advantages at feature sizes below 0.35 /spl mu/m, and warrants further investigation.


IEEE Transactions on Very Large Scale Integration Systems | 1999

The design of a SRAM-based field-programmable gate array-Part II: Circuit design and layout

Paul Chow; Soon Ong Seo; Jonathan Rose; Kevin Chung; Gerard Páez-Monzón; Immanuel Rahardja

For Pt.I see ibid., vol.7, pp.191-7 (1999). Field-programmable gate arrays (FPGAs) are now widely used for the implementation of digital systems, and many commercial architectures are available. Although the literature and data books contain detailed descriptions of these architectures, there is very little information on how the high-level architecture was chosen and no information on the circuit-level or physical design of the devices. In Part I of this paper, we described the high-level architectural design of a static random-access memory programmable FPGA. This paper will address the circuit-design issues through to the physical layout. We address area-speed tradeoffs in the design of the logic block circuits and in the connections between the logic and the routing structure. All commercial FPGA designs are done using full-custom hand layout to obtain absolute minimum die sizes. This is both labor and time intensive. We propose a design style with a minitile that contains a portion of all the components in the logic tile, resulting in less full-custom effort. The minitile is replicated in a 4/spl times/4 array to create a macro tile. The minitile is optimized for layout density and speed, and is customized in the array by adding appropriate vias. This technique also permits easy changing of the hard-wired connections in the logic block architecture and the segmentation length distribution in the routing architecture.


IEEE Journal of Solid-state Circuits | 1990

Architecture of field-programmable gate arrays: the effect of logic block functionality on area efficiency

Jonathan Rose; Robert J. Francis; David Lewis; Paul Chow

The relationship between the functionality of a field-programmable gate array (FPGA) logic block and the area required to implement digital circuits using that logic block is examined. The investigation is done experimentally by implementing a set of industrial circuits as FPGAs using CAD (computer-aided design) tools for technology mapping, placement, and routing. A range of programming technologies (the method of FPGA customization) is explored using a simple model of the interconnection and logic block area. The experiments are based on logic blocks that use lookup tables for implementing combinational logic. Results indicate that the best number of inputs to use (a measure of the blocks functionality) is between three and four, and that a D flip-flop should be included in the logic block. The results are largely independent of the programming technology. More generally, it was observed that the area efficiency of a logic block depends not only on its functionality but also on the average number of pins connected per logic block. >


field programmable gate arrays | 1999

Memory interfacing and instruction specification for reconfigurable processors

Jeffrey A. Jacob; Paul Chow

As custom computing machines evolve, it is clear that a major bottleneck is the slow interconnection architecture between the logic and memory. This paper describes the architecture of a custom computing machine that overcomes the interconnection bottleneck by closely integrating a fixed-logic processor, a reconfigurable logic array, and memory into a single chip, called OneChip-98. The OneChip-98 system has a seamless programming model that enables the programmer to easily specify instructions without additional complex instruction decoding hardware. As well, there is a simple scheme for mapping instructions to the corresponding programming bits. To allow the processor and the reconfigurable array to execute concurrently, the programming model utilizes a novel memory-consistency scheme implemented in the hardware. To evaluate the feasibility of the OneChip-98 architecture, a 32-bit MIPS-like processor and several performance enhancement applications were mapped to the Transmogrifier-2 field programmable system. For two typical applications, the 2-dimensional discrete cosine transform and the 64-tap FIR filter, we were capable of achieving a performance speedup of over 30 times that of a stand-alone state-of-the-art processor.


international symposium on computer architecture | 1997

Memory-system design considerations for dynamically-scheduled processors

Keith I. Farkas; Paul Chow; Norman P. Jouppi; Zvonko G. Vranesic

In this paper, we identify performance trends and design relationships between the following components of the data memory hierarchy in a dynamically-scheduled processor: the register file, the lockup-free data cache, the stream buffers, and the interface between these components and the lower levels of the memory hierarchy. Similar performance was obtained from all systems having support for fewer than four in-flight misses, irrespective of the register-file size, the issue width of the processor, and the memory bandwidth. While providing support for more than four in-flight misses did increase system performance, the improvement was less than that obtained by increasing the number of registers. The addition of stream buffers to the investigated systems led to a significant performance increase, with the larger increases for systems having less in-flight-miss support, greater memory bandwidth, or more instruction issue capability. The performance of these systems was not significantly affected by the inclusion of traffic filters, dynamic-stride calculators, or the inclusion of the per-load non-unity stride-predictor and the incremental-prefetching techniques, which we introduce. However, the incremental prefetching technique reduces the bandwidth consumed by stream buffers by 50% without a significant impact on performance.


high-performance computer architecture | 1996

Register file design considerations in dynamically scheduled processors

Keith I. Farkas; Norman P. Jouppi; Paul Chow

We have investigated the register file requirements of dynamically scheduled processors using register renaming and dispatch queues running the SPEC92 benchmarks. We looked at processors capable of issuing either four or eight instructions per cycle and found that in most cases implementing precise exceptions requires a relatively small number of additional registers compared to imprecise exceptions. Systems with aggressive non-blacking load support were able to achieve performance similar to processors with perfect memory systems at the cost of some additional registers. Given our machine assumptions, we found that the performance of a four-issue machine with a 32-entry dispatch queue tends to saturate around 80 registers. For an eight-issue machine with a 64-entry dispatch queue performance does not saturate until about 128 registers. Assuming the machine cycle time is proportional to the register file cycle time, the 8-issue machine yields only 20% higher performance than the 4-issue machine due in part to the cycle time impact of additional hardware.


architectural support for programming languages and operating systems | 1996

Exploiting dual data-memory banks in digital signal processors

Mazen A. R. Saghir; Paul Chow; Corinna G. Lee

Over the past decade, digital signal processors (DSPs) have emerged as the processors of choice for implementing embedded applications in high-volume consumer products. Through their use of specialized hardware features and small chip areas, DSPs provide the high performance necessary for embedded applications at the low costs demanded by the high-volume consumer market. One feature commonly found in DSPs is the use of dual data-memory banks to double the memory systems bandwidth. When coupled with high-order data interleaving, dual memory banks provide the same bandwidth as more costly memory organizations such as a dual-ported memory. However, making effective use of dual memory banks remains difficult, especially for high-level language (HLL) DSP compilers.In this paper, we describe two algorithms --- compaction-based (CB) data partitioning and partial data duplication --- that we developed as part of our research into the effective exploitation of dual data-memory banks in HLL DSP compilers. We show that CB partitioning is an effective technique for exploiting dual data-memory banks, and that partial data duplication can augment CB partitioning in improving execution performance. Our results show that CB partitioning improves the performance of our kernel benchmarks by 13%-40% and the performance of our application benchmarks by 3%-15%. For one of the application benchmarks, partial data duplication boosts performance from 3% to 34%.


field programmable gate arrays | 2001

The effect of reconfigurable units in superscalar processors

E E Jorge Carrillo; Paul Chow

This paper describes OneChip, a third generation reconfigurable processor architecture that integrates a Reconfigurable Functional Unit (RFU) into a superscalar Reduced Instruction Set Computer (RISC) processors pipeline. The architecture allows dynamic scheduling and dynamic reconfiguration. It also provides support for pre-loading configurations and for Least Recently Used (LRU) configuration management. To evaluate the performance of the OneChip architecture, several off-the-shelf software applications were compiled and executed on Sim-OneChip, an architecture simulator for OneChip that includes a software environment for programming the system. The architecture is compared to a similar one but without dynamic scheduling and without an RFU. OneChip achieves a performance improvement and shows a speedup range from 2.16 up to 32 for the different applications and data sizes used. The results show that dynamic scheduling helps performance the most on average, and that the RFU will always improve performance the best when most of the execution is in the RFU.This paper describes OneChip, a third generation reconfigurable processor architecture that integrates a Reconfigurable Functional Unit (RFU) into a superscalar Reduced Instruction Set Computer (RISC) processors pipeline. The architecture allows dynamic scheduling and dynamic reconfiguration. It also provides support for pre-loading configurations and for Least Recently Used (LRU) configuration management. To evaluate the performance of the OneChip architecture, several off-the-shelf software applications were compiled and executed on Sim-OneChip, an architecture simulator for OneChip that includes a software environment for programming the system. The architecture is compared to a similar one but without dynamic scheduling and without an RFU. OneChip achieves a performance improvement and shows a speedup range from 2.16 up to 32 for the different applications and data sizes used. The results show that dynamic scheduling helps performance the most on average, and that the RFU will always improve performance the best when most of the execution is in the RFU.


IEEE Journal of Solid-state Circuits | 1987

MIPS-X: a 20-MIPS peak, 32-bit microprocessor with on-chip cache

Mark Horowitz; Paul Chow; Don Stark; Richard T. Simoni; Arturo Salz; Steven A. Przybylski; John L. Hennessy; Glenn Gulak; Anant Agarwal; John M. Acken

MIPS-X is a 32-b RISC microprocessor implemented in a conservative 2-/spl mu/m, two-level-metal, n-well CMOS technology. High performance is achieved by using a nonoverlapping two-phase 20-MHz clock and executing one instruction every cycle. To reduce its memory bandwidth requirements, MIPS-X includes a 2-kbyte on-chip instruction cache. The authors provide an overview of MIPS-X, focusing on the techniques used to reduce the complexity of the processor and implement the on-chip instruction cache.


IEEE Transactions on Very Large Scale Integration Systems | 1999

The design of an SRAM-based field-programmable gate array. I. Architecture

Paul Chow; Soon Ong Seo; Jonathan Rose; Kevin Chung; Gerard Páez-Monzón; Immanuel Rahardja

Field-programmable gate arrays (FPGAs) are now widely used for the implementation of digital systems, and many commercial architectures are available. Although the literature and data books contain detailed descriptions of these architectures, there is very little information on how the high-level architecture was chosen, and no information on the circuit-level or physical design of the devices. This paper describes the high-level architectural design of a static-random-access memory programmable FPGA. A forthcoming Part II will address the circuit design issues through to the physical layout. The logic block and routing architecture of the FPGA was determined through experimentation with benchmark circuits and custom-built computer-aided design tools. The resulting logic block is an asymmetric tree of four-input lookup tables that are hard-wired together and a segmented routing architecture with a carefully chosen segment length distribution.

Collaboration


Dive into the Paul Chow's collaboration.

Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Researchain Logo
Decentralizing Knowledge