Raymond R. Hoare | Researchain

Publication

Featured research published by Raymond R. Hoare.

conference on high performance computing (supercomputing) | 2005

On the Feasibility of Optical Circuit Switching for High Performance Computing Systems

Kevin J. Barker; Alan F. Benner; Raymond R. Hoare; Adolfy Hoisie; Darren J. Kerbyson; Dan Li; Rami G. Melhem; Ramakrishnan Rajamony; Eugen Schenfeld; Shuyi Shao; Craig B. Stunkel; Peter A. Walker

The interconnect plays a key role in both the cost and performance of large-scale HPC systems. The cost of future high-bandwidth electronic interconnects mushrooms due to expensive optical transceivers needed between electronic switches. We describe a potentially cheaper and more power-efficient approach to building high-performance interconnects. Through empirical analysis of HPC applications, we find that the bulk of inter-processor communication (barring collectives) is bounded in degree and changes very slowly or never. Thus we propose a two-network interconnect: An Optical Circuit Switching (OCS) network handling long-lived bulk data transfers, using optical switches; and a secondary lower-bandwidth Electronic Packet Switching (EPS) network. An OCS could be significantly cheaper, as it uses fewer optical transceivers than an electronic network. Collectives and transient communication packets traverse the electronic network. We present compiler techniques and dynamic run-time policies, using this two-network interconnect. Simulation results show that our approach provides high performance at low cost.
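The split between the two networks described above can be illustrated with a minimal Python sketch. It is not the paper's compiler technique or run-time policy. The names `Flow`, `assign_networks`, `BULK_BYTES`, and `MAX_CIRCUITS` are hypothetical, and the threshold values are assumptions, since the abstract gives no numbers. The sketch only shows the stated idea: long-lived bulk transfers with bounded degree go to the OCS network, while collectives and transient traffic go to the EPS network.

```python
from dataclasses import dataclass

# Assumed values for illustration only; the abstract does not state thresholds.
BULK_BYTES = 64 * 1024   # flows at least this large count as bulk transfers
MAX_CIRCUITS = 4         # assumed bound on optical circuits per source node


@dataclass(frozen=True)
class Flow:
    src: int
    dst: int
    nbytes: int
    is_collective: bool = False


def assign_networks(flows):
    """Map each flow to "OCS" (optical circuit) or "EPS" (electronic packet)."""
    circuits = {}     # source node -> set of peers that already hold a circuit
    assignment = {}
    # Consider the heaviest flows first so circuits serve the most traffic.
    for flow in sorted(flows, key=lambda f: f.nbytes, reverse=True):
        peers = circuits.setdefault(flow.src, set())
        is_bulk = flow.nbytes >= BULK_BYTES and not flow.is_collective
        has_room = flow.dst in peers or len(peers) < MAX_CIRCUITS
        if is_bulk and has_room:
            peers.add(flow.dst)
            assignment[flow] = "OCS"
        else:
            # Collectives, small messages, and overflow use the packet network.
            assignment[flow] = "EPS"
    return assignment
```

Calling `assign_networks` on a list of `Flow` objects returns a dictionary keyed by flow. Under these assumptions a 1 MiB point-to-point transfer is placed on the OCS network, while a 128-byte message or a collective stays on the EPS network.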

field programmable gate arrays | 2005

An FPGA-based VLIW processor with custom hardware execution

Raymond R. Hoare; Dara Kusic; Joshua Fazekas; John Foster

The capability and heterogeneity of new FPGA (Field Programmable Gate Array) devices continue to increase with each new line of devices. Efficiently programming these devices is increasing in difficulty. However, FPGAs continue to be utilized for algorithms traditionally targeted to embedded DSP microprocessors, such as signal and image processing applications. This paper presents an architecture that combines VLIW (Very Long Instruction Word) processing with the capability to introduce application-specific customized instructions and complex hardware functions. To support this architecture, a compilation and design automation flow is described for programs written in C. Several design tradeoffs for the architecture were examined, including the number of VLIW functional units and register file size. The architecture was implemented on an Altera Stratix II FPGA. The Stratix II device was selected because it offers a large number of high-speed DSP (digital signal processing) blocks that execute multiply-accumulate operations. We show that our combined VLIW with hardware functions exhibits as much as 230X speedup and 63X on average for computational kernels for a set of benchmarks. This allows for an overall speedup of 30X and 12X on average for signal processing benchmarks from MediaBench.

EURASIP Journal on Advances in Signal Processing | 2006

Rapid VLIW processor customization for signal processing applications using combinational hardware functions

Raymond R. Hoare; Dara Kusic; Joshua Fazekas; John Foster; Shen Chih Tung; Michael L. McCloud

This paper presents an architecture that combines VLIW (very long instruction word) processing with the capability to introduce application-specific customized instructions and highly parallel combinational hardware functions for the acceleration of signal processing applications. To support this architecture, a compilation and design automation flow is described for algorithms written in C. The key contributions of this paper are as follows: (1) a 4-way VLIW processor implemented in an FPGA, (2) large speedups through hardware functions, (3) a hardware/software interface with zero overhead, (4) a design methodology for implementing signal processing applications on this architecture, (5) tractable design automation techniques for extracting and synthesizing hardware functions. Several design tradeoffs for the architecture were examined, including the number of VLIW functional units and register file size. The architecture was implemented on an Altera Stratix II FPGA. The Stratix II device was selected because it offers a large number of high-speed DSP (digital signal processing) blocks that execute multiply-accumulate operations. Using the MediaBench benchmark suite, we tested our methodology and architecture to accelerate software. Our combined VLIW processor with hardware functions was compared to software executing on a RISC processor, specifically the soft-core embedded NIOS II processor. For software kernels converted into hardware functions, we show a hardware performance multiplier of up to 230 times that of software, with an average of 63 times. For the entire application, in which only a portion of the software is converted to hardware, the performance improvement is as much as 30X over the nonaccelerated application, with a 12X improvement on average.

international parallel and distributed processing symposium | 2005

Switch design to enable predictive multiplexed switching in multiprocessor networks

Zhu Ding; Raymond R. Hoare; Dan Li; Shou-Kuo Shao; Shen-Chien Tung; Jiang Zheng; Rami G. Melhem

Predictive multiplexed switching is a new approach for building interconnection switches for high performance parallel systems. This approach advocates sacrificing some link bandwidth in return for more efficient network control and simpler connection management. The main idea is to depart from the traditional packet and wormhole switching in favor of raw data communication over established communication pipes (connections). The overhead of this circuit switching approach can be justified when established connections are repeatedly used before they are torn down. For this, we use multiplexing to allow multiple connections to share the same resources (links and switches), thus avoiding tearing down connections prematurely. The connection establishment overhead is further reduced by exploring communication locality and predictability in applications that exhibit these properties. We present the design of an interconnection system which is based on multiplexed switching and which establishes connections either reactively, in response to dynamically generated requests, or proactively, in response to compiler or application directives. A communication prediction component may be supported to reduce the network control overhead in applications that exhibit communication locality and predictability. The design is evaluated using hardware design, synthesis, and cycle-accurate simulation. Comparison with more traditional switching paradigms shows the potential of our predictive multiplexed switching approach.

ACM Transactions on Embedded Computing Systems | 2006

Reducing power while increasing performance with SuperCISC

Raymond R. Hoare; Dara Kusic; Gayatri Mehta; Joshua Fazekas; John Foster

Multiprocessor Systems on Chips (MPSoCs) have become a popular architectural technique to increase performance. However, MPSoCs may lead to undesirable power consumption characteristics for computing systems that have strict power budgets, such as PDAs, mobile phones, and notebook computers. This paper presents the super-complex instruction-set computing (SuperCISC) Embedded Processor Architecture and, in particular, investigates performance and power consumption of this device compared to traditional processor architecture-based execution. SuperCISC is a heterogeneous, multicore processor architecture designed to exceed performance of traditional embedded processors while maintaining a reduced power budget compared to low-power embedded processors. At the heart of the SuperCISC processor is a multicore VLIW (Very Long Instruction Word) containing several homogeneous execution cores/functional units. In addition, complex and heterogeneous combinational hardware function cores are tightly integrated to the core VLIW engine, providing an opportunity for improved performance and reduced energy consumption. Our SuperCISC processor core has been synthesized for both a 90-nm Stratix II Field Programmable Gate Array (FPGA) and a 160-nm standard cell Application-Specific Integrated Circuit (ASIC) fabrication process from OKI, each operating at approximately 167 MHz for the VLIW core. We examine several reasons for speedup and power improvement through the SuperCISC architecture, including predicated control flow, cycle compression, and a reduction in arithmetic power consumption, which we call power compression. Finally, testing our SuperCISC processor with multimedia and signal-processing benchmarks, we show how the SuperCISC processor can provide performance improvements ranging from 7X to 160X with an average of 60X, while also providing orders of magnitude of power improvements for the computational kernels. The power improvements for our benchmark kernels range from just over 40X to over 400X, with an average savings exceeding 130X. By combining these power and performance improvements, our total energy improvements all exceed 1000X. As these savings are limited to the computational kernels of the applications, which often consume approximately 90% of the execution time, we expect our savings to approach the ideal application improvement of 10X.
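The closing arithmetic of this abstract can be checked with a short Python sketch. It assumes the 10X ceiling comes from Amdahl's law applied to a kernel fraction of 0.9, and that the energy improvement is the product of the power and performance improvements, since energy is power multiplied by time. The function names are hypothetical and are not taken from the paper.

```python
def application_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only the kernels speed up."""
    serial_part = 1.0 - kernel_fraction
    return 1.0 / (serial_part + kernel_fraction / kernel_speedup)


def energy_improvement(power_improvement, performance_improvement):
    """Energy = power x time, so the two improvement ratios multiply."""
    return power_improvement * performance_improvement


# Kernels take about 90% of execution time (figure from the abstract).
ideal = application_speedup(0.9, float("inf"))   # upper bound, close to 10
typical = application_speedup(0.9, 60.0)         # with the average 60X kernel speedup
```

With a kernel fraction of 0.9, even an infinitely fast kernel leaves the remaining 10% of the run untouched, which gives the ideal improvement of about 10X. With the reported average kernel speedup of 60X, the same formula gives roughly 8.7X. The per-benchmark figures are not listed in the abstract, so the sketch does not try to reproduce them.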

conference on high performance computing (supercomputing) | 2006

Level-wise scheduling algorithm for fat tree interconnection networks

Zhu Ding; Raymond R. Hoare; Rami G. Melhem

This paper presents an efficient hardware architecture for scheduling connections on a fat-tree interconnection network for parallel computing systems. Our technique utilizes global routing information to select upward routing paths so that most conflicts can be resolved. Thus, more connections can be successfully scheduled compared with a local scheduler. As a result of applying our technique to two-level, three-level and four-level fat-tree interconnection networks of various sizes in the range of 64 to 4096 nodes, we observe that the improvement of schedulability ratio averages 30% compared with greedy or random local scheduling. Our technique is also scalable and shows increased benefits for large system sizes.
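The difference between local and global path selection can be seen in a toy Python model of a two-level fat tree. It is a simplified sketch, not the paper's hardware scheduler. The function `schedule` and its parameters are hypothetical. The model assumes every leaf switch has one uplink and one downlink to each top-level switch, and that a connection between two leaf switches needs both links through the same top-level switch to be free.

```python
def schedule(requests, num_leaves, num_tops, use_global):
    """Grant (src_leaf, dst_leaf) requests in order; return (src, dst, top) tuples."""
    up_busy = [[False] * num_tops for _ in range(num_leaves)]
    down_busy = [[False] * num_tops for _ in range(num_leaves)]
    granted = []
    for src, dst in requests:
        # Uplinks that are still free at the source leaf switch.
        candidates = [k for k in range(num_tops) if not up_busy[src][k]]
        if use_global:
            # Global view: keep only top switches whose downlink to dst is free.
            candidates = [k for k in candidates if not down_busy[dst][k]]
        if not candidates:
            continue
        top = candidates[0]
        if down_busy[dst][top]:
            # The local greedy choice collided on the downward path; request fails.
            continue
        up_busy[src][top] = True
        down_busy[dst][top] = True
        granted.append((src, dst, top))
    return granted


requests = [(0, 2), (1, 2)]
local = schedule(requests, num_leaves=3, num_tops=2, use_global=False)
with_global = schedule(requests, num_leaves=3, num_tops=2, use_global=True)
```

In this example both requests target leaf switch 2. The local greedy scheduler sends both through top switch 0, so the second request is blocked on the shared downlink and only one connection is granted. The scheduler with global information routes the second request through top switch 1, so both connections are granted.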

design automation conference | 2006

An automated, reconfigurable, low-power RFID tag

Raymond R. Hoare; Swapna Dontharaju; Shen Chih Tung; Ralph Sprang; Joshua Fazekas; James T. Cain; Marlin H. Mickle

This paper describes an ultra-low-power active RFID tag and its automated design flow. RFID primitives to be supported by the tag are enumerated with RFID macros, and the behavior of each primitive is specified using ANSI C within the template to automatically generate the tag controller. Two power-saving components, a passive transceiver/burst switch and a smart buffer, are presented to save power and increase tag lifetime. Based on a test program, the processors required 183, 43, and 19 µJ per transaction for StrongARM, XScale, and EISC processors, respectively. Three hardware controllers using a Fusion FPGA, Coolrunner II CPLD, and ASIC required 13 nJ, 1.3 nJ, and 0.07 nJ per transaction.
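To put the per-transaction figures on one scale, the following Python sketch converts the processor results from microjoules to nanojoules and computes ratios. The energy values are the ones quoted in the abstract. The dictionary `ENERGY_NJ` and the function `savings` are hypothetical names used only for this illustration.

```python
# Energy per transaction in nanojoules (1 microjoule = 1000 nanojoules).
ENERGY_NJ = {
    "StrongARM": 183e3,
    "XScale": 43e3,
    "EISC": 19e3,
    "Fusion FPGA": 13.0,
    "Coolrunner II CPLD": 1.3,
    "ASIC": 0.07,
}


def savings(baseline, candidate):
    """Return how many times less energy `candidate` uses than `baseline`."""
    return ENERGY_NJ[baseline] / ENERGY_NJ[candidate]


# Most efficient processor against the least efficient hardware controller.
narrowest_gap = savings("EISC", "Fusion FPGA")
# Least efficient processor against the most efficient hardware controller.
widest_gap = savings("StrongARM", "ASIC")
```

Even the narrowest gap, EISC against the Fusion FPGA, is a factor of more than 1,000. The widest, StrongARM against the ASIC, is a factor of more than two million.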

EURASIP Journal on Embedded Systems | 2006

Speech silicon: an FPGA architecture for real-time hidden Markov-model-based speech recognition

Jeffrey William Schuster; Kshitij Gupta; Raymond R. Hoare

This paper examines the design of an FPGA-based system-on-a-chip capable of performing continuous speech recognition on medium sized vocabularies in real time. Through the creation of three dedicated pipelines, one for each of the major operations in the system, we were able to maximize the throughput of the system while simultaneously minimizing the number of pipeline stalls in the system. Further, by implementing a token-passing scheme between the later stages of the system, the complexity of the control was greatly reduced and the amount of active data present in the system at any time was minimized. Additionally, through in-depth analysis of the SPHINX 3 large vocabulary continuous speech recognition engine, we were able to design models that could be efficiently benchmarked against a known software platform. These results, combined with the ability to reprogram the system for different recognition tasks, serve to create a system capable of performing real-time speech recognition in a vast array of environments.

field-programmable custom computing machines | 2006

A Field Programmable RFID Tag and Associated Design Flow

Raymond R. Hoare; Swapna Dontharaju; Shen Chih Tung; Ralph Sprang; Joshua Fazekas; James T. Cain; Marlin H. Mickle

Current radio frequency identification (RFID) systems generally have long design times and low tolerance to changes in specification. This paper describes a field programmable, low-power active RFID tag, and its associated specification and automated design flow. RFID primitives to be supported by the tag are enumerated with RFID macros, or assembly-like descriptions of the tag operations. From these, the RFID preprocessor generates templates automatically. The behavior of each RFID primitive is specified using ANSI C in the template. The resulting file is compiled by the RFID compiler. A smart buffer sits between the transceiver and the tag controller to detect whether incoming packets are intended for the tag. By doing so, the main controller may remain powered down to reduce power consumption. Two system-on-a-chip implementation strategies are presented. The first is a microprocessor-based system for which a C program is automatically generated. The second includes a block of low-power FPGA logic. The user-supplied RFID logic in ANSI C is automatically converted into combinational VHDL by the RFID compiler. Based on a test program, the processors required 183, 43, and 19 µJ per transaction for StrongARM, XScale, and EISC processors, respectively. By replacing the processor with a Coolrunner II, the controller can be reduced to 1.11 nJ per transaction.
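The role of the smart buffer can be sketched in a few lines of Python. The sketch is illustrative only and does not reproduce the paper's hardware design. The class `SmartBuffer`, the `wake_controller` callback, and the packet layout (a 2-byte big-endian destination tag ID followed by the payload) are all assumptions made for this example.

```python
class SmartBuffer:
    """Filters incoming packets so the main controller wakes only when addressed."""

    def __init__(self, tag_id, wake_controller):
        self.tag_id = tag_id
        self.wake_controller = wake_controller  # called with the payload on a match

    def receive(self, packet: bytes) -> bool:
        # Assumed layout: bytes 0-1 hold the destination tag ID (big-endian).
        if len(packet) < 2:
            return False  # malformed packet; the controller stays powered down
        destination = int.from_bytes(packet[:2], "big")
        if destination != self.tag_id:
            return False  # addressed to another tag; the controller stays powered down
        self.wake_controller(packet[2:])
        return True


handled = []
buffer = SmartBuffer(tag_id=0x0102, wake_controller=handled.append)
buffer.receive(bytes([0x01, 0x02, 0xAA]))  # matches this tag: controller is woken
buffer.receive(bytes([0x09, 0x09, 0xBB]))  # another tag's packet: ignored
```

After these two calls, `handled` holds only the payload of the first packet. This mirrors the stated purpose of the buffer: traffic meant for other tags never reaches the main controller, so it can remain powered down.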

international parallel and distributed processing symposium | 2006

Design space exploration for low-power reconfigurable fabrics

Gayatri Mehta; Raymond R. Hoare; Justin Stander

This paper presents a parameterizable, coarse-grained, reconfigurable fabric model that attempts to maintain field programmable gate array (FPGA)-like programmability and computer-aided design (CAD), with application-specific integrated circuit (ASIC)-like power characteristics for digital signal processing (DSP) style applications. Using this model, architectural design space decisions are explored in order to define an energy-efficient fabric. The impact on energy and performance of varying parameters such as data width and interconnection flexibility has been studied. Multiplexer cardinality usage has also been studied by mapping several signal processing applications onto the fabric. The results point to the use of power-optimized 32-bit-wide computational elements interconnected by low-cardinality multiplexers such as 4:1 multiplexers.