Randy Huang

University of California, Berkeley

Publication

Featured research published by Randy Huang.

field programmable gate arrays | 1999

HSRA: high-speed, hierarchical synchronous reconfigurable array

William Tsu; Kip Macy; Atul Joshi; Randy Huang; Norman Walker; Tony Tung; Omid Rowhani; Varghese George; John Wawrzynek; André DeHon

There is no inherent characteristic forcing Field Programmable Gate Array (FPGA) or Reconfigurable Computing (RC) Array cycle times to be greater than processors in the same process. Modern FPGAs seldom achieve application clock rates close to their processor cousins because (1) resources in the FPGAs are not balanced appropriately for high-speed operation, (2) FPGA CAD does not automatically provide the requisite transforms to support this operation, and (3) interconnect delays can be large and vary almost continuously, complicating high frequency mapping. We introduce a novel reconfigurable computing array, the High-Speed, Hierarchical Synchronous Reconfigurable Array (HSRA), and its supporting tools. This package demonstrates that computing arrays can achieve efficient, high-speed operation. We have designed and implemented a prototype component in a 0.4 µm logic design on a DRAM process which will support 250 MHz operation for CAD mapped designs.

field programmable gate arrays | 2002

Analysis of quasi-static scheduling techniques in a virtualized reconfigurable machine

Yury Markovskiy; Eylon Caspi; Randy Huang; Joseph Yeh; Michael Chu; John Wawrzynek; André DeHon

The SCORE compute model uses fixed-size, virtual compute and memory pages connected by stream links to capture the definition of a computation abstracted from the detailed size of the physical hardware. When the number of physical compute pages is smaller than the number of virtual compute pages in the abstract computation graph, the design is time-multiplexed onto the available physical hardware. A key component of this strategy is an automatic scheduler that selects the temporal sequencing of virtual resources onto the physical device. We describe a quasi-static scheduling strategy that retains the full semantic power of the dynamic SCORE flow graph while taking advantage of static scheduling techniques at program load time to hoist most of the computational work out of the inner scheduling loops. This strategy reduces online scheduling work per reconfiguration epoch by an order of magnitude. In addition, a more global perspective available from offline scheduling improves schedule quality, resulting in a net reduction of total execution time by 46–81%.

Microprocessors and Microsystems | 2006

Stream Computations Organized for Reconfigurable Execution

André DeHon; Yury Markovsky; Eylon Caspi; Michael Chu; Randy Huang; Stylianos Perissakis; Laura Pozzi; Joseph Yeh; John Wawrzynek

Reconfigurable systems can offer the high spatial parallelism and fine-grained, bit-level resource control traditionally associated with hardware implementations, along with the flexibility and adaptability characteristic of software. While reconfigurable systems create new opportunities for engineering and delivering high-performance programmable systems, the traditional approaches to programming and managing computations used for hardware systems (e.g., Verilog, VHDL) and software systems (e.g., C, Fortran, Java) are inappropriate and inadequate for exploiting reconfigurable platforms. To address this need, we develop a stream-oriented compute model, system architecture, and execution patterns which can capture and exploit the parallelism of spatial computations while simultaneously abstracting software applications from hardware details (e.g., timing, device capacity, and microarchitectural implementation details) and consequently allowing applications to scale to exploit newer, larger, and faster hardware platforms. Further, we describe hardware and software techniques that make this late-bound platform mapping viable and efficient.

field-programmable custom computing machines | 2002

Hardware-assisted fast routing

André DeHon; Randy Huang; John Wawrzynek

To fully realize the benefits of partial and rapid reconfiguration of field-programmable devices, we often need to dynamically schedule computing tasks and generate instance-specific configurations, new graphs which must be routed during program execution. Consequently, route time can be a significant overhead cost reducing the achievable net benefits of dynamic configuration generation. By adding hardware to accelerate routing, we show that it is possible to compute routes in one thousandth the time of a traditional, software router and achieve routes that are within 5% of the state-of-the-art offline routing algorithms for a sample set of application netlists and within 25% for a set of difficult synthetic benchmarks. We further outline how strategic use of parallelism can allow the total route time to scale substantially less than linearly in graph size. We detail the source of the benefits in our approach and survey a range of options for hardware assistance that vary from a speedup of over 10× with modest hardware overhead to speedups in excess of 1000×.

field programmable gate arrays | 2003

Stochastic, spatial routing for hypergraphs, trees, and meshes

Randy Huang; John Wawrzynek; André DeHon

FPGA place and route is time consuming, often serving as the major obstacle inhibiting a fast edit-compile-test loop in prototyping and development and the major obstacle preventing late-bound hardware and design mapping for reconfigurable systems. Previous work showed that hardware-assisted routing can accelerate fanout-free routing on Fat-Trees by three orders of magnitude with modest modifications to the network itself. In this paper, we show how these techniques can be applied to any FPGA and how they can be implemented on top of LUT networks in cases where modification of the FPGA itself is not justified. We further show how to accommodate fanout and how to achieve comparable route quality to software-based methods. For a tree network, we estimate an FPGA implementation of our routing logic could route the Toronto Place and Route Benchmarks at least two orders of magnitude faster than a software Pathfinder while achieving within 3% of the aggregate quality. Preliminary results on small mesh benchmarks achieve within one track of vpr-fast.

Microprocessors and Microsystems | 2006

Stochastic spatial routing for reconfigurable networks

André DeHon; Randy Huang; John Wawrzynek

FPGA placement and routing is time consuming, often serving as the major obstacle inhibiting a fast edit-compile-test loop in prototyping and development and the major obstacle preventing late-bound hardware and design mapping for reconfigurable systems. We introduce a stochastic search scheme which can achieve comparable route quality to traditional, software-based routers while being amenable to parallel, spatial implementation. We quantify the quality and performance of this route scheme using the Toronto Place-and-Route Challenge benchmarks. We sketch hardware implementations ranging from a minimal hardware-search assistance scheme which provides two orders of magnitude speedup, to FPGA-based schemes which provide greater speedup, to full hardware schemes which provide over three orders of magnitude routing acceleration. For coarse-grained devices with wide-word datapaths, the area overhead for integrating this hardware support into the network can be below 30%; for conventional FPGAs, a collection of hundreds of FPGAs can be configured to route one FPGA rapidly. With parallel path searches, the time required for the spatial solution scales sublinearly in network size for the typical, limited-bisection networks used for practical reconfigurable systems.

field programmable custom computing machines | 2017

Fine-Grained Acceleration of Binary Neural Networks Using Intel® Xeon® Processor with Integrated FPGA

Philip Colangelo; Randy Huang; Enno Luebbers; Martin Margala; Kevin Nealis

Binary weighted networks (BWN) for image classification reduce computation for convolutional neural networks (CNN) from multiply-adds to accumulates with little to no accuracy loss. Hardware architectures such as FPGA can take full advantage of BWN computations because of their ability to express weights represented as 0 and 1 efficiently through customizable logic. In this paper, we present an implementation on Intel's Xeon® processor with integrated FPGA to accelerate binary weighted networks. We interface Intel's Accelerator Abstraction Layer (AAL) with Caffe to provide a robust framework used for accelerating CNN. Utilizing the low latency Quick Path Interconnect (QPI) between the Broadwell Xeon® processor and Arria 10 FPGA, we can perform fine-grained offloads for specific portions of the network. Due to convolution layers making up most of the computation in our experiments, we offload the feature and weight data to customized binary hardware in the FPGA for faster execution. An initial proof of concept design shows that by using both the Xeon processor and FPGA together we can improve the throughput by 2x on some layers and by 1.3x overall while utilizing only a small percentage of FPGA core logic.

Archive | 2000