Gerald R. Morris
University of Southern California
Publications
Featured research published by Gerald R. Morris.
IEEE Transactions on Parallel and Distributed Systems | 2007
Ling Zhuo; Gerald R. Morris; Viktor K. Prasanna
Field-programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations such as matrix-vector multiplication and dot product involve the reduction of a sequentially produced stream of values. Unfortunately, because of the pipelining in FPGA-based floating-point units, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can adversely impact the performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design trade-offs among the number of adders, buffer size, and latency. We then propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implemented our designs and present performance and area results.
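The striding idea above has a simple software analogue (a sketch only; the 10-stage pipeline depth is an assumed illustrative value, not a figure from the paper): keep one partial accumulator per adder-pipeline stage, so that no two consecutive additions are data-dependent and the pipeline never stalls.

```python
PIPELINE_DEPTH = 10  # assumed adder latency in cycles (illustrative)

def strided_reduce(stream, depth=PIPELINE_DEPTH):
    # One partial sum per pipeline stage: the addition issued at cycle t
    # depends only on the addition issued at cycle t - depth, which has
    # already left the pipeline, so there is no read-after-write hazard.
    partials = [0.0] * depth
    for i, x in enumerate(stream):
        partials[i % depth] += x
    # A short cleanup pass combines the surviving partial sums.
    return sum(partials)
```

In hardware the cleanup pass is itself a small reduction; this model just shows why `depth` independent accumulators suffice to hide the adder latency.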
international parallel and distributed processing symposium | 2005
Ling Zhuo; Gerald R. Morris; Viktor K. Prasanna
The use of pipelined floating-point arithmetic cores to create high-performance FPGA-based computational kernels has introduced a new class of problems that do not exist when using single-cycle arithmetic cores. In particular, the data hazards associated with pipelined floating-point reduction circuits can limit the scalability or severely reduce the performance of an otherwise high-performance computational kernel. The inability to efficiently execute the reduction in hardware coupled with memory bandwidth issues may even negate the performance gains derived from hardware acceleration of the kernel. In this paper we introduce a method for developing scalable floating-point reduction circuits that run in optimal time while requiring only Θ(lg(n)) space and a single pipelined floating-point unit. Using a Xilinx Virtex-II Pro as the target device, we implement reference instances of our reduction method and present the FPGA design statistics supporting our scalability claims.
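The Θ(lg(n)) space bound can be mimicked in software with a binary-counter scheme (a sketch of the idea, not the paper's circuit): each buffer slot holds a partial sum covering 2^k inputs, and an arriving value "carries" upward whenever it meets an occupied slot, so at most about ⌈lg n⌉ + 1 slots are live at any time.

```python
def tree_reduce(stream):
    # slots[k] is either None or a partial sum covering 2**k inputs.
    slots = []
    for x in stream:
        x = float(x)
        level = 0
        # Carry upward while a same-sized partial sum already exists,
        # exactly like incrementing a binary counter.
        while level < len(slots) and slots[level] is not None:
            x += slots[level]
            slots[level] = None
            level += 1
        if level == len(slots):
            slots.append(None)
        slots[level] = x
    # Only O(lg n) partial sums survive; combine them.
    return sum(s for s in slots if s is not None)
```

The hardware version must additionally schedule these combines through one pipelined adder without stalls, which is the hard part the paper addresses.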
field-programmable custom computing machines | 2006
Gerald R. Morris; Viktor K. Prasanna; Richard D. Anderson
Supercomputer companies such as Cray, Silicon Graphics, and SRC Computers now offer reconfigurable computer (RC) systems that combine general-purpose processors (GPPs) with field-programmable gate arrays (FPGAs). The FPGAs can be programmed to become, in effect, application-specific processors. These exciting supercomputers allow end-users to create custom computing architectures aimed at the computationally intensive parts of each problem. This report describes a parameterized, parallelized, deeply pipelined, dual-FPGA, IEEE-754 64-bit floating-point design for accelerating the conjugate gradient (CG) iterative method on an FPGA-augmented RC. The FPGA-based elements are developed via a hybrid approach that uses a high-level language (HLL)-to-hardware description language (HDL) compiler in conjunction with custom-built, VHDL-based, floating-point components. A reference version of the design is implemented on a contemporary RC. Actual run time performance data compare the FPGA-augmented CG to the software-only version and show that the FPGA-based version runs 1.3 times faster than the software version. Estimates show that the design can achieve a fourfold speedup on a next-generation RC.
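For reference, the CG iteration being accelerated is the textbook algorithm below (plain Python for a small dense symmetric positive-definite system; the FPGA design pipelines the dot products and matrix-vector products that dominate each step).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def axpy(a, x, y):
    # a*x + y, elementwise
    return [a * xi + yi for xi, yi in zip(x, y)]

def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    x = [0.0] * len(b)
    r = b[:]            # residual b - A x, with x = 0 initially
    p = r[:]            # search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, Ap, r)
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = axpy(rs_new / rs, p, r)   # p = r + beta * p
        rs = rs_new
    return x
```

Each iteration is one matrix-vector product plus a handful of dot products and vector updates; the per-row accumulations inside `matvec` are exactly the sequential floating-point reductions the companion papers attack.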
IEEE Computer | 2007
Viktor K. Prasanna; Gerald R. Morris
Using a high-level-language to hardware-description-language compiler and some novel architectures and algorithms to map two well-known double-precision floating-point sparse matrix iterative-linear-equation solvers - the Jacobi and conjugate gradient methods - onto a reconfigurable computer achieves more than a twofold speedup over software.
field-programmable custom computing machines | 2005
Gerald R. Morris; Ling Zhuo; Viktor K. Prasanna
FPGA-based floating-point kernels must exploit algorithmic parallelism and use deeply pipelined cores to gain a performance advantage over general-purpose processors. Inability to hide the latency of lengthy pipelines can significantly reduce the performance or impose unrealistic buffer requirements. Designs requiring reduction operations such as accumulation are particularly susceptible. In this paper we introduce two high-performance FPGA-based methods for reducing multiple sets of sequentially delivered floating-point values in optimal time without stalling the pipeline.
application-specific systems, architectures, and processors | 2006
Gerald R. Morris; Viktor K. Prasanna; Richard D. Anderson
Reconfigurable computers (RCs) that combine general-purpose processors with field-programmable gate arrays (FPGAs) are now available. In these exciting systems, the FPGAs become reconfigurable application-specific processors (ASPs). Specialized high-level language (HLL) to hardware description language (HDL) compilers allow these ASPs to be reconfigured using HLLs. In our research we describe a novel toroidal data structure and scheduling algorithm that allows us to use an HLL-to-HDL environment to implement a high-performance ASP that reduces multiple, variable-length sets of 64-bit floating-point data. We demonstrate the effectiveness of our ASP by using it to accelerate a sparse matrix iterative solver. We compare actual wall clock run times of a production-quality software iterative solver with an ASP-augmented version of the same solver on a current generation RC. Our ASP-augmented solver runs up to 2.4 times faster than software. Estimates show that this same design can run over 6.4 times faster on a next-generation RC.
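The operation the ASP performs — reducing multiple, variable-length, possibly interleaved sets of values — has a simple software meaning (a sketch only; the set IDs and dictionary are illustrative and bear no resemblance to the toroidal hardware structure or its scheduler).

```python
def reduce_sets(stream):
    # stream yields (set_id, value) pairs; sets may interleave freely and
    # have arbitrary lengths, as row sums do in sparse matrix-vector multiply.
    sums = {}
    for sid, v in stream:
        sums[sid] = sums.get(sid, 0.0) + v
    return sums
```

For example, `reduce_sets([(0, 1.0), (1, 2.0), (0, 3.0)])` yields one sum per set. Doing this in hardware with a single deeply pipelined adder, without stalling between sets of unpredictable length, is the scheduling problem the toroidal structure solves.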
IEEE Transactions on Parallel and Distributed Systems | 2013
Gerald R. Morris; Khalid H. Abed
High-performance heterogeneous computers that employ field programmable gate arrays (FPGAs) as computational elements are known as high-performance reconfigurable computers (HPRCs). For floating-point applications, these FPGA-based processors must satisfy a variety of heuristics and rules of thumb to achieve a speedup compared with their software counterparts. By way of a simple sparse matrix Jacobi iterative solver, this paper illustrates some of the issues associated with mapping floating-point kernels onto HPRCs. The Jacobi method was chosen based on heuristics developed from earlier research. Furthermore, Jacobi is relatively easy to understand, yet is complex enough to illustrate the mapping issues. This paper is not trying to demonstrate the speedup of a particular application nor is it suggesting that Jacobi is the best way to solve equations. The results demonstrate a nearly threefold wall clock runtime speedup when compared with a software implementation. A formal analysis shows that these results are reasonable. The purpose of this paper is to illuminate the challenging floating-point mapping process while simultaneously showing that such mappings can result in significant speedups. The ideas revealed by research such as this have already been and should continue to be used to facilitate a more automated mapping process.
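The Jacobi method used as the mapping vehicle is, in textbook form, the short loop below (plain Python; it assumes a diagonally dominant system so the iteration converges — the hard part the paper discusses is mapping this onto pipelined FPGA floating-point hardware, not the algorithm itself).

```python
def jacobi(A, b, tol=1e-10, max_iter=1000):
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        # Each new component uses only the *previous* iterate, so all n
        # updates are independent -- the parallelism the FPGA exploits.
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i))
                 / A[i][i]
                 for i in range(n)]
        if max(abs(xn - xo) for xn, xo in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x
```

The independence of the component updates (unlike Gauss-Seidel, which chains them) is one of the heuristics that makes Jacobi a natural fit for a deeply pipelined datapath.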
Journal of Parallel and Distributed Computing | 2008
Gerald R. Morris; Viktor K. Prasanna
Reconfigurable computers (RCs) combine general-purpose processors (GPPs) with field programmable gate arrays (FPGAs). The FPGAs are reconfigured at run time to become application-specific processors that collaborate with the GPPs to execute the application. High-level language (HLL) to hardware description language (HDL) compilers allow the FPGA-based kernels to be generated using HLL-based programming rather than HDL-based hardware design. Unfortunately, the loops needed for floating-point reduction operations often cannot be pipelined by these HLL-HDL compilers. This capability gap prevents the development of a number of important FPGA-based kernels. This article describes a novel architecture and algorithm that allow the use of an HLL-HDL environment to implement high-performance FPGA-based kernels that reduce multiple, variable-length sets of floating-point data. A sparse matrix iterative solver is used to demonstrate the effectiveness of the reduction kernel. The FPGA-augmented version running on a contemporary RC is up to 2.4 times faster than the software-only version of the same solver running on the GPP. Conservative estimates show the solver will run up to 6.3 times faster than software on a next-generation RC.
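In a sparse iterative solver, the variable-length reductions are the per-row accumulations of a sparse matrix-vector product. In compressed sparse row (CSR) form the loop looks like this (a plain-Python sketch of the kernel being accelerated, not the paper's hardware loop):

```python
def csr_spmv(values, col_idx, row_ptr, x):
    # y[i] is the reduction of a variable-length set: one product per
    # nonzero in row i. Row lengths vary arbitrarily, which is exactly
    # why a fixed-depth pipelined adder cannot naively accumulate them.
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

It is this inner `acc +=` loop over an unpredictable trip count that the HLL-HDL compilers of the time could not pipeline, motivating the custom reduction architecture.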
ieee international conference on high performance computing data and analytics | 2009
Khalid H. Abed; Gerald R. Morris
Parallel codes with large-stride/irregular-stride (L/I) memory access patterns, e.g., sparse matrix and linked list codes, often perform poorly on mainstream clusters because of the general purpose processor (GPP) memory hierarchy. High performance reconfigurable computers (HPRCs) are parallel computing clusters containing multiple GPPs and field programmable gate arrays (FPGAs) connected via a high-speed network. In this research, simple 64-bit floating-point parallel codes are used to illustrate the performance impact of L/I memory accesses in software (SW) and FPGA-augmented (FA) codes and to assess the benefits of mapping L/I-type codes onto HPRCs. The experiments reveal that large-stride SW codes, particularly those involving data reuse, experience severe performance degradation compared with unit-stride SW codes. In contrast, large-stride FA codes experience minimal degradation compared with unit-stride FA codes. More importantly, for codes that involve data reuse, the experiments demonstrate performance improvements of up to nearly tenfold for large-stride FA codes compared with large-stride SW codes.
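The access pattern at issue can be shown in a few lines (illustrative only: in interpreted Python the interpreter overhead swamps cache effects, so this conveys the pattern, not the measured slowdown).

```python
def strided_sum(data, stride):
    # Same elements, different order: stride-separated passes touch
    # addresses far apart, defeating the GPP cache's spatial locality
    # once the working set greatly exceeds the cache.
    total = 0.0
    for start in range(stride):
        for i in range(start, len(data), stride):
            total += data[i]
    return total
```

In a compiled language, timing `strided_sum` with `stride=1` against a large stride over an array much bigger than the last-level cache exhibits the degradation the paper measures on the software side; the FPGA side sidesteps it because the FPGA's memory subsystem has no comparable cache hierarchy to defeat.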
southeastcon | 2011
Nikeya S. Peay; Gerald R. Morris; Khalid H. Abed
One of the newest computational technologies is the high performance heterogeneous computer (HPHC) wherein dissimilar computational devices such as general purpose processors, graphics processors, field programmable gate arrays (FPGAs), etc., are used within a single platform to obtain a computational speedup. Jackson State University has a state-of-the-art HPHC cluster (an SRC-7), which contains traditional CPUs and reconfigurable processing units. The reconfigurable units are implemented using SRAM-based FPGAs. Currently, the off-the-shelf SRC-7 mechanism for incorporating user components (macros) does not directly support the common case of a multiple file VHDL hierarchy. This research explores a novel approach that allows multiple file VHDL floating-point kernels to be mapped onto the SRC-7. The approach facilitates the development of FPGA-based components via a hybrid technique that uses the SRC Carte compiler in conjunction with multiple file VHDL-based user macros. This research shows how Quartus Wizard-based VHDL floating-point components can be integrated into the Carte development environment.