Jason R. Villarreal | Researchain

Publication

Featured researches published by Jason R. Villarreal.

field-programmable custom computing machines | 2010

Designing Modular Hardware Accelerators in C with ROCCC 2.0

Jason R. Villarreal; Adrian Park; Walid A. Najjar; Robert J. Halstead

While FPGA-based hardware accelerators have repeatedly been demonstrated as a viable option, their programmability remains a major barrier to their wider acceptance by application code developers. These platforms are typically programmed in a low-level hardware description language, a skill not common among application developers and a process that is often tedious and error-prone. Programming FPGAs from high-level languages would provide easier integration with software systems as well as open up hardware accelerators to a wider spectrum of application developers. In this paper, we present a major revision to the Riverside Optimizing Compiler for Configurable Circuits (ROCCC) designed to create hardware accelerators from C programs. Novel additions to ROCCC include (1) intuitive modular bottom-up design of circuits from C, and (2) separation of code generation from specific FPGA platforms. The additions we make do not introduce any new syntax to the C code and maintain the high-level optimizations from the ROCCC system that generate efficient code. The modular code we support functions identically as software or hardware. Additionally, we enable user control of hardware optimizations such as systolic array generation and temporal common subexpression elimination. We evaluate the quality of the ROCCC 2.0 tool by comparing it to hand-written VHDL code. We show comparable clock frequencies and an 18% higher throughput. The productivity advantages of ROCCC 2.0 are evaluated using the metrics of lines of code and programming time, showing an average 15x improvement over hand-written VHDL.

languages compilers and tools for embedded systems | 2003

Profiling tools for hardware/software partitioning of embedded applications

Dinesh C. Suresh; Walid A. Najjar; Frank Vahid; Jason R. Villarreal; Greg Stitt

Loops constitute the most executed segments of programs and therefore are the best candidates for hardware/software partitioning. We present a set of profiling tools that are specifically dedicated to loop profiling and support combined function and loop profiling. One tool relies on an instruction set simulator and can therefore be augmented with simulation of architecture and micro-architecture features, while the other is based on compile-time instrumentation of gcc and therefore has very little slowdown compared to the original program. We use the results of the profiling to identify the compute core in each benchmark and study the effect of compile-time optimization on the distribution of cores in a program. We also study the potential speedup that can be achieved using a configurable system on a chip, consisting of a CPU embedded on an FPGA, as an example application of these tools in hardware/software partitioning.

Design Automation for Embedded Systems | 2002

Improving Software Performance with Configurable Logic

Jason R. Villarreal; Dinesh C. Suresh; Greg Stitt; Frank Vahid; Walid A. Najjar

We examine the energy and performance benefits that can be obtained by re-mapping frequently executed loops from a microprocessor to reconfigurable logic. We present a design flow that finds critical software loops automatically and manually re-implements these in configurable logic by implementing them in SA-C, a C language variation supporting a dataflow computation model and designed to specify and map DSP applications onto reconfigurable logic. We apply this design flow on several examples from the MediaBench benchmark suite and report the energy and performance improvements.

field-programmable custom computing machines | 2002

Using on-chip configurable logic to reduce embedded system software energy

Greg Stitt; Brian Grattan; Jason R. Villarreal; Frank Vahid

We examine the energy savings possible by re-mapping critical software loops from a microprocessor to configurable logic appearing on the same chip in commodity chips now commercially available. That logic is typically intended to implement peripherals and coprocessors without increasing chip count, but we show that reduced software energy is an additional benefit, making such chips even more useful. We find critical software loops and re-implement them in the configurable logic such that a repeating software task completes sooner, allowing us to put the system in a low-power state for longer periods, thus reducing energy. We use simulations and estimations for a hypothetical device having a 32-bit MIPS processor plus configurable logic, yielding energy savings of 25%, increasing to 39% assuming voltage scaling. We physically measured several examples running on two commercial single-chip devices, one having an 8-bit 8051 microprocessor plus configurable logic and the other a 32-bit ARM microprocessor with configurable logic, with energy savings of 71% and 53% respectively, increasing to an estimated 89% and 75% assuming voltage scaling.

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2015

FHAST: FPGA-Based Acceleration of Bowtie in Hardware

Edward Fernandez; Jason R. Villarreal; Stefano Lonardi; Walid A. Najjar

While the sequencing capability of modern instruments continues to increase exponentially, the computational problem of mapping short sequenced reads to a reference genome still constitutes a bottleneck in the analysis pipeline. A variety of mapping tools (e.g., BOWTIE, BWA) is available for general-purpose computer architectures. These tools can take many hours or even days to deliver mapping results, depending on the number of input reads, the size of the reference genome and the number of allowed mismatches or insertions/deletions, making the mapping problem an ideal candidate for hardware acceleration. In this paper, we present FHAST (FPGA hardware accelerated sequence-matching tool), a drop-in replacement for BOWTIE that uses a hardware design based on field programmable gate arrays (FPGA). Our architecture masks memory latency by executing multiple concurrent hardware threads accessing memory simultaneously. FHAST is composed of multiple parallel engines to exploit the parallelism available to us on an FPGA. We have implemented and tested FHAST on the Convey HC-1 and later ported it to the Convey HC-2ex, taking advantage of the large memory bandwidth available to these systems and the shared memory image between hardware and software. A preliminary version of FHAST running on the Convey HC-1 achieved up to 70× speedup compared to BOWTIE (single-threaded). An improved version of FHAST running on the Convey HC-2ex FPGAs achieved up to 12× speedup compared to BOWTIE running eight threads on an eight-core conventional architecture, while maintaining almost identical mapping accuracy. Because FHAST is a drop-in replacement for BOWTIE, it can be incorporated in any analysis pipeline that uses BOWTIE (e.g., TOPHAT).

ACM Transactions on Architecture and Code Optimization | 2010

Impact of high-level transformations within the ROCCC framework

Betul Buyukkurt; John Cortes; Jason R. Villarreal; Walid A. Najjar

Reconfigurable computers, where one or more FPGAs are attached to a conventional microprocessor, are promising platforms for code acceleration. Despite their advantages, programmability concerns and the lack of efficient design tools/compilers for FPGAs are preventing the technology's widespread adoption. Traditional compiler technology is specific to microprocessor-based systems and needs to be customized and augmented to address the needs of reconfigurable computing. The challenges are several: the resource and performance constraints for FPGAs are drastically different from those of microprocessors, and compiling for FPGAs requires laying out the computation in space as a circuit rather than in time as a sequence of instructions. ROCCC is an optimizing C-to-VHDL compiler specifically targeting reconfigurable computer platforms. ROCCC includes several high-level optimizations that parallelize and optimize the source code for minimized area and critical path length and maximized throughput. This article presents the effect of ROCCC's high-level transformations on the performance of the generated VHDL output. ROCCC utilizes: (1) several array access optimizations to eliminate redundant memory accesses, (2) procedure-level optimizations to achieve circuit area reductions of up to 88% compared to circuit areas generated from unoptimized codes, (3) loop-level optimizations to increase the throughput, and (4) transformations unique to certain classes of applications. These features help ROCCC generate circuits with very large degrees of parallelism capable of very high computation rates.

2012 IEEE Conference on High Performance Extreme Computing | 2012

Multithreaded FPGA acceleration of DNA sequence mapping

Edward Fernandez; Walid A. Najjar; Stefano Lonardi; Jason R. Villarreal

In bioinformatics, short read alignment is a computationally intensive operation that involves matching millions of short strings (called reads) against a reference genome. At the time of writing, a representative run requires matching tens of millions of reads, each about 100 symbols long, against a genome that can consist of a few billion characters. Existing short read aligners are expected to report all the occurrences of each read as well as allow users to control the number of allowed mismatches between reads and reference genome. Popular software implementations such as Bowtie [8] or BWA [10] can take many hours or days to execute, making the problem an ideal candidate for hardware acceleration. In this paper, we describe FHAST (FPGA Hardware Accelerated Sequencing-matching Tool), a hardware accelerator that acts as a drop-in replacement for short read alignment software. Our architecture masks memory latency by executing many concurrent hardware threads accessing memory simultaneously and consists of multiple parallel engines to exploit the parallelism available to us on an FPGA. We have implemented and tested FHAST on the Convey HC-1 [9], taking advantage of the large amount of memory bandwidth available to the system and the shared memory image between hardware and software. By comparing the performance of FHAST against Bowtie on the Convey HC-1 we observed up to ~70X improvement in total end-to-end execution time, reducing runs that take several hours to a few minutes. We also compare favorably the rate of growth when expanding FHAST to utilize multiple FPGAs against that of Bowtie on multiple CPUs.

ieee international conference on high performance computing data and analytics | 2014

Compiling irregular applications for reconfigurable systems

Robert J. Halstead; Jason R. Villarreal; Walid A. Najjar

Algorithms that exhibit irregular memory access patterns are known to show poor performance on multiprocessor architectures, particularly when memory access latency is variable. Many common data structures, including graphs, trees, and linked lists, exhibit these irregular memory access patterns. While FPGA-based code accelerators have been successful on applications with regular memory access patterns, they have not been further explored for irregular memory access patterns. Multithreading has been shown to be an effective technique in masking long latencies. We describe the compiler generation of concurrent hardware threads for FPGAs with the objective of masking the memory latency caused by irregular memory access patterns. The CHAT compiler extends the ROCCC toolset to generate customized state information for each dynamically generated thread. Initial results show a speedup of 50x.

irregular applications: architectures and algorithms | 2011

Exploring irregular memory accesses on FPGAs

Robert J. Halstead; Jason R. Villarreal; Walid A. Najjar

Algorithms that exhibit irregular memory access patterns are known to show poor performance on multiprocessor architectures, particularly when memory access latency is variable. Many common data structures, including graphs, trees, and linked lists, exhibit these irregular memory access patterns. While FPGA-based code accelerators have been successful on applications with regular memory access patterns, they have not been further explored for irregular memory access patterns. Multithreading has been shown to be an effective technique in masking long latencies. We describe the compiler generation of concurrent hardware threads for FPGAs with the objective of masking the memory latency caused by irregular memory access patterns. We extend the ROCCC compiler to generate customized state information for each dynamically generated thread.

field-programmable logic and applications | 2008

Compiled hardware acceleration of Molecular Dynamics code

Jason R. Villarreal; Walid A. Najjar

The objective of molecular dynamics (MD) simulations is to determine the shape of a molecule in a given biomolecular environment. These simulations are computationally very demanding: simulating a few milliseconds can take days or months depending on the number of atoms involved. MD simulations are therefore a prime candidate for FPGA-based code acceleration. We have investigated the possible acceleration of the commonly used MD program NAMD. This code is highly optimized for software-based execution and does not benefit from FPGA-based acceleration as written. We have therefore developed a modified version, based on the calculations NAMD performs, that streams a set of data through a highly pipelined circuit on the FPGA. We have used the ROCCC compiler toolset to generate the circuit and implemented it on the SGI Altix 4700 fitted with a RASC RC100 blade.
