Robert J. Halstead | Researchain

Publication

Featured research published by Robert J. Halstead.

field-programmable custom computing machines | 2010

Designing Modular Hardware Accelerators in C with ROCCC 2.0

Jason R. Villarreal; Adrian Park; Walid A. Najjar; Robert J. Halstead

While FPGA-based hardware accelerators have repeatedly been demonstrated as a viable option, their programmability remains a major barrier to their wider acceptance by application code developers. These platforms are typically programmed in a low-level hardware description language, a skill not common among application developers and a process that is often tedious and error-prone. Programming FPGAs from high-level languages would provide easier integration with software systems as well as open up hardware accelerators to a wider spectrum of application developers. In this paper, we present a major revision to the Riverside Optimizing Compiler for Configurable Circuits (ROCCC) designed to create hardware accelerators from C programs. Novel additions to ROCCC include (1) intuitive modular bottom-up design of circuits from C, and (2) separation of code generation from specific FPGA platforms. The additions we make do not introduce any new syntax to the C code and maintain the high-level optimizations from the ROCCC system that generate efficient code. The modular code we support functions identically as software or hardware. Additionally, we enable user control of hardware optimizations such as systolic array generation and temporal common subexpression elimination. We evaluate the quality of the ROCCC 2.0 tool by comparing it to hand-written VHDL code. We show comparable clock frequencies and an 18% higher throughput. The productivity advantages of ROCCC 2.0 are evaluated using the metrics of lines of code and programming time, showing an average 15x improvement over hand-written VHDL.

Proceedings of the IEEE | 2015

High-Level Language Tools for Reconfigurable Computing

Skyler Windh; Xiaoyin Ma; Robert J. Halstead; Prerna Budhkar; Zabdiel Luna; Omar Hussaini; Walid A. Najjar

In the past decade or so we have witnessed a steadily increasing interest in FPGAs as hardware accelerators: they provide an excellent mid-point between the reprogrammability of software devices (CPUs, DSPs, and GPUs) and the performance and low energy consumption of ASICs. However, the programmability of FPGA-based accelerators remains one of the biggest obstacles to their wider adoption. Developing FPGA programs requires extensive familiarity with hardware design and experience with a tedious and complex tool chain. For half a century, layers of abstractions have been developed that simplify the software development process: languages, compilers, dynamically linked libraries, operating systems, APIs, etc. Very few, if any, such abstractions exist in the development of FPGA programs. In this paper, we review the history of using FPGAs as hardware accelerators and summarize the challenges facing the raising of the programming abstraction layers. We survey five High-Level Language tools for the development of FPGA programs: Xilinx Vivado, Altera OpenCL, Bluespec BSV, ROCCC, and LegUp to provide an overview of their tool flow, the optimizations they provide, and a qualitative analysis of their hardware implementations of high-level code.

field-programmable custom computing machines | 2013

Accelerating Join Operation for Relational Databases with FPGAs

Robert J. Halstead; Bharat Sukhwani; Hong Min; Mathew S. Thoennes; Parijat Dube; Sameh W. Asaad; Balakrishna R. Iyer

In this paper, we investigate the use of field programmable gate arrays (FPGAs) to accelerate relational joins. Relational join is one of the most CPU-intensive, yet commonly used, database operations. Hashing can be used to reduce the time complexity from quadratic (naïve) to linear time. However, doing so can introduce false positives to the results which must be resolved. We present a hash-join engine on FPGA that performs hashing, conflict resolution, and joining on a PCIe-attached system, achieving greater than 11x speedup over software.
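
The hash-then-resolve idea in this abstract can be sketched in software. The block below is a minimal illustration, not the paper's FPGA engine: the function names, the bucket count, and the use of Python's built-in `hash` are all assumptions made for the example. It shows why hashing alone yields false positives (distinct keys sharing a bucket) and how a key comparison during the probe phase resolves them.

```python
from collections import defaultdict


def bucket_of(key, num_buckets):
    # Hypothetical hash function: few buckets, so distinct keys can collide.
    return hash(key) % num_buckets


def hash_join(build_rows, probe_rows, num_buckets=4):
    """Join two lists of (key, payload) tuples on equal keys."""
    # Build phase: one pass over the build relation.
    table = defaultdict(list)
    for key, payload in build_rows:
        table[bucket_of(key, num_buckets)].append((key, payload))

    # Probe phase: one pass over the probe relation.
    results = []
    for key, payload in probe_rows:
        for build_key, build_payload in table.get(bucket_of(key, num_buckets), ()):
            # Conflict resolution: a shared bucket is only a candidate match,
            # so compare the actual keys to discard false positives.
            if build_key == key:
                results.append((key, build_payload, payload))
    return results


build = [(1, "a"), (5, "b"), (2, "c")]
probe = [(5, "x"), (1, "y"), (9, "z")]
joined = hash_join(build, probe)
```

With four buckets, keys 1, 5, and 9 land in the same bucket, so the probe for key 9 finds candidates but the key comparison rejects all of them.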

compilers architecture and synthesis for embedded systems | 2013

Compiled multithreaded data paths on FPGAs for dynamic workloads

Robert J. Halstead; Walid A. Najjar

Hardware-supported multithreading can mask memory latency by switching the execution to ready threads, which is particularly effective on irregular applications. FPGAs provide an opportunity to have multithreaded data paths customized to each individual application. In this paper we describe the compiler generation of these hardware structures from a C subset targeting a Convey HC-2ex machine. We describe how this compilation approach differs from other C to HDL compilers. We use the compiler to generate a multithreaded sparse matrix vector multiplication kernel and compare its performance to existing FPGA and highly optimized software implementations.
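
For readers unfamiliar with the kernel named in this abstract, the following is a plain software sketch of sparse matrix vector multiplication. The compressed sparse row (CSR) layout and all names are assumptions chosen for illustration; the abstract does not state which storage format the generated hardware uses. The indirect read `x[col_idx[j]]` is the kind of data-dependent memory access that makes the workload irregular.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (assumed layout)."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Nonzeros of this row occupy positions row_ptr[row] .. row_ptr[row + 1] - 1.
        for j in range(row_ptr[row], row_ptr[row + 1]):
            # Irregular access: the index into x is itself loaded from memory.
            acc += values[j] * x[col_idx[j]]
        y[row] = acc
    return y


# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```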

ieee international conference on high performance computing data and analytics | 2014

Compiling irregular applications for reconfigurable systems

Robert J. Halstead; Jason R. Villarreal; Walid A. Najjar

Algorithms that exhibit irregular memory access patterns are known to show poor performance on multiprocessor architectures, particularly when memory access latency is variable. Many common data structures, including graphs, trees, and linked lists, exhibit these irregular memory access patterns. While FPGA-based code accelerators have been successful on applications with regular memory access patterns, they have not been further explored for irregular memory access patterns. Multithreading has been shown to be an effective technique in masking long latencies. We describe the compiler generation of concurrent hardware threads for FPGAs with the objective of masking the memory latency caused by irregular memory access patterns. The CHAT compiler extends the ROCCC toolset to generate customized state information for each dynamically generated thread. Initial results show a speed-up of 50x.

irregular applications: architectures and algorithms | 2011

Exploring irregular memory accesses on FPGAs

Robert J. Halstead; Jason R. Villarreal; Walid A. Najjar

Algorithms that exhibit irregular memory access patterns are known to show poor performance on multiprocessor architectures, particularly when memory access latency is variable. Many common data structures, including graphs, trees, and linked lists, exhibit these irregular memory access patterns. While FPGA-based code accelerators have been successful on applications with regular memory access patterns, they have not been further explored for irregular memory access patterns. Multithreading has been shown to be an effective technique in masking long latencies. We describe the compiler generation of concurrent hardware threads for FPGAs with the objective of masking the memory latency caused by irregular memory access patterns. We extend the ROCCC compiler to generate customized state information for each dynamically generated thread.

ACM Transactions on Embedded Computing Systems | 2014

A study on parallelizing XML path filtering using accelerators

Roger Moussalli; Mariam Salloum; Robert J. Halstead; Walid A. Najjar; Vassilis J. Tsotras

Publish-subscribe systems present the state of the art in information dissemination to multiple users. Such systems have evolved from simple topic-based to the current XML-based systems. XML-based pub-sub systems provide users with more flexibility by allowing the formulation of complex queries on the content as well as the structure of the streaming messages. Messages that match a given user query are forwarded to the user. This article examines how to exploit the parallelism found in XPath filtering. Using an incoming XML stream, parsing and matching thousands of user profiles are performed simultaneously by matching engines. We show the benefits and trade-offs of mapping the proposed filtering approach onto FPGAs, processing streams of XML at wire speed, and GPUs, providing the flexibility of software. This is in contrast to conventional approaches bound by the sequential aspect of software computing, associated with a large memory footprint. By converting XPath expressions into custom stacks, our solution is the first to provide support for complex XPath structural constructs, such as parent-child and ancestor-descendant relations, whilst allowing wildcarding and recursion. The measured speedups resulting from the GPU and FPGA accelerations versus single-core CPUs are up to 6.6X and 2.5 orders of magnitude, respectively. The FPGA approaches are up to 31X faster than software running on 12 CPU cores.

data management on new hardware | 2016

FPGA-accelerated group-by aggregation using synchronizing caches

Ildar Absalyamov; Prerna Budhkar; Skyler Windh; Robert J. Halstead; Walid A. Najjar; Vassilis J. Tsotras

Recent trends in hardware have dramatically dropped the price of RAM and shifted focus from systems operating on disk-resident data to in-memory solutions. In this environment high memory access latency, also known as the memory wall, becomes the biggest data processing bottleneck. Traditional CPU-based architectures solved this problem by introducing large cache hierarchies. However, algorithms that experience poor locality can limit the benefits of caching. In turn, hardware multithreading provides a generic solution that does not rely on algorithm-specific locality properties. In this paper we present an FPGA-accelerated implementation of in-memory group-by hash aggregation. Our design relies on hardware multithreading to efficiently mask long memory access latency by implementing a custom operation datapath on FPGA. We propose using CAMs (Content Addressable Memories) as a mechanism of synchronization and local pre-aggregation. To the best of our knowledge this is the first work that uses CAMs as a synchronizing cache. We evaluate aggregation throughput against state-of-the-art multithreaded software implementations and demonstrate that the FPGA-accelerated approach significantly outperforms them on large grouping key cardinalities and yields speedups of up to 10x.
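
The local pre-aggregation idea can be approximated in software as a rough analogy. The sketch below is not the paper's CAM design: the small `OrderedDict` buffer, its FIFO eviction policy, the capacity of two entries, and all names are assumptions made for illustration. It shows how merging repeated keys in a small buffer reduces the number of updates that reach the main hash table.

```python
from collections import OrderedDict


def group_by_sum(rows, cache_capacity=2):
    """Sum values per key; returns (result table, number of table updates).

    Assumes cache_capacity >= 1.
    """
    table = {}             # stands in for the large in-memory hash table
    cache = OrderedDict()  # small pre-aggregation buffer (assumed FIFO eviction)
    updates = 0

    def flush(key, partial):
        nonlocal updates
        table[key] = table.get(key, 0) + partial
        updates += 1

    for key, value in rows:
        if key in cache:
            # Hit: merge locally, no access to the main table.
            cache[key] += value
        else:
            if len(cache) >= cache_capacity:
                # Miss with a full buffer: evict the oldest partial aggregate.
                old_key, partial = cache.popitem(last=False)
                flush(old_key, partial)
            cache[key] = value

    # Drain the remaining partial aggregates.
    for key, partial in cache.items():
        flush(key, partial)
    return table, updates


rows = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)]
totals, updates = group_by_sum(rows)
```

In this run the five input rows cause four updates to the main table, because the two consecutive rows for key `"a"` are merged in the buffer first.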

asilomar conference on signals, systems and computers | 2010

Is there a tradeoff between programmability and performance?

Robert J. Halstead; Jason R. Villarreal; Roger Moussalli; Walid A. Najjar

While the computational power of Field Programmable Gate Arrays (FPGAs) makes them attractive as code accelerators, the lack of high-level language programming tools is a major obstacle to their wider use. Graphics Processing Units (GPUs), on the other hand, have benefitted from advanced and widely used high-level programming tools. This paper evaluates the performance, throughput and energy of both FPGAs and GPUs on image processing codes using high-level language programming tools for both.

FPGAs for Software Programmers | 2016

ROCCC 2.0

Walid A. Najjar; Jason R. Villarreal; Robert J. Halstead

The Riverside optimizing compiler for configurable computing (ROCCC) was started as a project at The University of California, Riverside in 2002. To put this in historical context: Field programmable gate arrays (FPGAs) were much smaller, and slower, than they are today (2015); Graphics processing units (GPUs) were used exclusively for graphics; reconfigurable computing was taking shape as a research area but not yet within the mainstream of academic research, let alone in industrial production. However, multiple research projects had already demonstrated, many times over, the clear advantages and potential of this nascent paradigm as an alternative that combines the re-programmability advantages of fixed data path devices (Central processing units (CPUs), Digital signal processors (DSPs) and GPUs) with the high speed of custom hardware (Application-specific integrated circuits (ASICs)). Within that time frame, the nearly exclusive focus of reconfigurable computing was on signal and image processing because of their streaming nature. Video processing was considered a future possibility to be realized when the size (area) and bandwidth capabilities of FPGAs got larger.
