Kevin R. Townsend
Iowa State University
Publications
Featured research published by Kevin R. Townsend.
International Parallel and Distributed Processing Symposium | 2014
Osama G. Attia; Tyler Johnson; Kevin R. Townsend; Phillip H. Jones; Joseph Zambreno
Large-scale graph structures are considered a keystone for many emerging high-performance computing applications, in which Breadth-First Search (BFS) is an important building block. For such graph structures, BFS operations tend to be memory-bound rather than compute-bound. In this paper, we present an efficient reconfigurable architecture for parallel BFS that adopts new optimizations for utilizing memory bandwidth. Our architecture adopts a custom graph representation based on the compressed sparse row (CSR) format, as well as a restructuring of the conventional BFS algorithm. By taking maximum advantage of available memory bandwidth, our architecture keeps its processing elements continuously active. Using a commercial high-performance reconfigurable computing system (the Convey HC-2), our results demonstrate a 5× speedup over previously published FPGA-based implementations.
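For reference, a level-synchronized BFS over a standard CSR graph can be sketched in a few lines of Python. This is illustrative only: the array names are mine, and the paper's hardware uses a customized CSR layout and a restructured algorithm rather than this textbook form.

# Level-synchronized BFS over a CSR graph (software sketch, not the paper's design).
def bfs_csr(row_ptr, col_idx, source, num_vertices):
    levels = [-1] * num_vertices              # -1 marks unvisited vertices
    levels[source] = 0
    frontier, level = [source], 0
    while frontier:                           # one iteration per BFS level
        next_frontier = []
        for u in frontier:
            # neighbors of u live in col_idx[row_ptr[u]:row_ptr[u+1]]
            for v in col_idx[row_ptr[u]:row_ptr[u + 1]]:
                if levels[v] == -1:
                    levels[v] = level + 1
                    next_frontier.append(v)
        frontier, level = next_frontier, level + 1
    return levels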
Formal Methods | 2012
Chad Nelson; Kevin R. Townsend; Bhavani Satyanarayana Rao; Phillip H. Jones; Joseph Zambreno
The mapping of many short sequences of DNA, called reads, to a long reference genome is a common task in molecular biology. The task amounts to a simple string search, allowing for a few mismatches due to mutations and inexact read quality. While existing solutions attempt to align a high percentage of the reads using small memory footprints, Shepard is concerned only with exact matches and speed. Using the human genome, Shepard is on the order of hundreds of thousands of times faster than current software implementations such as SOAP2 or Bowtie, and about 60 times faster than GPU implementations such as SOAP3. Shepard contains two components: a software program to preprocess a reference genome into a hash table, and a hardware pipeline for performing fast lookups. The hash table has one entry for each unique 100-base-pair sequence that occurs in the reference genome, containing the index of its last occurrence and the number of occurrences. To reduce the hash table size, a minimal perfect hash table is used. The hardware pipeline was designed to perform hash table lookups very quickly, on the order of 600 million lookups per second, and was implemented on a Convey HC-1 high-performance reconfigurable computing system. Shepard streams all of the short reads through a custom hardware pipeline and writes the alignment data (index of last occurrence and number of occurrences) to a binary results array.
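A minimal software analogue of Shepard's two components, assuming a plain Python dict in place of the minimal perfect hash table and omitting the hardware pipeline entirely:

# Sketch of Shepard's preprocess-then-lookup flow (a dict stands in for the
# minimal perfect hash table; the real lookups run in an FPGA pipeline).
def preprocess(reference, k=100):
    """Map each unique k-base sequence to (index of last occurrence, count)."""
    table = {}
    for i in range(len(reference) - k + 1):
        seq = reference[i:i + k]
        _, count = table.get(seq, (i, 0))
        table[seq] = (i, count + 1)           # i is the latest occurrence so far
    return table

def align(table, reads):
    """Return (last_index, count) per read, or (-1, 0) when no exact match."""
    return [table.get(read, (-1, 0)) for read in reads]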
Application-Specific Systems, Architectures and Processors | 2013
Kevin R. Townsend; Joseph Zambreno
Sparse Matrix Vector Multiplication (SpMV) is an important computational kernel in many scientific computing applications. Pipelining multiply-accumulate operations shifts SpMV from a computationally bounded kernel to an I/O-bounded kernel. In this paper, we propose a design methodology and hardware architecture for SpMV that seeks to utilize system memory bandwidth as efficiently as possible by reducing the matrix element storage with on-chip decompression hardware, reusing the vector data by mixing row and column matrix traversal, and recycling data with matrix-dependent on-chip storage. Our experimental results with a Convey HC-1/HC-2 reconfigurable computing system indicate that for certain sparse matrices, our R3 (reduce, reuse, recycle) methodology performs twice as fast as previous reconfigurable implementations and effectively competes against other platforms.
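As a baseline for what the R3 architecture accelerates, textbook CSR SpMV is a short doubly nested loop; the inner multiply-accumulate chain is the part that gets pipelined in hardware. The names below are illustrative:

# Reference CSR sparse matrix-vector multiply, y = A * x (software baseline;
# R3 adds compression, traversal mixing, and on-chip data reuse on top of this).
def spmv_csr(row_ptr, col_idx, values, x):
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0                             # multiply-accumulate chain
        for j in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[j] * x[col_idx[j]]
        y[row] = acc
    return y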
Electro/Information Technology | 2015
Kevin R. Townsend; Joseph Zambreno
This paper presents a lossless double-precision floating-point compression algorithm. Floating-point compression can reduce the cost of storing and transmitting the large amounts of data associated with big data problems. A previous algorithm, FPC, performs well and relies on predictors; however, predictors have limitations. Our program, fzip, overcomes some of these limitations. fzip has two phases: first, BWT compression; second, value and prefix compression with variable-length arithmetic encoding. This approach has the advantage that the phases work together, with each phase compressing a different type of pattern. On average, fzip achieves a 20% higher compression ratio than other algorithms.
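To make the first phase concrete, here is a toy Burrows-Wheeler transform using the naive sorted-rotations construction. Practical BWT implementations use suffix arrays, fzip operates on double-precision data rather than text, and the second phase (value/prefix compression with arithmetic coding) is not shown.

# Naive BWT (conceptual sketch of fzip's first phase only).
def bwt(data: bytes, sentinel: bytes = b"\x00") -> bytes:
    s = data + sentinel                       # unique terminator keeps the transform invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(rot[-1] for rot in rotations)

print(bwt(b"banana"))  # clusters like bytes together, helping a later stage compress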
Electro/Information Technology | 2015
Kevin R. Townsend; Song Sun; Tyler Johnson; Osama G. Attia; Phillip H. Jones; Joseph Zambreno
Text classification is an important enabling technology for a wide range of applications such as Internet search, email filtering, network intrusion detection, and data mining of electronic documents in general. The k Nearest Neighbors (k-NN) text classification algorithm is among the most accurate classification approaches, but is also among the most computationally expensive. In this paper, we propose accelerating k-NN using a novel reconfigurable hardware-based architecture. More specifically, we accelerate a k-NN application's core with an FPGA-based sparse matrix-vector multiplication coprocessor. On average, our implementation shows a speedup factor of 15 over a naïve single-threaded CPU implementation of k-NN text classification for our datasets, and a speedup factor of 1.5 over a 32-threaded parallelized CPU implementation.
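The reduction the paper exploits can be sketched directly: with training documents stored as rows of a sparse term matrix in CSR form, scoring every document against a query vector is a single SpMV, and classification is a majority vote over the k best scores. Illustrative Python, not the FPGA design:

# k-NN text classification via one SpMV over a CSR term matrix (sketch).
from collections import Counter

def knn_classify(row_ptr, col_idx, values, labels, query, k=5):
    scores = []
    for row in range(len(row_ptr) - 1):       # one dot product per document
        acc = 0.0
        for j in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[j] * query[col_idx[j]]
        scores.append((acc, labels[row]))
    top_k = sorted(scores, reverse=True)[:k]  # k most similar documents
    return Counter(label for _, label in top_k).most_common(1)[0][0]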
Custom Integrated Circuits Conference | 2013
Tao Zeng; Kevin R. Townsend; Jingbo Duan; Degang Chen
Device variability has become one of the fundamental challenges to high-resolution and high-accuracy DACs in nanometer and emerging processes. This paper introduces a 15-bit binary-weighted current-steering DAC in a standard 130 nm CMOS technology, which utilizes a new random mismatch compensation theory called ordered element matching to improve static linearity in the presence of large variability. The chip's core area is less than 0.42 mm², of which the 7-bit MSB current-source area is well within 0.021 mm². Measurement results show that the DAC's DNL and INL can be reduced from 9.85 LSB and 17.41 LSB to 0.34 LSB and 0.77 LSB, respectively.
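A toy Monte Carlo illustrates the general idea behind ordered element matching (this pairing scheme is my sketch of the concept, not the paper's 15-bit calibration circuit): sorting mismatched unit elements and combining extremes yields composite elements with a much smaller spread.

# Why ordering helps: pair the largest unit with the smallest, and so on.
import random

random.seed(1)
units = [1.0 + random.gauss(0, 0.01) for _ in range(128)]     # ~1% random mismatch

naive = [units[2 * i] + units[2 * i + 1] for i in range(64)]  # arbitrary pairing
ordered = sorted(units)
paired = [ordered[i] + ordered[-1 - i] for i in range(64)]    # extremes cancel

def spread(xs):
    return max(xs) - min(xs)

print(spread(naive), spread(paired))   # paired spread is markedly smaller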
Reconfigurable Computing and FPGAs | 2015
Osama G. Attia; Alex Grieve; Kevin R. Townsend; Phillip H. Jones; Joseph Zambreno
In this paper, we study the design and implementation of a reconfigurable architecture for graph processing algorithms. The architecture uses a message-passing model targeting shared-memory multi-FPGA platforms. We take advantage of our architecture to showcase a parallel implementation of the all-pairs shortest path (APSP) algorithm for unweighted directed graphs. Our APSP implementation adopts a fine-grained processing methodology while attempting to minimize communication and synchronization overhead. Our design utilizes 64 parallel kernels for vertex-centric processing. We evaluate a prototype of our system on a Convey HC-2 high-performance computing platform, and our performance results demonstrate an average speedup of 7.9× over the sequential APSP algorithm and an average speedup of 2.38× over a multi-core implementation.
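For unweighted graphs, APSP reduces to one BFS per source vertex, which is the sequential baseline sketched below (illustrative Python; the actual design runs 64 vertex-centric kernels in parallel on the FPGA):

# Sequential unweighted APSP: one BFS per source vertex.
def apsp(row_ptr, col_idx):
    n = len(row_ptr) - 1
    dist = []
    for src in range(n):
        d = [-1] * n                          # -1 means unreachable
        d[src] = 0
        frontier = [src]
        while frontier:
            nxt = []
            for u in frontier:
                for v in col_idx[row_ptr[u]:row_ptr[u + 1]]:
                    if d[v] == -1:
                        d[v] = d[u] + 1
                        nxt.append(v)
            frontier = nxt
        dist.append(d)
    return dist                               # dist[s][t] = hop count from s to t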
IEEE Transactions on Parallel and Distributed Systems | 2016
Chad Nelson; Kevin R. Townsend; Osama G. Attia; Phillip H. Jones; Joseph Zambreno
The alignment of many short sequences of DNA, called reads, to a long reference genome is a common task in molecular biology. When the problem is expanded to handle typical workloads of billions of reads, execution time becomes critical. In this paper we present a novel reconfigurable architecture for minimal perfect sequencing (RAMPS). While existing solutions attempt to align a high percentage of the reads using a small memory footprint, RAMPS focuses on performing fast exact matching. Using the human genome as a reference, RAMPS aligns short reads hundreds of thousands of times faster than current software implementations such as SOAP2 or Bowtie, and about a thousand times faster than GPU implementations such as SOAP3. Whereas other aligners require hours to preprocess reference genomes, RAMPS can preprocess the reference human genome in a few minutes, opening the possibility of using new reference sources that are more genetically similar to the newly sequenced data.
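One subtlety of the minimal-perfect-hash approach is worth spelling out: an MPH maps each known k-mer to a unique slot, but an unknown read also hashes to some slot, so each lookup must verify the stored key before trusting the hit. In this runnable toy, the dict-backed build_demo_mph stand-in is mine; RAMPS builds a real minimal perfect hash offline.

# MPH lookups must verify the key: unknown k-mers still land in some slot.
def build_demo_mph(keys):
    index = {k: i for i, k in enumerate(keys)}        # stand-in for a real MPH
    return lambda kmer: index.get(kmer, hash(kmer) % len(keys))

def lookup(mph, slots, kmer):
    key, last_index, count = slots[mph(kmer)]
    if key != kmer:                                   # reject false matches
        return (-1, 0)
    return (last_index, count)

mph = build_demo_mph(["ACGT", "TTAG", "GGCA"])
slots = [("ACGT", 10, 2), ("TTAG", 55, 1), ("GGCA", 7, 3)]
print(lookup(mph, slots, "ACGT"))   # (10, 2)
print(lookup(mph, slots, "AAAA"))   # (-1, 0): unknown read rejected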
ACM Transactions on Reconfigurable Technology and Systems | 2016
Osama G. Attia; Kevin R. Townsend; Phillip H. Jones; Joseph Zambreno
The Strongly Connected Components (SCCs) detection algorithm serves as a keystone for many graph analysis applications. The SCC execution time for large-scale graphs, as with many other graph algorithms, is dominated by memory latency. In this article, we investigate the design of a parallel hardware architecture for the detection of SCCs in directed graphs. We propose a design methodology that alleviates memory latency and problems with irregular memory access. The design is composed of 16 processing elements dedicated to parallel Breadth-First Search (BFS) and eight processing elements dedicated to computing set intersections in parallel. Processing elements are organized to reuse resources and utilize memory bandwidth efficiently. We demonstrate a prototype of our design using the Convey HC-2 system, a commercial high-performance reconfigurable computing coprocessor. Our experimental results show a speedup of as much as 17× for detecting SCCs in large-scale graphs when compared to a conventional sequential software implementation.
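The split between BFS engines and intersection engines maps onto the classic forward-backward observation: the SCC containing a pivot vertex is exactly the set reachable from the pivot intersected with the set that reaches it. A sequential sketch of that one step (full forward-backward algorithms then recurse on the remaining vertices):

# SCC(pivot) = forward-reachable set ∩ backward-reachable set.
def reachable(adj, src):
    seen, stack = {src}, [src]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def scc_of(adj, radj, pivot):                 # radj is the reversed graph
    return reachable(adj, pivot) & reachable(radj, pivot)

adj  = {0: [1], 1: [2], 2: [0, 3], 3: []}
radj = {1: [0], 2: [1], 0: [2], 3: [2]}
print(scc_of(adj, radj, 0))                   # {0, 1, 2}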
International Journal of Reconfigurable Computing | 2015
Kevin R. Townsend; Osama G. Attia; Phillip H. Jones; Joseph Zambreno
On-chip multiport memory cores are crucial primitives for many modern high-performance reconfigurable architectures and multicore systems. Previous approaches for scaling memory cores come at the cost of operating frequency, communication overhead, and logic resources, without increasing the storage capacity of the memory. In this paper, we present two approaches for designing multiport memory cores that are suitable for reconfigurable accelerators with substantial on-chip memory or complex communication. Our design approaches tackle these challenges by banking RAM blocks and utilizing interconnect networks, which allows scaling without sacrificing logic resources. With banking, memory congestion is unavoidable, so we evaluate our multiport memory cores under different memory access patterns to gain insight into the design trade-offs. We demonstrate our implementation with up to 256 memory ports using a Xilinx Virtex-7 FPGA. Our experimental results report high-throughput memories with resource usage that scales with the number of ports.
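The central trade-off of banking can be modeled in a few lines: each bank serves one request per cycle, so a batch of simultaneous port requests takes as many cycles as the deepest per-bank queue. This toy model is illustrative only, not the paper's FPGA implementation:

# Cycles to serve one batch of port requests in a banked memory (bank = addr % N).
from collections import Counter

def cycles_for(requests, num_banks):
    per_bank = Counter(addr % num_banks for addr in requests)
    return max(per_bank.values())             # deepest bank queue dominates

ports = 8
print(cycles_for(range(ports), num_banks=8))          # stride-1: 1 cycle
print(cycles_for([8 * i for i in range(ports)], 8))   # stride-8: all hit bank 0, 8 cycles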