
Arpith C. Jacob

Washington University in St. Louis

Publication

Featured research published by Arpith C. Jacob.

ACM Transactions on Reconfigurable Technology and Systems | 2008

Mercury BLASTP: Accelerating Protein Sequence Alignment

Arpith C. Jacob; Joseph M. Lancaster; Jeremy Buhler; Brandon Harris; Roger D. Chamberlain

Large-scale protein sequence comparison is an important but compute-intensive task in molecular biology. BLASTP is the most popular tool for comparative analysis of protein sequences. In recent years, an exponential increase in the size of protein sequence databases has required either exponentially more running time or a cluster of machines to keep pace. To address this problem, we have designed and built a high-performance FPGA-accelerated version of BLASTP, Mercury BLASTP. In this article, we describe the architecture of the portions of the application that are accelerated in the FPGA, and we also describe the integration of these FPGA-accelerated portions with the existing BLASTP software. We have implemented Mercury BLASTP on a commodity workstation with two Xilinx Virtex-II 6000 FPGAs. We show that the new design runs 11–15 times faster than software BLASTP on a modern CPU while delivering close to 99% identical results.

IEEE Computer | 2008

Hardware Technologies for High-Performance Data-Intensive Computing

Maya Gokhale; Jonathan D. Cohen; Andy Yoo; William Marcus Miller; Arpith C. Jacob; Craig D. Ulmer; Roger A. Pearce

Data-intensive problems challenge conventional computing architectures with demanding CPU, memory, and I/O requirements. Experiments with three benchmarks suggest that emerging hardware technologies can significantly boost performance of a wide range of applications by increasing compute cycles and bandwidth and reducing latency.

Field-Programmable Custom Computing Machines | 2007

FPGA-accelerated seed generation in Mercury BLASTP

Arpith C. Jacob; Joseph M. Lancaster; Jeremy Buhler; Roger D. Chamberlain

BLASTP is the most popular tool for comparative analysis of protein sequences. In recent years, an exponential increase in the size of protein sequence databases has required either exponentially more runtime or a cluster of machines to keep pace. To address this problem, we have designed and built a high-performance FPGA-accelerated version of BLASTP, Mercury BLASTP. In this paper, we focus on seed generation, the first stage of the BLASTP algorithm. Our seed generator is capable of processing database residues at up to 219 Mresidues/second for 2048-residue queries. The full Mercury BLASTP pipeline, including our seed generator, achieves a speedup of 37× over the popular NCBI BLASTP software on a 2.8 GHz Intel P4 CPU, with sensitivity more than 99% that of the software. Our architecture can be generalized to accelerate the seed generation stage in other important biocomputing applications.

A technique is presented which allows an FPGA-based reconfigurable system-on-chip to automatically and dynamically load hardware peripheral controllers and software device drivers depending on the system's automated identification of peripheral boards which are connected to the FPGA. The technique loads peripheral detection modules into peripheral controller slots at system startup, and after these modules identify the peripheral, the correct hardware controllers and software drivers are loaded.

Application-Specific Systems, Architectures and Processors | 2008

Accelerating Nussinov RNA secondary structure prediction with systolic arrays on FPGAs

Arpith C. Jacob; Jeremy Buhler; Roger D. Chamberlain

RNA structure prediction, or folding, is a compute-intensive task that lies at the core of several search applications in bioinformatics. We begin to address the need for high-throughput RNA folding by accelerating the Nussinov folding algorithm using a 2D systolic array architecture. We adapt classic results on parallel string parenthesization to produce efficient systolic arrays for the Nussinov algorithm, elaborating these array designs to produce fully realized FPGA implementations. Our designs achieve estimated speedups up to 39× on a Xilinx Virtex-II 6000 FPGA over a modern x86 CPU.
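For readers unfamiliar with the recurrence being accelerated, the following is a minimal software sketch of Nussinov folding. It is not the systolic array design from the paper. The scoring rule (maximize the number of nested base pairs), the inclusion of G-U wobble pairs, and the `min_loop` parameter are assumptions made for illustration.

```python
def nussinov(seq, min_loop=0):
    """Return the maximum number of nested base pairs in an RNA sequence.

    Assumptions (not from the abstract): Watson-Crick and G-U wobble pairs
    each score 1; bases i and j may pair only when j - i > min_loop.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # table[i][j] holds the best score for the subsequence seq[i..j].
    table = [[0] * n for _ in range(n)]
    for span in range(1, n):                  # fill by increasing length
        for i in range(n - span):
            j = i + span
            best = max(table[i + 1][j],       # base i left unpaired
                       table[i][j - 1])       # base j left unpaired
            if span > min_loop and (seq[i], seq[j]) in pairs:
                inner = table[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)   # i pairs with j
            for k in range(i + 1, j):         # split into two sub-folds
                best = max(best, table[i][k] + table[k + 1][j])
            table[i][j] = best
    return table[0][n - 1]
```

The inner loop over `k` is the source of the cubic running time. Cells on the same anti-diagonal (`span`) do not depend on each other, which is the kind of parallelism an array implementation can exploit.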

Field-Programmable Logic and Applications | 2007

A Banded Smith-Waterman FPGA Accelerator for Mercury BLASTP

Brandon Harris; Arpith C. Jacob; Joseph M. Lancaster; Jeremy Buhler; Roger D. Chamberlain

Large-scale protein sequence comparison is an important but compute-intensive task in molecular biology. The popular BLASTP software for this task has become a bottleneck for proteomic database search. One third of this software's time is spent executing the Smith-Waterman dynamic programming algorithm. This work describes a novel FPGA design for banded Smith-Waterman, an algorithmic variant tuned to the needs of BLASTP. This design has been implemented in Mercury BLASTP, our FPGA-accelerated version of the BLASTP algorithm. We show that Mercury BLASTP runs 6–16 times faster than software BLASTP on a modern CPU while delivering 99% identical results.
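The idea of restricting Smith-Waterman to a band can be sketched in software as below. This is a generic banded local alignment, not the specific variant or hardware design described in the paper. The band is assumed to be centered on the main diagonal, and the linear gap penalty and the `match`, `mismatch`, and `gap` values are illustrative assumptions (BLASTP itself scores with a substitution matrix).

```python
def banded_smith_waterman(a, b, band=2, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |i - j| <= band.

    Assumptions for illustration: simple match/mismatch scores and a
    linear gap penalty; `gap` must be negative so that cells outside the
    band (left at 0) can never contribute to a score inside it.
    """
    n, m = len(a), len(b)
    prev = [0] * (m + 1)
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        lo = max(1, i - band)          # first column inside the band
        hi = min(m, i + band)          # last column inside the band
        for j in range(lo, hi + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,
                         prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                         prev[j] + gap,       # gap in b
                         cur[j - 1] + gap)    # gap in a
            best = max(best, cur[j])
        prev = cur
    return best
```

Each row touches at most `2 * band + 1` cells, so the work grows linearly with sequence length for a fixed band instead of quadratically. With `band=0` the function reduces to an ungapped alignment along the main diagonal.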

International Parallel and Distributed Processing Symposium | 2007

Preliminary results in accelerating profile HMM search on FPGAs

Arpith C. Jacob; Joseph M. Lancaster; Jeremy Buhler; Roger D. Chamberlain

Comparison between biosequences and probabilistic models is an increasingly important part of modern DNA and protein sequence analysis. The large and growing number of such models in today's databases demands computational approaches to searching these databases faster, while maintaining high sensitivity to biologically meaningful similarities. This work describes an FPGA-based accelerator for comparing proteins to hidden Markov models of the type used to represent protein motifs in the popular HMMER motif finder. Our engine combines a systolic array design with enhancements to pipeline the complex Viterbi calculation that forms the core of the comparison, and to support coarse-grained parallelism and streaming of multiple sequences within one FPGA. Performance estimates based on a functioning VHDL realisation of our design show a 190 times speedup over the same computation in optimised software on a modern general-purpose CPU.
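As background for the Viterbi calculation mentioned above, here is a minimal log-space Viterbi sketch for a generic hidden Markov model. It does not reproduce the profile HMM topology used by HMMER or the pipelined hardware design; the function name, the dictionary-based model layout, and the toy states are assumptions for illustration only.

```python
import math


def _log(p):
    # Map a zero probability to -inf instead of raising an error.
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its log probability).

    `obs` must be non-empty. Probabilities are plain dictionaries:
    start_p[s], trans_p[r][s], emit_p[s][symbol].
    """
    scores = {s: _log(start_p[s]) + _log(emit_p[s].get(obs[0], 0.0))
              for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this step.
            best_prev = max(states,
                            key=lambda r: scores[r] + _log(trans_p[r][s]))
            pointers[s] = best_prev
            new_scores[s] = (scores[best_prev]
                             + _log(trans_p[best_prev][s])
                             + _log(emit_p[s].get(symbol, 0.0)))
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]
```

Each step depends only on the scores from the previous step, and the per-state maximizations within a step are independent of one another.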

Field-Programmable Custom Computing Machines | 2010

Rapid RNA Folding: Analysis and Acceleration of the Zuker Recurrence

Arpith C. Jacob; Jeremy Buhler; Roger D. Chamberlain

RNA folding is a compute-intensive task that lies at the core of search applications in bioinformatics such as RNAfold and UNAFold. In this work, we analyze the Zuker RNA folding algorithm, which is challenging to accelerate because it is resource intensive and has a large number of variable-length dependencies. We use a technique of Lyngsø to rewrite the recurrence in a form that makes polyhedral analysis more effective and use data pipelining and tiling to generate a hardware-friendly implementation. Compared to earlier work, processors in our array are more efficient and use fewer logic and memory resources. We implemented our array on a Xilinx Virtex 4 LX100-12 FPGA and experimentally verified a 103x speedup over a single core of a 3 GHz Intel Core 2 Duo CPU. The accelerator is also 17x faster than a recent Zuker implementation on a Virtex 4 LX200-11 FPGA and 12x and 6x faster respectively than an Nvidia Tesla C870 and GTX280 GPU. We conclude with a number of lessons in using FPGAs to implement arrays after polyhedral analysis. We advocate using polyhedral analysis to accelerate other dynamic programming recurrences in computational biology.

International Workshop on High Performance Reconfigurable Computing Technology and Applications | 2007

Language classification using n-grams accelerated by FPGA-based Bloom filters

Arpith C. Jacob; Maya Gokhale

N-gram (n-character sequences in text documents) counting is a well-established technique used in classifying the language of text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (HAIL algorithm) that uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our implementation of end-to-end language classification runs at 85x the speed of comparable software and 1.45x the speed of the competing hardware design.
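The classification scheme described above can be sketched sequentially in software: one Bloom filter per language holds that language's n-grams, and a document is assigned to the language whose filter matches the most of its n-grams. This is only an illustration of the idea, not the paper's hardware design. The filter size, the SHA-256-based hash construction, the 4-gram length, and all function names are assumptions.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; sizes and hashing scheme are illustrative."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def ngrams(text, n=4):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples, n=4):
    """Build one Bloom filter per language from sample text."""
    filters = {}
    for lang, text in samples.items():
        bloom = BloomFilter()
        for gram in ngrams(text.lower(), n):
            bloom.add(gram)
        filters[lang] = bloom
    return filters


def classify(text, filters, n=4):
    """Return the language whose filter matches the most n-grams."""
    grams = ngrams(text.lower(), n)
    scores = {lang: sum(gram in bloom for gram in grams)
              for lang, bloom in filters.items()}
    return max(scores, key=scores.get)
```

A Bloom filter never reports a false negative, so every n-gram seen during training is guaranteed to match; false positives are possible and slightly inflate the scores of the wrong languages. The lookups for different n-grams and different languages are independent of each other.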

Application-Specific Systems, Architectures and Processors | 2010

Design of throughput-optimized arrays from recurrence abstractions

Arpith C. Jacob; Jeremy Buhler; Roger D. Chamberlain

Many compute-bound applications have seen order-of-magnitude speedups using special-purpose accelerators. FPGAs in particular are good at implementing recurrence equations realized as arrays. Existing high-level synthesis approaches for recurrence equations produce an array that is latency-space optimal. We target applications that operate on a large collection of small inputs, e.g. a database of biological sequences, where overall throughput is the most important measure of performance. In this work, we introduce a new design-space exploration procedure within the polyhedral framework to optimize throughput of a systolic array subject to area and bandwidth constraints of an FPGA device. Our approach is to exploit additional parallelism by pipelining multiple inputs on an array and multiple iteration vectors in a processing element. We prove that the throughput of an array is given by the inverse of the maximum number of iteration vectors executed by any processor in the array, which is determined solely by the array's projection vector. We have applied this observation to discover novel arrays for Nussinov RNA folding. Our throughput-optimized array is 2× faster than the standard latency-space optimal array, yet it uses 15% fewer LUT resources. We achieve a further 2× speedup by processor pipelining, with only a 37% increase in resources. Our tool suggests additional arrays that trade area for throughput and are 4–5× faster than the currently used latency-optimized array. These novel arrays are 70–172× faster than a software baseline.
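The throughput statement above can be illustrated with a toy calculation: project a two-dimensional iteration domain along a candidate projection vector and count how many iteration vectors land on the busiest processor. This is not the paper's design-space exploration tool. The triangular domain, the candidate vectors, and the function names are assumptions, and the sketch ignores whether a valid schedule exists for a given projection.

```python
from collections import Counter


def triangular_domain(n):
    """Toy iteration domain: all points (i, j) with 0 <= i < j < n."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def max_iterations_per_processor(points, u):
    """Count iteration vectors on the busiest processor.

    Points that differ by a multiple of the projection vector u map to
    the same processor. For a 2D integer vector u whose components have
    gcd 1 (an assumption of this sketch), two points lie on the same
    line along u exactly when their cross products with u are equal.
    """
    counts = Counter(i * u[1] - j * u[0] for i, j in points)
    return max(counts.values())


def relative_throughput(points, u):
    """Inverse of the busiest processor's load, per the stated result."""
    return 1.0 / max_iterations_per_processor(points, u)
```

For the toy domain with `n = 8`, projecting along `(1, 0)` or `(1, 1)` puts 7 iteration vectors on the busiest processor, while projecting along `(1, -1)` puts only 4 there, so the choice of projection vector alone changes the bound on throughput.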

Field-Programmable Custom Computing Machines | 2006

Scalable Softcore Vector Processor for Biosequence Applications

Arpith C. Jacob; Brandon Harris; Jeremy Buhler; Roger D. Chamberlain; Young H. Cho

Currently available genome databases are growing exponentially in size, making it difficult for software analysis tools to keep up. A number of hardware accelerators utilizing special purpose VLSI (Blas, et al., 2005) or reconfigurable hardware (Hoang, 1993) have been proposed. However, they are inflexible; support for new applications usually requires a laborious redesign. None of these accelerators can be easily adapted to other applications that require differing hardware resources. The design philosophy of the softcore vector processor is based on two important goals: adaptability and performance. Instruction based execution allows programmable support for a large number of algorithms. The fact that different classes of applications require different subsets of hardware resources argues for a customizable hardware design built from primitives. The second goal was to achieve programmability without sacrificing performance. The SVP was designed to perform competitively with full custom solutions available in the market.
