
Publications


Featured research published by Venkata Jakkula.


International Symposium on Computer Architecture | 2010

A dynamically configurable coprocessor for convolutional neural networks

Srimat T. Chakradhar; Murugan Sankaradas; Venkata Jakkula; Srihari Cadambi

Convolutional neural network (CNN) applications range from recognition and reasoning (such as handwriting recognition, facial expression recognition and video surveillance) to intelligent text applications such as semantic text analysis and natural language processing. Two key observations drive the design of a new architecture for CNN. First, CNN workloads exhibit a widely varying mix of three types of parallelism: parallelism within a convolution operation, intra-output parallelism where multiple input sources (features) are combined to create a single output, and inter-output parallelism where multiple, independent outputs (features) are computed simultaneously. Workloads differ significantly across different CNN applications and across different layers of a CNN. Second, the number of processing elements in an architecture continues to scale (per Moore's law) much faster than the off-chip memory bandwidth (or pin count) of chips. Based on these two observations, we show that for a given number of processing elements and off-chip memory bandwidth, a new CNN hardware architecture that dynamically configures the hardware on the fly to match the specific mix of parallelism in a given workload gives the best throughput performance. Our CNN compiler automatically translates a high-abstraction network specification into a parallel microprogram (a sequence of low-level VLIW instructions) that is mapped, scheduled and executed by the coprocessor. Compared to a 2.3 GHz quad-core, dual-socket Intel Xeon, a 1.35 GHz C870 GPU, and a 200 MHz FPGA implementation, our 120 MHz dynamically configurable architecture is 4x to 8x faster. This is the first CNN architecture to achieve real-time video-stream processing (25 to 30 frames per second) on a wide range of object detection and recognition tasks.
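The three kinds of parallelism are easiest to see in the loop nest of a naive convolution layer. Below is a minimal pure-Python sketch (names and shapes are illustrative, not the paper's compiler or hardware), with each loop annotated with the parallelism axis it exposes:

```python
# Hedged sketch (not the paper's architecture): a naive convolution layer,
# annotated with the three parallelism axes identified in the paper.
def conv_layer(inputs, kernels):
    """inputs: n_in feature maps of size HxW; kernels: n_out x n_in filters of size kxk."""
    n_in, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    n_out, k = len(kernels), len(kernels[0][0])
    out = [[[0.0] * (W - k + 1) for _ in range(H - k + 1)] for _ in range(n_out)]
    for o in range(n_out):        # inter-output parallelism: each output map is independent
        for i in range(n_in):     # intra-output parallelism: input maps are summed into one output
            for y in range(H - k + 1):
                for x in range(W - k + 1):
                    # intra-convolution parallelism: these k*k multiply-accumulates are independent
                    out[o][y][x] += sum(inputs[i][y + dy][x + dx] * kernels[o][i][dy][dx]
                                        for dy in range(k) for dx in range(k))
    return out
```

Different layers weight these axes differently (early layers have large maps and few outputs; late layers the reverse), which is what motivates configuring the hardware per workload.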


Application-Specific Systems, Architectures and Processors | 2009

A Massively Parallel Coprocessor for Convolutional Neural Networks

Murugan Sankaradas; Venkata Jakkula; Srihari Cadambi; Srimat T. Chakradhar; Igor Durdanovic; Eric Cosatto; Hans Peter Graf

We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor's functional units, consisting of parallel 2D convolution primitives and programmable units performing sub-sampling and non-linear functions specific to CNNs, implement a “meta-operator” to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low-precision data, further increase the effective memory bandwidth by packing multiple words in every memory operation, and leverage the algorithm's simple data access patterns to use off-chip memory as a scratchpad for intermediate data, which is critical for CNNs. A CNN is mapped to the coprocessor's hardware primitives with instructions to transfer data between the memory and the coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1 GB. The coprocessor prototype can process at the rate of 3.4 billion multiply-accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.
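The word-packing idea can be illustrated in isolation: when operands are low precision, several of them fit in one wide memory word, so each memory transaction delivers more useful data. A toy sketch (field widths are illustrative; the paper's actual precisions and bus widths may differ):

```python
# Hedged sketch of word packing: four 16-bit operands share one 64-bit
# memory word, quadrupling the data delivered per memory transaction.
def pack16(vals):
    """Pack four unsigned 16-bit values into one 64-bit word."""
    assert len(vals) == 4 and all(0 <= v < (1 << 16) for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= v << (16 * i)   # value i occupies bits [16*i, 16*i + 15]
    return word

def unpack16(word):
    """Recover the four 16-bit fields from one 64-bit word."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]
```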


Field-Programmable Custom Computing Machines | 2009

A Massively Parallel FPGA-Based Coprocessor for Support Vector Machines

Srihari Cadambi; Igor Durdanovic; Venkata Jakkula; Murugan Sankaradass; Eric Cosatto; Srimat T. Chakradhar; Hans Peter Graf

We present a massively parallel FPGA-based coprocessor for Support Vector Machines (SVMs), a machine learning algorithm whose applications include recognition tasks such as learning scenes, situations and concepts, and reasoning tasks such as analyzing the recognized scenes and semantics. The coprocessor architecture, targeted at both SVM training and classification, is based on clusters of vector processing elements (VPEs) operating in single-instruction multiple-data (SIMD) mode to take advantage of the large amounts of data parallelism in the application. We use the FPGA’s DSP elements as parallel multiply-accumulators (MACs), a core computation in SVMs. A key feature of the architecture is that it is customized to low-precision arithmetic, which permits one DSP unit to perform two or more MACs in parallel. Low precision also reduces the required number of parallel off-chip memory accesses by packing multiple data words on the FPGA-memory bus. We have built a prototype using an off-the-shelf PCI-based FPGA card with a Xilinx Virtex 5 FPGA and 1 GB of DDR2 memory. For SVM training, we observe application-level end-to-end computation speeds of over 9 billion multiply-accumulates per second (GMACs). For SVM classification, using data packing, the application speed increases to 14 GMACs. The FPGA-based system is about 20x faster than a dual 2.2 GHz Opteron CPU and dissipates around 10 W of power.
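The claim that low precision lets one DSP unit perform two MACs in parallel rests on a standard packing trick: with small operands, a single wide multiplier can compute two products whose bit fields do not overlap. A hedged sketch of the arithmetic (operand widths are illustrative; the paper's exact field layout may differ):

```python
# Hedged sketch: two 8-bit multiplies from one wide multiplication.
# a * (b + c*2^16) = a*b + a*c*2^16, and since a*b < 2^16 the two
# products land in disjoint bit fields of the wide result.
def dual_mul8(a, b, c):
    assert all(0 <= v < 256 for v in (a, b, c))
    p = a * (b + (c << 16))   # one wide multiplication, as on a DSP slice
    ab = p & 0xFFFF           # low 16 bits hold a*b
    ac = p >> 16              # high bits hold a*c
    return ab, ac
```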


Design Automation Conference | 2003

CoCo: a hardware/software platform for rapid prototyping of code compression technologies

Haris Lekatsas; Jörg Henkel; Srimat T. Chakradhar; Venkata Jakkula; Murugan Sankaradass

In recent years, instruction code compression/decompression technologies have emerged as an efficient way to a) reduce the memory usage of an embedded system, b) improve performance through effectively higher bandwidths, and/or c) reduce the overall power consumption of a system processing compressed code. We have presented efficient code compression/decompression techniques and architectures in the past. For the commercialization phase, we designed a novel hardware/software code compression/decompression platform (CoCo). It consists of a software platform that prepares, optimizes, compresses and compiles instruction code, and a generic, parameterizable FPGA-based hardware platform that allows rapid evaluation of prototypes of diverse compression/decompression technologies. We show the flexibility of CoCo, its ability to achieve (parameterizable) code compression ratios of up to 50% with a slight system performance gain, and its ability to apply compression to real-world compiled code without the implicit software-restrictive assumptions that other approaches have made.


International Conference on VLSI Design | 2005

A unified architecture for adaptive compression of data and code on embedded systems

Haris Lekatsas; Jörg Henkel; Venkata Jakkula; Srimat T. Chakradhar

We present an architecture for compression/decompression of executable files running on embedded systems. Compression is important for memory reduction; previous work on memory reduction for embedded systems has focused on compressing the instruction segment of executable code before execution and decompressing it at runtime. Our work has shown that compressing only the instruction segment is not enough, as in many cases executable files contain large data areas that would benefit from compression as well. Compressing data areas presents new challenges to the embedded system designer; data can be modified during execution, and therefore a fast compression algorithm and intelligent memory management are required as well. We propose a novel compression/decompression framework that can handle both instructions and data, and show memory reductions of over 50% while keeping performance degradation within 12%.


IEEE Design & Test of Computers | 2004

Cypress: compression and encryption of data and code for embedded multimedia systems

Haris Lekatsas; Jörg Henkel; Srimat T. Chakradhar; Venkata Jakkula

Copyright protection of sensitive data plays a significant part in the design of multimedia systems. This article introduces a hardware platform that enables both compression and encryption for data and code in a unified architecture. Besides being parameterizable, the platform features software tools for evaluating and optimizing specific multimedia applications.


International Conference on VLSI Design | 2006

Using shiftable content addressable memories to double memory capacity on embedded systems

Haris Lekatsas; Jörg Henkel; Venkata Jakkula; Srimat T. Chakradhar

We present a novel algorithm and architecture for memory compression using a series of shiftable content addressable memories (S-CAMs). The main contribution of this new algorithm is its combination of an adaptive shared dictionary, used across all memory pages, with one or more local adaptive dictionaries that are flushed after compressing a single memory page. Compression/decompression is capable of handling various types of memory content, including application code and data. To this end, a fast compression/decompression architecture is necessary to move code and data from the non-compressed levels of the memory hierarchy to the compressed ones and vice versa. Our technique takes advantage of the fast parallel comparisons that S-CAMs provide to implement our dictionary-based compression algorithm. Our results show memory reductions that are substantially better (more than doubling the available memory for certain applications) than those of existing CAM-based memory approaches such as X-Match Pro. These results have been achieved using the newest embedded processor architectures, such as the Xtensa platform, which feature a dense instruction word encoding even without compression.
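The shared-plus-local dictionary scheme can be modeled in a few lines: a shared dictionary persists across pages, while a local dictionary learns literals within a page and is flushed afterward. A sequential toy model (the S-CAM hardware performs these dictionary lookups in parallel; the token format and dictionary sizes here are illustrative):

```python
# Hedged toy model of the two-dictionary scheme, not the paper's hardware:
# tokens are ('S', idx) for shared-dictionary hits, ('L', idx) for hits in
# the page-local dictionary, and ('RAW', word) for literals, which are
# also learned into the local dictionary as the page is scanned.
def compress_page(words, shared, local_size=16):
    local, out = [], []
    for w in words:
        if w in shared:
            out.append(('S', shared.index(w)))   # hit in shared dictionary
        elif w in local:
            out.append(('L', local.index(w)))    # hit in page-local dictionary
        else:
            out.append(('RAW', w))               # literal; learn it locally
            if len(local) < local_size:
                local.append(w)
    return out  # the local dictionary is discarded (flushed) after the page
```

The decompressor can rebuild the same local dictionary deterministically while scanning the token stream, so only the shared dictionary needs to persist.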


International Symposium on Computer Architecture | 2006

Chisel: A Storage-efficient, Collision-free Hash-based Network Processing Architecture

Jahangir Hasan; Srihari Cadambi; Venkata Jakkula; Srimat T. Chakradhar


Neural Information Processing Systems | 2008

A Massively Parallel Digital Learning Processor

Hans Peter Graf; Srihari Cadambi; Venkata Jakkula; Murugan Sankaradass; Eric Cosatto; Srimat T. Chakradhar; Igor Dourdanovic


Archive | 2004

Memory encryption architecture

Haris Lekatsas; Joerg Henkel; Srimat T. Chakradhar; Venkata Jakkula

Collaboration


Dive into Venkata Jakkula's collaborations.

Top Co-Authors

Jörg Henkel

Karlsruhe Institute of Technology
