Deependra Talla
Texas Instruments
Publication
Featured research published by Deependra Talla.
IEEE Transactions on Computers | 2003
Deependra Talla; Lizy Kurian John; Doug Burger
Multimedia SIMD extensions such as MMX and AltiVec speed up media processing; however, our characterization shows that the attributes of current general-purpose processors enhanced with SIMD extensions do not match very well with the access patterns and loop structures of media programs. We find that 75 to 85 percent of the dynamic instructions in the processor instruction stream are supporting instructions necessary to feed the SIMD execution units rather than true/useful computations, resulting in the underutilization of SIMD execution units (only 1 to 12 percent of the peak SIMD execution unit throughput is achieved). Rather than focusing on exploiting more data-level parallelism (DLP), we focus on the instructions that support the SIMD computations and exploit both fine- and coarse-grained instruction-level parallelism (ILP) in the supporting instruction stream. We propose the MediaBreeze architecture, which uses hardware support for efficient address generation, looping, and data reorganization (permute, packing/unpacking, transpose, etc.). Our results on multimedia kernels show that a 2-way processor with SIMD extensions enhanced with MediaBreeze provides better performance than a 16-way processor with current SIMD extensions. In the case of application benchmarks, a 2-/4-way processor with SIMD extensions augmented with MediaBreeze outperforms a 4-/8-way processor with SIMD extensions. A first-order approximation using ASIC synthesis tools and cell-based libraries shows that this acceleration is achieved at a 10 percent increase in the area required by MMX and SSE extensions (0.3 percent increase in overall chip area) and 1 percent of total processor power consumption.
international conference on computer design | 2000
Deependra Talla; Lizy Kurian John; Viktor S. Lapinskii; Brian L. Evans
This paper aims to provide a quantitative understanding of the performance of DSP and multimedia applications on very long instruction word (VLIW), single instruction multiple data (SIMD), and superscalar processors. We evaluate the performance of the VLIW paradigm using Texas Instruments Inc.'s TMS320C62xx processor and the SIMD paradigm using Intel's Pentium II processor (with MMX) on a set of DSP and media benchmarks. Tradeoffs in superscalar performance are evaluated with a combination of measurements on Pentium II and simulation experiments on the SimpleScalar simulator. Our benchmark suite includes kernels (filtering, autocorrelation, and dot product) and applications (audio effects, G.711 speech coding, and speech compression). Optimized assembly libraries and compiler intrinsics were used to create the SIMD and VLIW code. We used the hardware performance counters on the Pentium II and the stand-alone simulator for the C62xx to obtain the execution cycle counts. In comparison to non-SIMD Pentium II performance, the SIMD version exhibits a speedup ranging from 1.0 to 5.5, while the speedup of the VLIW version ranges from 0.63 to 9.0. The benchmarks are seen to contain large amounts of available parallelism; however, most of it is inter-iteration parallelism. Out-of-order execution and branch prediction are observed to be extremely important to exploit such parallelism in media applications.
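As a rough illustration of two of the kernels named in this abstract, the following is a minimal scalar sketch of a dot product and an autocorrelation in Python. It is not the benchmark code from the paper; the function names `dot_product` and `autocorrelation` and the `max_lag` parameter are assumptions made for this example. It shows why such loops carry inter-iteration parallelism: each multiply is independent of the others, and only the accumulation links one iteration to the next.

```python
from typing import List, Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Scalar reference dot product (hypothetical helper, not from the paper)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    acc = 0
    for x, y in zip(a, b):
        # Each product x * y is independent of every other iteration;
        # only the running sum carries a dependence across iterations.
        acc += x * y
    return acc


def autocorrelation(x: Sequence[float], max_lag: int) -> List[float]:
    """Return r[k] = sum over n of x[n] * x[n + k], for k = 0..max_lag."""
    n = len(x)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(x)")
    # Each lag k is itself a dot product of x with a shifted copy of x.
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]
```

A SIMD or VLIW version would compute several of these independent products per instruction or per cycle, which is the kind of parallelism the abstract refers to.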
international performance computing and communications conference | 2000
Deependra Talla; Lizy Kurian John
With the widespread use of 3D graphics, animation, speech recognition, and other media applications, general-purpose processors are increasingly spending their cycles on video and audio processing. However, the characteristics of media applications when executed on general-purpose processors are not well understood. Such knowledge is extremely important in guiding the design of future microprocessors and the development of media applications. In this paper we characterize the performance of multimedia applications on an Intel Pentium II processor-based system. Six different commercial multimedia applications belonging to the 3D graphics, streaming video, or streaming audio categories are executed on an Intel Pentium II processor and their performance is measured. Architectural data pertaining to the utilization of various hardware resources on the chip are collected using on-chip performance monitoring counters. Multimedia applications are seen to have fewer branch instructions than SPECint benchmarks, but more than SPECfp benchmarks. Despite a regular control flow and more available parallelism, the average number of cycles taken to execute an instruction is seen to be higher than that of SPECint. In many aspects, media applications exhibit a behavior between that of SPECint and SPECfp.
international conference on computer design | 2001
Deependra Talla; Lizy Kurian John
General-purpose microprocessors augmented with SIMD execution units enhance multimedia applications by exploiting data-level parallelism. However, supporting/overhead-related instructions, which account for 75-85% of the dynamic instructions, lead to an under-utilization of SIMD execution units, resulting in a throughput that ranges between 1% and 12% of the peak throughput. We accelerate multimedia applications by providing explicit hardware support to eliminate or reduce the impact of the supporting/overhead-related instructions. Performance evaluation shows that such hardware can significantly improve performance over conventional SIMD-enhanced general-purpose processors. We investigate the cost of incorporating hardware for efficient execution of supporting/overhead-related instructions into a high-speed SIMD-enhanced general-purpose processor and evaluate area, power, and timing tradeoffs. Our results indicate that the added hardware requires less than 10% of the SIMD execution units' chip area and 0.3% of the overall chip area, and that its power consumption is less than 1% of the total processor power. This is achieved without elongating the critical path of the processor.
ACM Sigarch Computer Architecture News | 2001
Deependra Talla; Lizy Kurian John
Decoupled architectures are fine-grain processors that partition the memory access and execute functions in a computer program and exploit the parallelism between the two functions. Although some concepts from the traditional decoupled access/execute paradigm made their way into commercial processors, they encountered resistance in general-purpose applications because these applications are not very structured and regular. However, multimedia applications have recently become a dominant workload on desktops and workstations. Media applications are very structured and regular and lend themselves well to the decoupling concept. In this paper, we present an architecture that decouples the useful/true computations from the overhead/supporting instructions in media applications. The proposed scheme is incorporated into an out-of-order general-purpose processor enhanced with SIMD extensions. Explicit hardware support is provided to exploit instruction-level parallelism in the overhead component. Performance evaluation shows that such hardware can significantly improve performance over conventional SIMD-enhanced general-purpose processors. Results on nine multimedia benchmarks show that the proposed MediaBreeze architecture provides a 1.05x to 16.7x performance improvement over a 2-way out-of-order SIMD machine. On introducing slip-based data prefetching, a performance improvement of up to 28x is observed.
european conference on parallel processing | 1999
Deependra Talla; Lizy Kurian John
Growth in DSP processors has been phenomenal and continues at a rapid pace, but general-purpose microprocessors have entered the multimedia and signal-processing-oriented stream by adding DSP functionality to the instruction set and by providing optimized assembly libraries. In this paper, we compare the performance of a general-purpose processor (Pentium II with MMX) versus a DSP processor (TI's C62xx) by evaluating the effectiveness of VLIW-style parallelism in the C62xx versus the SIMD parallelism in MMX on the Intel P6 microarchitecture. We also compare the execution speed of reliable, standard, and efficient C code with respect to the signal processing library (from Intel) by benchmarking a suite of DSP algorithms. We observed that the C62xx exhibited a speedup (ratio of execution clock cycles) ranging from 1.3 up to 4.0 over the Pentium II, and the NSP libraries had a speedup ranging from 0.8 to over 10 over the C code.
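The speedup metric in this abstract is defined as a ratio of execution clock cycles. A minimal sketch of that arithmetic follows, assuming the baseline cycle count goes in the numerator; the function name `speedup` and the cycle counts are hypothetical and are not measurements from the paper.

```python
def speedup(baseline_cycles: int, improved_cycles: int) -> float:
    """Return baseline_cycles / improved_cycles.

    Values above 1.0 mean the improved version needs fewer cycles;
    values below 1.0 (such as 0.8) mean it is slower than the baseline.
    """
    if baseline_cycles <= 0 or improved_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return baseline_cycles / improved_cycles


# Hypothetical cycle counts, chosen only to illustrate the arithmetic.
example = speedup(baseline_cycles=4000, improved_cycles=1000)
```

Under this definition, a reported speedup of 0.8 means the library version took more cycles than the plain C code for that benchmark.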
midwest symposium on circuits and systems | 1999
Deependra Talla; Lizy Kurian John
Effectiveness of MMX for digital signal processing on general-purpose processors (often referred to as Native Signal Processing) is evaluated. We benchmarked a variety of signal processing algorithms and obtained speedup (ratio of execution speeds) ranging from 1.0 to 4.0 by using MMX technology over non-MMX code. Efficient, reliable and standard C code is also evaluated with respect to NSP library assembly code from Intel.
international conference on information technology research and education | 2003
Deependra Talla; Lizy Kurian John
The past few years have seen significant research activity in understanding and characterizing multimedia applications on general-purpose processors (GPPs). While many of the commonly assumed characteristics turned out to be true, a few myths have been debunked. We present a few facts and myths that we came across while doing performance evaluation and benchmarking studies on multimedia applications from 1998 to 2002.
Archive | 2004
Damon Domke; Youngjun Yoo; Deependra Talla; Ching-Yu Hung
Archive | 2010
David E. Smith; Deependra Talla; Clay Dunsmore; Ching-Yu Hung