Eric J. Stotzer | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Eric J. Stotzer is active.

Explore More

Publication

Featured researches published by Eric J. Stotzer.

international parallel and distributed processing symposium | 2009

Implementing OpenMP on a high performance embedded multicore MPSoC

Barbara M. Chapman; Lei Huang; Eric Biscondi; Eric J. Stotzer; Ashish Rai Shrivastava; Alan Gatherer

In this paper we discuss our initial experiences adapting OpenMP to enable it to serve as a programming model for high performance embedded systems. A high-level programming model such as OpenMP has the potential to increase programmer productivity, reducing the design/development costs and time to market for such systems. However, OpenMP needs to be extended if it is to meet the needs of embedded application developers, who require the ability to express multiple levels of parallelism, real-time and resource constraints, and to provide additional information in support of optimization. It must also be capable of supporting the mapping of different software tasks, or components, to the devices configured in a given architecture.

international workshop on openmp | 2011

OpenMP for accelerators

James C. Beyer; Eric J. Stotzer; Alistair Hart; Bronis R. de Supinski

OpenMP [14] is the dominant programming model for shared-memory parallelism in C, C++ and Fortran due to its easy-touse directive-based style, portability and broad support by compiler vendors. Compute-intensive application regions are increasingly being accelerated using devices such as GPUs and DSPs, and a programming model with similar characteristics is needed here. This paper presents extensions to OpenMP that provide such a programming model. Our results demonstrate that a high-level programming model can provide accelerated performance comparable to that of hand-coded implementations in CUDA.

languages compilers and tools for embedded systems | 2001

Software Pipelining Irregular Loops On the TMS320C6000 VLIW DSP Architecture

Elana D. Granston; Eric J. Stotzer; Joe Zbiciak

The TMS320C6000 architecture is a leading family of Digital Signal Processors (DSPs). To achieve peak performance, this VLIW architecture relies heavily on software pipelining. Traditionally, software pipelining has been restricted to regular (FOR) loops. More recently, software pipelining has been extended to irregular (WHILE) loops, but only on architectures that provide special-purpose hardware such as rotating (predicate and general-purpose) register files, specific instructions for filling/draining software pipelined loops, and possibly hardware support for speculative code motion. In contrast, the TMS320C6000 family has a limited, static register file and no specialized hardware beyond the ability to predicate instructions using a few static registers. In this paper, we describe our experience extending a production compiler for the TMS320C6000 family to software pipeline irregular loops. We discuss our technique for preprocessing irregular loops so that they can be handled by the existing software pipeliner. Our approach is much simpler than previous approaches and works very well in the presence of the DSP applications and the target architecture which characterize our environment. With this optimization, we achieve impressive speedups on several key DSP and non-DSP algorithms.

languages, compilers, and tools for embedded systems | 1999

Modulo scheduling for the TMS320C6x VLIW DSP architecture

Eric J. Stotzer; Ernst L. Leiss

Digital Signal Processing (DSP) architectures are specialized for high performance numerical algorithms such as those found in communication and multimedia applications. The development of efficient compilers for DSP processors is a growing research area. The Texas Instruments TMS320C6x (C6x) is a Very Long Instruction Word (VLIW) DSP architecture capable of issuing eight operations in parallel. In this paper, we present the results of implementing a software pipelining algorithm for the C6x. We provide a description of the C6x and detail the architectural features that impact software pipelining such as a moderately sized register file, constraints on code size, homogeneous resources, and multiple assignment code. We discuss how we adapted modulo scheduling to implement software pipelining for the C6x. Finally, we present the results of modulo scheduling a set of 40 loop kernel benchmarks and measure the algorithm in terms of schedule quality and algorithm complexity.

international workshop on openmp | 2013

OpenMP on the Low-Power TI Keystone II ARM/DSP System-on-Chip

Eric J. Stotzer; Ajay Jayaraj; Murtaza Ali; Arnon Friedmann; Gaurav Mitra; Alistair P. Rendell; Ian Lintault

The Texas Instrument (TI) Keystone II architecture integrates an octa-core C66X DSP with a quad-core ARM Cortex A15 MPCore processor in a non-cache coherent shared memory environment. This System-on-a-Chip (SoC) offers very high Floating Point Operations per second (FLOPS) per Watt, if used efficiently. This paper reports an initial attempt at developing a bare-metal OpenMP runtime for the C66X multi-core DSP using the Open Event Machine RTOS. It also outlines an extension to OpenMP that allows code to run across both the ARM and the DSP cores simultaneously. Preliminary performance data for OpenMP constructs running on the ARM and DSP parts of the SoC are given and compared with other current processors.

symposium on computer architecture and high performance computing | 2012

Level-3 BLAS on the TI C6678 Multi-core DSP

Murtaza Ali; Eric J. Stotzer; Francisco D. Igual; Robert A. van de Geijn

Digital Signal Processors (DSP) are commonly employed in embedded systems. The increase of processing needs in cellular base-stations, radio controllers and industrial/medical imaging systems, has led to the development of multi-core DSPs as well as inclusion of floating point operations while maintaining low power dissipation. The eight-core DSP from Texas Instruments, codenamed TMS320C6678, provides a peak performance of 128 GFLOPS (single precision) and an effective 32 GFLOPS(double precision) for only 10 watts. In this paper, we present the first complete implementation and report performance of the Level-3 Basic Linear Algebra Subprograms(BLAS) routines for this DSP. These routines are first optimized for single core and then parallelized over the different cores using OpenMP constructs. The results show that we can achieve about 8 single precision GFLOPS/watt and 2.2double precision GFLOPS/watt for General Matrix-Matrix multiplication (GEMM). The performance of the rest of theLevel-3 BLAS routines is within 90% of the corresponding GEMM routines.

ieee international conference on high performance computing data and analytics | 2012

Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC

Francisco D. Igual; Murtaza Ali; Arnon Friedmann; Eric J. Stotzer; Timothy Wentz; Robert A. van de Geijn

Take a multicore Digital Signal Processor (DSP) chip designed for cellular base stations and radio network controllers, add floating-point capabilities to support 4G networks, and out of thin air a HPC engine is born. The potential for HPC is clear: It promises 128 GFLOPS (single precision) for 10 Watts; It is used in millions of network related devices and hence benefits from economies of scale; It should be simpler to program than a GPU. Simply put, it is fast, green, and cheap. But is it easy to use? In this paper, we show how this potential can be applied to general-purpose high performance computing, more specifically to dense matrix computations, without major changes in existing codes and methodologies, and with excellent performance and power consumption numbers.

international workshop on openmp | 2014

Implementation and Optimization of the OpenMP Accelerator Model for the TI Keystone II Architecture

Gaurav Mitra; Eric J. Stotzer; Ajay Jayaraj; Alistair P. Rendell

The TI Keystone II architecture provides a unique combination of ARM Cortex-A15 processors with high performance TI C66x floating-point DSPs on a single low-power System-on-chip (SoC). Commercially available systems such as the HP Proliant m800 and nCore BrownDwarf are based on this ARM-DSP SoC. The Keystone II architecture promises to deliver high GFLOPS/Watt and is of increasing interest as it provides an alternate building block for future exascale systems. However, the success of this architecture is intimately related to the ease of migrating existing HPC applications for maximum performance. Effective use of all ARM and DSP cores and DMA co-processors is crucial for maximizing performance/watt. This paper explores issues and challenges encountered while migrating the matrix multiplication (GEMM) kernel, originally written only for the C6678 DSP to the ARM-DSP SoC using an early prototype of the OpenMP 4.0 accelerator model. Single precision (SGEMM) matrix multiplication performance of 110.11 GFLOPS and and double precision (DGEMM) performance of 29.15 GFLOPS was achieved on the TI Keystone II Evaluation Module Revision 3.0 (EVM). Trade-offs and factors affecting performance are discussed.

international conference on supercomputing | 2002

Affinity-based cluster assignment for unrolled loops

Gayathri Krishnamurthy; Elana D. Granston; Eric J. Stotzer

To compete performance-wise, modern VLIW processors must have fast clock rates and high instruction-level parallelism (ILP). Partitioning resources (functional units and registers) into clusters allows the processor to be clocked faster, but operand transfers across clusters can easily become a bottleneck. Increasing the number of functional units increases the potential ILP, but only helps if the functional units can be kept busy.To support these features, optimizations such as loop unrolling must be applied to expose ILP, and instructions must be explicitly assigned to clusters to minimize cross-cluster transfers. In an architecture with homogeneous clusters, the number of functional units of a given type is typically a multiple of the number of clusters. Thus, it is common to unroll a loop so that the number of copies of the loop body is a multiple of the number of clusters. The result is that there is a natural mapping of instructions to clusters, which is often the best mapping. While this mapping can be obvious by inspection, we have found that existing cluster assignment algorithms often miss this natural split. The consequence is an excessive number of inter-cluster transfers, which slows down the loop.Because we were unable to find an existing cluster-assignment algorithm that performed well for unrolled loops, we developed our own. Our Affinity-Based Clustering (ABC) algorithm has been implemented in a production compiler for the Texas Instruments TMS320C6000, a two-cluster VLIW architecture. It is tailored for exploiting the patterns that result from either manual or compiler-based unrolling. As demonstrated experimentally, it performs well, even when post-unrolling optimizations partially obscure the natural split.

languages, compilers, and tools for embedded systems | 2009

Modulo scheduling without overlapped lifetimes

Eric J. Stotzer; Ernst L. Leiss

This paper describes complementary software- and hardware-based approaches for handling overlapping register lifetimes that occur in modulo scheduled loops. Modulo scheduling takes the N-instructions in a loop body and constructs an M-stage software pipeline. The length of each stage in the software pipeline is the Initiation Interval (II), which is the rate at which new loop iterations are started. An overlapped lifetime has a live range longer than the II, and as a consequence, the current iteration writes a new value to a register before a previous loop iteration has fin-ished using the old value. Hardware and software solutions for dealing with overlapped lifetimes have been proposed by re-searchers and also implemented in commercial products. These solutions include rotating register files, register queues, modulo variable expansion, and post-scheduling live range splitting. Each of these approaches has drawbacks for embedded systems such as an increase in silicon area, power consumption, and code size. Our approach, which is an improvement to the current solutions, prevents overlapped lifetimes through a combination of hardware and software techniques. The hardware element of our approach implements a register assignment latency that allows multiple in-flight writes to be pending to the same register. The software element of our approach uses dependence analysis and a constrained modulo scheduling algorithm to prevent overlapped lifetimes. We describe how to use these hardware and software techniques during modulo scheduling. Finally, we present the results of using our approach to compile embedded application code and present results in terms of modulo schedule quality and application performance.

Explore More