Johannes Kneip | Researchain


Publication

Featured research published by Johannes Kneip.

signal processing systems | 1998

The MPEG-4 video coding standard - a VLSI point of view

Johannes Kneip; Sven Bauer; J. Vollmer; Bernd Schmale; P. Kuhn; M. Reissmann

The paper presents an overview of the current status of the emerging MPEG-4 video coding standard and a discussion of the potential and problems for a practical implementation. Though the high flexibility of the standard suggests a software implementation on microprocessors or DSPs, a complexity analysis of the standard proved that the required processing power for a real-time codec implementation quickly reaches the limits even of future high-performance microprocessors. But even with its high number of different algorithms, the standard leaves enough design space for a successful implementation as an optimised but flexible low-cost, low-power solution. By identifying common arithmetic and transfer properties of the algorithms involved, a partitioning into a stream, video, and composition processor is proposed. Each of the units is programmable, but dedicated to the typical requirements of each algorithm class.

signal processing systems | 1999

Instruction Set Extensions for MPEG-4 Video

Mladen Berekovic; Hans-Joachim Stolberg; Mark Bernd Kulaczewski; Peter Pirsch; Henning Möller; Holger Runge; Johannes Kneip; Benno Stabernack

This paper describes instruction set extensions for the acceleration of MPEG-4 algorithms on programmable (RISC) CPUs. MPEG-4 standardizes audio and video compression schemes for a variety of bit rates and scenarios. As MPEG-4 targets a much broader range of different applications than previously defined hybrid video coding standards like H.263 or MPEG-2, it employs a much higher number of different algorithms and coding modes. Therefore, MPEG-4 implementations will require a more software-oriented approach to be efficient. However, the total computational load for an optimized implementation of an MPEG-4 video codec is expected to exceed the performance levels of today's multimedia signal processors, making further hardware acceleration a necessity. For that purpose, we propose a number of instruction set extensions that add function-specific blocks to the data path of a CPU. These dedicated modules are highly adapted to the most computation-intensive processing schemes of MPEG-4, such as DCT, motion compensation, padding, shape coding, or bitstream parsing. The increased functionality of basic instructions results in a significant speed-up over standard RISC instruction sets, thus making MPEG-4 implementations feasible on programmable processor platforms. Possible target architectures include VLIW multimedia processors, MIMD-style multiprocessors, or coprocessor architectures.

international symposium on microarchitecture | 1999

Applying and implementing the MPEG-4 multimedia standard

Johannes Kneip; Bernd Schmale; Henning Möller

MPEG-4's many algorithms and high-performance requirements pose a challenge for designers of a new generation of flexible, low-power coder/decoder VLSI implementations for mobile and portable applications. In addition to introducing MPEG-4, this article looks at how the standard will enhance existing applications and enable new ones, and then discusses the ongoing effort to design an optimized MPEG-4 processor platform.

signal processing systems | 1995

An algorithm adapted autonomous controlling concept for a parallel single-chip digital signal processor

Johannes Kneip; Jens Wittenburg; Mladen Berekovic; K. Ronner; Peter Pirsch

Recent sub-μ semiconductor technology supports the monolithic integration of multiprocessor systems. High wiring density and short on-chip memory access cycles motivate novel architecture concepts that outperform conventional parallel systems. An efficient controlling strategy is key to gaining high performance from limited silicon resources. In this paper, a controlling concept for a monolithic Autonomous Single-Instruction/Multiple-Data (ASIMD) processor is presented, which combines the high parallelism of an SIMD approach with the flexibility of standard DSP architectures. To demonstrate the performance gains of the concept, a digital video signal processor, the HiPAR-DSP, has been implemented. It consists of an array of 4 or 16 data paths, local memories for each data path, a shared memory with concurrent data access in the shape of a matrix, and a central RISC controller. A three-stage execution autonomy has been implemented, consisting of conditional instructions, conditional skip of instructions by the data paths, and global evaluation of local conditions by the central controller. This allows efficient execution of data-dependent medium- and high-level algorithms with very low controlling overhead. A performance of up to two arithmetic gigaoperations per second is achieved for algorithms with irregular data flow or control flow for the 100 MHz clocked processor with 16 data paths.
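The execution autonomy described in this abstract can be illustrated with a small simulation. This is only a minimal sketch, not the HiPAR-DSP's actual control logic: the names `Datapath`, `broadcast`, `set_flags`, `any_flag`, and `halve_until_within` are hypothetical, and the per-lane conditional instruction and conditional skip are merged into a single predicate check for brevity.

```python
from dataclasses import dataclass


@dataclass
class Datapath:
    """One lane of the array: a local accumulator plus a local condition flag."""
    acc: int
    flag: bool = False


def broadcast(lanes, op, conditional=False):
    """Issue one instruction to every lane.

    A conditional instruction takes effect only in lanes whose local
    flag is set; the other lanes skip it (assumed behaviour).
    """
    for lane in lanes:
        if not conditional or lane.flag:
            lane.acc = op(lane.acc)


def set_flags(lanes, predicate):
    """Each lane evaluates the condition on its own local data."""
    for lane in lanes:
        lane.flag = predicate(lane.acc)


def any_flag(lanes):
    """Global evaluation of the local conditions by the central controller."""
    return any(lane.flag for lane in lanes)


def halve_until_within(values, limit):
    """Data-dependent loop: halve each value until it no longer exceeds limit."""
    lanes = [Datapath(v) for v in values]
    set_flags(lanes, lambda a: a > limit)
    while any_flag(lanes):                      # controller-level branch
        broadcast(lanes, lambda a: a // 2, conditional=True)
        set_flags(lanes, lambda a: a > limit)
    return [lane.acc for lane in lanes]


print(halve_until_within([3, 40, 9, 100], 10))
```

All lanes receive the same instruction stream, yet each lane performs a different number of effective halvings, which is the kind of data-dependent behaviour the abstract attributes to the autonomous SIMD concept.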

signal processing systems | 1999

The MPEG-4 Multimedia Coding Standard: Algorithms, Architectures and Applications

Sven Bauer; Johannes Kneip; T. Mlasko; Bernd Schmale; J. Vollmer; A. Hutter; Mladen Berekovic

The upcoming MPEG-4 standard provides new possibilities for the compression and presentation of multimedia content. The main characteristics of MPEG-4 are the object-based coding and representation of an audio-visual scene and the ability to code objects of natural or synthetic origin. These features will enhance existing applications with new functionalities and enable standardised solutions for new applications. This paper provides an overview of the three major parts of the new MPEG-4 standard (Systems, Visual, and Audio), highlights implementation aspects for some envisaged types of MPEG-4 terminals, and describes possible future multimedia application scenarios using MPEG-4 functionalities.

international symposium on circuits and systems | 2000

The M-PIRE MPEG-4 codec DSP and its macroblock engine

Hans-Joachim Stolberg; Mladen Berekovic; Peter Pirsch; Holger Runge; Henning Möller; Johannes Kneip

M-PIRE is a programmable MPEG-4 multimedia codec VLSI for mobile and stationary applications. It integrates a RISC core, two separate DSPs, a 64-bit dual-issue VLIW macroblock engine, and an autonomous I/O processor on a single chip to cope with the high flexibility and processing demands of the MPEG-4 standard. The first M-PIRE implementation will consume 90 mm² in 0.25 μ CMOS technology. It will support real-time video and audio processing of MPEG-4 simple profile or ITU H.26x standards; future designs of M-PIRE will add support for higher MPEG-4 profiles. This paper focuses on the architecture, instruction set, and performance of M-PIRE's macroblock engine, which carries most of the workload in MPEG-4 video processing.

IEEE Transactions on Circuits and Systems for Video Technology | 1996

Architecture and applications of the HiPAR video signal processor

K. Ronner; Johannes Kneip

We propose the architecture of a highly parallel DSP (HiPAR-DSP) as a flexible and programmable processor for image and video processing. The design is based on an analysis of image processing algorithms in terms of available parallelization resources, demands on program control, and required data access mechanisms. This led to a very long instruction word (VLIW)-controlled ASIMD RISC architecture with four or sixteen data paths, employing data-level parallelism, parallel instructions, micro-instruction pipelining, and data transfer concurrent with data processing. Common data access patterns for image processing algorithms are supported by use of a shared on-chip memory with parallel matrix-type access patterns and a separate data cache per data path. By properly balancing processing and controlling capabilities as well as internal and external memory bandwidth, this approach is optimized to make the best use of currently available silicon resources. A high clock frequency is achieved by implementation of classic RISC features. The architecture fully supports high-level language programming. With the 16 data path version and a 100 MHz clock, a sustained performance of more than 2 billion arithmetic operations per second (GOPS) is achieved for a wide range of algorithms. The examples show the parallel implementation of image processing algorithms like histogramming, Hough transform, or search in a sorted list with efficient use of the processor resources. A prototype of the architecture with four parallel data paths is available, using a 0.6 μm CMOS technology.
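As an illustration of the data-level parallelism mentioned for histogramming, the following minimal sketch splits the pixels across a configurable number of data paths, lets each build a private histogram, and then merges the partial results. This is a common parallelization pattern and an assumption here, not the implementation described in the paper; the function name `parallel_histogram` and its parameters are hypothetical, and plain Python runs the lanes sequentially, so only the data partitioning is modelled.

```python
def parallel_histogram(pixels, num_paths=4, levels=256):
    """Histogram computed as num_paths private histograms plus a merge step."""
    # One private histogram per data path, as if held in its local memory.
    local = [[0] * levels for _ in range(num_paths)]

    # Interleaved distribution: data path p handles pixels p, p + num_paths, ...
    for p in range(num_paths):
        for value in pixels[p::num_paths]:
            local[p][value] += 1

    # Merge step: sum the private histograms bin by bin.
    return [sum(hist[v] for hist in local) for v in range(levels)]


print(parallel_histogram([0, 1, 1, 3, 3, 3, 2, 0], num_paths=4, levels=4))
```

Private histograms avoid write conflicts when several data paths see the same grey level in the same cycle, at the cost of the final merge.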

international conference on application specific array processors | 1994

A data path array with shared memory as core of a high performance DSP

Johannes Kneip; K. Ronner; Peter Pirsch

A data path array has been designed as the core of a digital signal processor architecture for image processing applications. Data supply to data paths and exchange of data among data paths is performed via an on-chip shared memory with a two-dimensional address space. Distribution of data onto these memory blocks enables simultaneous, conflict-free access to the shared memory by the data paths. Data that is accessed concurrently is addressed in the shape of a generalized matrix, i.e., a two-dimensional array with address offsets between neighbors. Additionally, each data path has autonomous addressing capabilities to a distributed local cache memory. The combination of shared memory communication among the data paths and address and control autonomy of the array elements leads to the powerful core of a high-performance DSP that is completed by a RISC-style controller and a DMA unit for data transfer. Simulation results proved that the processor will show high performance on a wide field of image processing applications. Assuming a 100 MHz clock frequency for a 4×4 array, the processor will perform a 1024-sample complex FFT within 33 μs including data I/O. The Hough transform of a 512×512 pel image with 30% black pels is performed within 66 ms, assuming 7-bit quantization for the angle and 11-bit quantization for the radius, thus achieving a sustained arithmetic performance of 2.8 giga operations per second (GOPS).
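The idea of conflict-free matrix access to a banked shared memory can be sketched as follows. The abstract does not state the actual address-to-block mapping, so this sketch assumes the simplest one: element (row, col) lives in bank (row mod 4, col mod 4). The function names `banks_touched` and `is_conflict_free` are hypothetical.

```python
def banks_touched(base_row, base_col, row_step, col_step, n=4):
    """Banks hit by an n x n matrix access with the given neighbour offsets.

    Assumed mapping: element (r, c) is stored in bank (r mod n, c mod n).
    """
    return [
        ((base_row + i * row_step) % n, (base_col + j * col_step) % n)
        for i in range(n)
        for j in range(n)
    ]


def is_conflict_free(base_row, base_col, row_step, col_step, n=4):
    """True if all n*n concurrent accesses land in distinct banks."""
    banks = banks_touched(base_row, base_col, row_step, col_step, n)
    return len(set(banks)) == len(banks)


print(is_conflict_free(0, 0, 1, 1))   # dense 4x4 block
print(is_conflict_free(5, 7, 3, 1))   # row offset 3, arbitrary base address
print(is_conflict_free(0, 0, 2, 1))   # row offset 2 shares a factor with 4
```

Under this assumed mapping, an access is conflict-free whenever both offsets are coprime with the number of banks per dimension, regardless of the base address; offsets sharing a factor with it would need a different data distribution.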

multimedia signal processing | 1997

An algorithm-hardware-system approach to VLIW multimedia processors

Johannes Kneip; Mladen Berekovic; Peter Pirsch

Very Long Instruction Word (VLIW) processor architectures for multimedia applications are discussed from an algorithm, hardware and system based point of view. VLIW processors show high flexibility and processing power, as well as a good utilization of resources by compiler-generated code, but their exclusive exploitation of instruction level parallelism (ILP) decreases in efficiency as the degree of parallelism increases. This is mainly caused by characteristics of multimedia algorithms, increasing wiring delays, compiler restrictions, and a widening gap between on-chip processing speed and available bandwidth to external memory. As new multimedia applications and standards continue to evolve (MPEG-4), the demand for higher processing power will continue. Therefore, parallel processing in all its available forms will have to be exploited to achieve significant performance improvements. We show that, due to the diminishing returns from a further increase in ILP, multimedia applications will benefit more from an additional exploitation of parallelism at thread-level. We examine how simultaneous multithreading (SMT), a novel architectural approach combining VLIW techniques with parallel processing of threads, can efficiently be used to further increase performance of typical multimedia workloads.

design automation conference | 1998

Realization of a programmable parallel DSP for high performance image processing applications

Jens Wittenburg; Mladen Berekovic; W. Hinrichs; H. Lieske; Johannes Kneip; H. Kloos; Martin Ohmacht; Peter Pirsch

The architecture and design of the HiPAR-DSP, a SIMD-controlled signal processor with parallel data paths, VLIW, and a novel memory design, are presented. The processor architecture is derived from an analysis of the target algorithms and specified in VHDL at register-transfer level. A team of more than 20 graduate students covered the whole design process, including the synthesizable VHDL description, synthesis, routing, and backannotation, as well as the development of a complete software development environment. The 175 mm², 0.5 μm 3LM CMOS design with 1.2 million transistors operates at 80 MHz and achieves a sustained performance of more than 600 million arithmetic operations per second.
