Petri Liuha | Researchain

Publication

Featured researches published by Petri Liuha.

IEEE Transactions on Very Large Scale Integration Systems | 2004

Multiple-symbol parallel decoding for variable length codes

Jari Nikara; Stamatis Vassiliadis; Jarmo Takala; Petri Liuha

In this paper, a multiple-symbol parallel variable length decoding (VLD) scheme is introduced. The scheme is capable of decoding all the codewords in an N-bit block of the encoded input data stream. The proposed method partially breaks the recursive dependency related to the VLD. First, all possible codewords in the block are detected in parallel and their lengths are returned. This procedure yields a redundant number of codeword lengths, from which incorrect values are removed by recursive selection. Next, the index for each symbol corresponding to a detected codeword is generated from the length, which determines the page, and the partial codeword, which defines the offset in the symbol table. The symbol lookup can be performed independently from the symbol table. Finally, the sum of the valid codeword lengths is provided to an external shifter that aligns the encoded input stream for a new decoding cycle. In order to prove feasibility and determine the limiting factors of our proposal, the variable length decoder has been implemented on a field-programmable gate array (FPGA) technology. When applied to MPEG-2 standard benchmark scenes, on average 4.8 codewords are decoded per cycle, resulting in a throughput of 106 million symbols per second.
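
The decoding steps described in this abstract can be illustrated in software. The following is a minimal sketch, not the authors' hardware design: the four-entry prefix code, the function names, and the 8-bit block size are all hypothetical, and the page/offset index generation is replaced by a plain dictionary lookup for brevity. The loop in `detect_lengths` stands in for detection that would happen in parallel in hardware.

```python
# Hypothetical toy prefix code; the real MPEG-2 tables are much larger.
CODE_TABLE = {"0": "A", "10": "B", "110": "C", "111": "D"}
MAX_LEN = max(len(code) for code in CODE_TABLE)


def detect_lengths(block):
    """Step 1: codeword length at every bit position (0 if none fits).

    Hardware would evaluate all positions at once; this loop is sequential.
    """
    lengths = []
    for pos in range(len(block)):
        found = 0
        for length in range(1, MAX_LEN + 1):
            candidate = block[pos:pos + length]
            if len(candidate) == length and candidate in CODE_TABLE:
                found = length
                break
        lengths.append(found)
    return lengths


def select_valid(lengths):
    """Step 2: keep only lengths reachable from bit 0 (recursive selection)."""
    valid = []
    pos = 0
    while pos < len(lengths) and lengths[pos] > 0:
        valid.append((pos, lengths[pos]))
        pos += lengths[pos]
    return valid


def decode_block(block):
    """Steps 3 and 4: look up symbols and report the shift amount."""
    valid = select_valid(detect_lengths(block))
    symbols = [CODE_TABLE[block[pos:pos + length]] for pos, length in valid]
    shift = sum(length for _, length in valid)
    return symbols, shift


def decode_stream(bits, block_size=8):
    """Decode a whole bit string block by block, shifting by the valid lengths."""
    out = []
    pos = 0
    while pos < len(bits):
        symbols, shift = decode_block(bits[pos:pos + block_size])
        if shift == 0:  # no complete codeword left in the block
            break
        out.extend(symbols)
        pos += shift
    return out
```

In this sketch, a codeword that straddles the block boundary is simply left for the next cycle, because the shift amount counts only complete codewords.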

IEEE Transactions on Circuits and Systems for Video Technology | 2002

Overview of research efforts on media ISA extensions and their usage in video coding

Ville Lappalainen; Timo D. Hämäläinen; Petri Liuha

This paper summarizes the results of over 25 research groups or individual researchers that have presented video coding implementations on general-purpose processors with the new single instruction multiple data media instruction set architecture extensions. The extensions are introduced and the fundamentals for extensions, as well as some inherent problems, are explained. The reported attempts to utilize the extensions are divided into kernel- and application-level, as well as platform dependent and independent optimizations. Optimized applications include, in addition to some proprietary methods, all of the major video coding standards such as H.261, H.263, MPEG-4, MPEG-1, and MPEG-2. These optimized implementations include a complete video codec, several decoders, and several encoders. Additionally, a performance comparison is given for four representative encoder implementations based on the reported results. Also included is an overview of future trends for new instructions and architectural speed-up techniques.

International Conference on Computer Design | 2002

Parallel multiple-symbol variable-length decoding

Jari Nikara; Stamatis Vassiliadis; Jarmo Takala; Mihai Sima; Petri Liuha

In this paper, a parallel Variable-Length Decoding (VLD) scheme is introduced. The scheme is capable of decoding all the codewords in an N-bit buffer whose accumulated codelength is at most N. The proposed method partially breaks the recursive dependency related to the MPEG-2 VLD. All possible codewords in the buffer are detected in parallel and the sum of the codelengths is provided to the external shifter aligning the variable-length coded input stream for a new decoding cycle. Two length detection mechanisms are proposed: the first approach determines the length in a parallel/serial fashion and the second uses a new device denoted as MultiplexedAdd. In order to prove feasibility and determine the limiting factors of our proposal, the parallel/serial codeword detector with 32-bit input has been described in behavioral non-optimized VHDL and mapped onto Altera's ACEX EP1K100 FPGA. The implemented prototype exhibits a latency of 110 ns and uses 32% of the logic cells of the device. When applied to MPEG-2 standard benchmark scenes, on average 3.5 symbols are decoded per cycle.

Languages, Compilers, and Tools for Embedded Systems | 2004

GraalBench: a 3D graphics benchmark suite for mobile phones

Iosif Antochi; Ben H. H. Juurlink; Stamatis Vassiliadis; Petri Liuha

In this paper, we consider implementations of embedded 3D graphics and provide evidence indicating that 3D benchmarks employed for desktop computers are not suitable for mobile environments. Consequently, we present GraalBench, a set of 3D graphics workloads representative of contemporary and emerging mobile devices. In addition, we present detailed simulation results for a typical rasterization pipeline. The results show that the proposed benchmarks use only a part of the resources offered by current 3D graphics libraries. For instance, while each benchmark uses the texturing unit for more than 70% of the generated fragments, the alpha unit is employed for less than 13% of the fragments. The fog unit was used for 84% of the fragments by one benchmark, but the other benchmarks did not use it at all. Our experiments on the proposed suite suggest that the texturing, depth and blending units should be implemented in hardware, while, for instance, the dithering unit may be omitted from a hardware implementation. Finally, we discuss the architectural implications of the obtained results for hardware implementations.

International Conference / Workshop on Embedded Computer Systems: Architectures, Modeling and Simulation | 2004

Memory Bandwidth Requirements of Tile-Based Rendering

Iosif Antochi; Ben H. H. Juurlink; Stamatis Vassiliadis; Petri Liuha

Because mobile phones are omnipresent and equipped with displays, they are attractive platforms for rendering 3D images. However, because they are powered by batteries, a graphics accelerator for mobile phones should dissipate as little energy as possible. Since external memory accesses consume a significant amount of power, techniques that reduce the amount of external data traffic also reduce the power consumption. A technique that looks promising is tile-based rendering. This technique decomposes a scene into tiles and renders the tiles one by one. This allows the color components and z values of one tile to be stored in small, on-chip buffers, so that only the pixels visible in the final scene need to be stored in the external frame buffer. However, in a tile-based renderer each triangle may need to be sent to the graphics accelerator more than once, since it might overlap more than one tile. In this paper we measure the total amount of external data traffic produced by conventional and tile-based renderers using several representative OpenGL benchmark scenes. The results show that employing a tile size of 32 × 32 pixels generally yields the best trade-off between the amount of on-chip memory and the amount of external data traffic. In addition, the results show that overall, a tile-based architecture reduces the total amount of external data traffic by a factor of 1.96 compared to a traditional architecture.

Signal Processing Systems | 2002

Architectures for the sum of absolute differences operation

David Guevorkian; Aki Launiainen; Petri Liuha; Ville Lappalainen

Efficient architectures for computing the sum of absolute differences (SAD) between two data sets are proposed for motion estimation in a mobile video coding system. The proposed architectures combine and further develop the advantages of two earlier proposed architectures. As a result, higher performance is achieved at a lower cost (gate count and power consumption) than with a conventional architecture. The proposed architectures are suitable for integration into mobile video processing systems. They support not only regular, data-independent motion estimation strategies but all strategies based on the SAD criterion. Early termination mechanisms included in the proposed architecture make it possible to avoid the unnecessary computations that often take place in conventional SAD architectures without such mechanisms.
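
As a software illustration of the SAD criterion and of early termination, the following minimal sketch abandons a candidate block as soon as its partial sum can no longer beat the best match found so far. It is an assumption-laden example rather than the proposed architecture: the function names, the row-by-row termination check, and the list-of-rows block layout are all hypothetical.

```python
def sad_early_exit(block_a, block_b, best_so_far):
    """Return the SAD of two equally sized blocks, or None if it is abandoned.

    The partial sum is checked after every row; once it reaches best_so_far
    the candidate cannot win, so the remaining rows are skipped.
    """
    total = 0
    for row_a, row_b in zip(block_a, block_b):
        total += sum(abs(a - b) for a, b in zip(row_a, row_b))
        if total >= best_so_far:
            return None
    return total


def best_match(current, candidates):
    """Pick the candidate block with the smallest SAD against `current`."""
    best_index, best_sad = None, float("inf")
    for index, candidate in enumerate(candidates):
        sad = sad_early_exit(current, candidate, best_sad)
        if sad is not None:
            best_index, best_sad = index, sad
    return best_index, best_sad
```

Checking once per row rather than once per pixel is one possible trade-off between the cost of the comparison and the amount of work saved.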

International Conference / Workshop on Embedded Computer Systems: Architectures, Modeling and Simulation | 2004

High-Level Energy Estimation for ARM-Based SOCs

Dan Crisu; Sorin Cotofana; Stamatis Vassiliadis; Petri Liuha

In recent years, power consumption has become a critical concern for many VLSI systems. Whereas several case studies demonstrate that technology-, layout-, and gate-level techniques offer power savings of a factor of two or less, architecture and system-level optimization can often result in orders of magnitude lower power consumption. Therefore, the energy-efficient design of portable, battery-powered systems demands an early assessment, i.e., at the algorithmic and architectural levels, of the power consumption of the applications they target. Addressing this issue, we developed an energy-aware architectural design exploration and analysis tool for ARM based system-on-chip designs. The tool integrates the behavior and energy models of several user-defined, custom processing units as an extension to the cycle-accurate instruction-level simulator for the ARM low-power processor family, called the ARMulator. The models we implemented take into account the particular class, e.g., datapath, memory, control, or interconnect, as well as the architectural complexity of the hardware unit involved and the signal activity triggered by the specific algorithm executed on the ARM processor. Our tool can estimate at the architectural level of detail the overall energy consumption or can report the energy breakdown among different units. Preliminary experiments indicated that the estimation accuracy is within 25% of what can be accomplished after a circuit-level simulation on the laid-out chip.

Digital Systems Design | 2004

Scene management models and overlap tests for tile-based rendering

Iosif Antochi; Ben H. H. Juurlink; Stamatis Vassiliadis; Petri Liuha

Tile-based rendering (also called chunk rendering or bucket rendering) is a promising technique for low-power 3D graphics platforms. This technique decomposes a scene into smaller regions called tiles and renders the tiles one by one. The advantage of this scheme is that a small memory integrated on the graphics accelerator can be used to store the color components and z values of one tile, so that accesses to the frame and z buffer are local, on-chip accesses which consume significantly less power than off-chip accesses. Tile-based rendering, however, requires that the primitives (commonly triangles) are sorted into bins corresponding to the tiles. This paper describes several algorithms for sorting the primitives into bins and evaluates their computational complexity and memory requirements. In addition, we present and evaluate several tests for determining whether a triangle and a tile overlap. Experimental results obtained using several suitable 3D graphics workloads show that various trade-offs can be made and that better performance can usually be obtained at the cost of more memory. This information allows the designer to select the appropriate method depending on the amount of memory available and the computational power.
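
To make the binning step concrete, here is a minimal sketch of one simple overlap test, the axis-aligned bounding-box test, used to sort triangles into tile bins. The abstract does not say which tests the paper evaluates, so this choice is an assumption, as are the function name, the integer pixel coordinates, and the dictionary-of-lists bin layout.

```python
def bin_triangles(triangles, width, height, tile=32):
    """Sort triangles into per-tile bins using a bounding-box overlap test.

    Each triangle is a tuple of three (x, y) integer pixel coordinates.
    Returns a dict mapping (tile_x, tile_y) to a list of triangle indices.
    """
    cols = (width + tile - 1) // tile   # number of tile columns
    rows = (height + tile - 1) // tile  # number of tile rows
    bins = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}

    for index, triangle in enumerate(triangles):
        xs = [x for x, _ in triangle]
        ys = [y for _, y in triangle]
        # Clamp the bounding box to the screen, expressed in tile units.
        tx0 = max(0, min(xs) // tile)
        tx1 = min(cols - 1, max(xs) // tile)
        ty0 = max(0, min(ys) // tile)
        ty1 = min(rows - 1, max(ys) // tile)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins
```

The bounding-box test is conservative: the triangle (20, 20), (40, 20), (20, 40) on a 64 × 64 screen lands in all four 32 × 32 tiles even though it does not actually touch the lower-right one. A more exact test would avoid such false positives at the price of extra computation per triangle.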

Design, Automation, and Test in Europe | 2004

GRAAL - a development framework for embedded graphics accelerators

Dan Crisu; Sorin Cotofana; Stamatis Vassiliadis; Petri Liuha

This paper presents a versatile hardware/software co-simulation and co-design environment for embedded 3D graphics accelerators. The graphics accelerator design exploration framework (GRAAL) is an open system which offers a coherent development methodology based on an extensive library of SystemC RTL models of graphics pipeline components. GRAAL incorporates tools to assist in the visual debugging of the graphics algorithms implemented in hardware, and to estimate the performance in terms of throughput, power consumption, and area.

IEEE Transactions on Circuits and Systems for Video Technology | 2005

A method for designing high-radix multiplier-based processing units for multimedia applications

David Guevorkian; Aki Launiainen; Ville Lappalainen; Petri Liuha; Konsta Punkka

Multifunctional architectures for video and image processing (MAVIP), intended for use in multimedia systems, are proposed. MAVIP is a family of reconfigurable architectures derived from a single high-radix (4, 8, or 16) multiplier structure where: a) the list of potential partial products obtained at the first stage of multiplication may be reused; b) pipeline stages may be parallelised at different levels to achieve the required clock frequency and to improve balancing between these stages; and c) interconnections between the operational blocks may be multiplexed to make the structure multifunctional and to allow reuse of the basic multiplier blocks. The same device may operate either as a programmable processing unit with digital signal processor-specific operations or as a reconfigurable ASIC. Despite its small size, MAVIP shows competitive performance in video coding applications.
