Eero Aho | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Eero Aho is active.

Explore More

Publication

Featured researches published by Eero Aho.

IEEE Transactions on Circuits and Systems for Video Technology | 2006

A High-Performance Sum of Absolute Difference Implementation for Motion Estimation

Jarno Vanne; Eero Aho; Timo D. Hämäläinen; Kimmo Kuusilinna

This paper presents a high-performance sum of absolute difference (SAD) architecture for motion estimation, which is the most time-consuming and compute-intensive part of video coding. The proposed architecture contains novel and efficient optimizations to overcome bottlenecks discovered in existing approaches. In addition, designed sophisticated control logic with multiple early termination mechanisms further enhance execution speed and make the architecture suitable for general-purpose usage. Hence, the proposed architecture is not restricted to a single block-matching algorithm in motion estimation, but a wide range of algorithms is supported. The proposed SAD architecture outperforms contemporary architectures in terms of execution speed and area efficiency. The proposed architecture with three pipeline stages, synthesized to a 0.18-mum CMOS technology, can attain 770-MHz operating frequency at a cost of less than 5600 gates. Correspondingly, performance metrics for the proposed low-latency 2-stage architecture are 730 MHz and 7500 gates

IEEE Transactions on Circuits and Systems for Video Technology | 2009

A Configurable Motion Estimation Architecture for Block-Matching Algorithms

Jarno Vanne; Eero Aho; Kimmo Kuusilinna; Timo D. Hämäläinen

This paper introduces a configurable motion estimation architecture for a wide range of fast block-matching algorithms (BMAs). Contemporary motion estimation architectures are either too rigid for multiple BMAs or the flexibility in them is implemented at the cost of reduced performance. The proposed architecture overcomes both of these limitations. The configurability of the proposed architecture is based on a new BMA framework that can be adjusted to support the desired set of BMAs. The chosen framework configuration is implemented by an intelligent control logic which is integrated to an efficient parallel memory system and distortion computation unit. The flexibility of the framework is demonstrated by mapping five different BMAs (BBGDS, DS, CDS, HEXBS, and TSS) to the architecture. The total execution time of the mapped BMAs is shown to be almost directly proportional to the number of tested checking points in the search area, so the architecture is very tolerant of different BMA-specific search strategies and search patterns. In addition, a run-time switching between supported BMAs can be done without performance compromises. With a 0.13-mum CMOS technology, the proposed architecture configured for HEXBS, BBGDS, and TSS requires only 14.2 kgates and 2.5 KB of memory at 200 MHz operating frequency. A performance comparison to the reference programmable architectures reveals that only the proposed implementation is able to process real-time (30 fps) fixed block-size motion estimation (1 reference frame) at full HDTV resolution (1920 times1080).

IEEE Transactions on Circuits and Systems for Video Technology | 2008

A Parallel Memory System for Variable Block-Size Motion Estimation Algorithms

Jarno Vanne; Eero Aho; Timo D. Hämäläinen; Kimmo Kuusilinna

This paper proposes an efficient parallel memory system for algorithms applied in fixed and variable block-size motion estimation (VBSME). The proposed system is implemented by a novel combination of two parallel memory architectures. The distribution of data among the memory modules is modified over contemporary approaches and the optimized address computation unit enables a rapid address generation for accessed memory locations. Furthermore, the introduced data permutation scheme organizes data efficiently for storage and retrieval. The proposed system enables up to 4 X speedup in data storage and retrieves data up to 55% faster for VBSME compared with the reference implementations. With a 0.18- mum CMOS technology, the proposed memory addressing and data permutation scheme can be clocked at 980 MHz operating frequency with a cost of less than 6 kgates. On FPGA, the system can operate at 200 MHz with less than 700 logic elements. The results show that the proposed system is applicable to real-time VBSME at HDTV resolution.

IEEE Transactions on Circuits and Systems | 2005

Block-level parallel processing for scaling evenly divisible images

Eero Aho; Jarno Vanne; Timo D. Hämäläinen; Kimmo Kuusilinna

Image scaling is a frequent operation in medical image processing. This paper presents how two-dimensional (2-D) image scaling can be accelerated with a new coarse-grained parallel processing method. The method is based on evenly divisible image sizes which is, in practice, the case with most medical images. In the proposed method, the image is divided into slices and all the slices are scaled in parallel. The complexity of the method is examined with two parallel architectures while considering memory consumption and data throughput. Several scaling functions can be handled with these generic architectures including linear, cubic B-spline, cubic, Lagrange, Gaussian, and sinc interpolations. Parallelism can be adjusted independent of the complexity of the computational units. The most promising architecture is implemented as a simulation model and the hardware resources as well as the performance are evaluated. All the significant resources are shown to be linearly proportional to the parallelization factor. With contemporary programmable logic, real-time scaling is achievable with large resolution 2-D images and a good quality interpolation. The proposed block-level scaling is also shown to increase software scaling performance over four times.

IEEE Transactions on Circuits and Systems for Video Technology | 2005

Comments on "Winscale: an image-scaling algorithm using an area pixel Model"

Eero Aho; Jarno Vanne; Kimmo Kuusilinna; Timo D. Hämäläinen

In the paper by Kim et al. (2003), the authors propose a new image scaling method called winscale. The presented method can be used for scaling up and down. However, scaling down utilizing the winscale concept gives exactly the same results as the well-known bilinear interpolation. Furthermore, compared to bilinear, scaling up with the proposed winscale overlap stamping method has very similar calculations. The basic winscale upscaling differs from the bilinear method.

design, automation, and test in europe | 2009

A case for multi-channel memories in video recording

Eero Aho; Jari Nikara; Petri Antero Tuominen; Kimmo Kuusilinna

In video recording, ever increasing demands on image resolution, frame rate, and quality necessitate a lot of memory bandwidth and energy. This paper presents and evaluates such a potential memory load in future handheld multimedia devices. Based on the achieved simulation results, the multi-channel memories provide the capability for high bandwidth without excessive overhead in terms of energy consumption. A full HDTV (1080p) quality video recording with H.264/AVC encoding at 30 frames per second (fps) is found here to require 4.3 GB/s memory bandwidth. According to the simulations, this memory requirement can be fulfilled with four 32-bit memory channels operating at 400 MHz and consuming 345 mW of power. As another example, 400 MHz 8-channel memory configuration is able to provide the required bandwidth for video recording with up to 3840times2160@30 fps. Die stacking is the technology thought to be able to provide the required bandwidth, sufficiently low power consumption, and the multi-channel memory organization.

international conference on embedded computer systems: architectures, modeling, and simulation | 2006

Parallel Memory Implementation for Arbitrary Stride Accesses

Eero Aho; Jarno Vanne; Timo D. Hämäläinen

Parallel memory modules can be used to increase memory bandwidth and feed a processor with only necessary data. Arbitrary stride access capability with interleaved memories is described in previous research where the skewing scheme is changed at run time according to the currently used stride. This paper presents the improved schemes which are adapted to parallel memories. The proposed novel parallel memory implementation allows conflict free accesses with all the constant strides which has not been possible in prior application specific parallel memories. Moreover, the possible access locations are unrestricted and the data patterns have equal amount of accessed data elements as the number of memory modules. Timing and area estimates are given for Altera Stratix FPGA and 0.18 micrometer CMOS process with memory module count from 2 to 32. The FPGA results show 129 MHz clock frequency for a system with 16 memory modules when read and write latencies are 3 and 2 clock cycles, respectively

design and diagnostics of electronic circuits and systems | 2006

Parallel Memory Architecture for Arbitrary Stride Accesses

Eero Aho; Jarno Vanne; Timo D. Hämäläinen

Parallel memory modules can be used to increase memory bandwidth and feed a processor with only necessary data. Arbitrary stride access capability with interleaved memories is described in previous research where the skewing scheme is changed at run time according to the currently used stride. This paper presents the improved schemes which are adapted to parallel memories. The proposed novel parallel memory architecture allows conflict free accesses with all the constant strides which has not been possible in prior application specific parallel memories. Moreover, the possible access locations are unrestricted and the data patterns have equal amount of accessed data elements as the number of memory modules. The complexity is evaluated with resource counts

Microprocessors and Microsystems | 2007

Configurable implementation of parallel memory based real-time video downscaler

Eero Aho; Jarno Vanne; Timo D. Hämäläinen; Kimmo Kuusilinna

Image downscaling is necessary in multiresolution video streaming and when a camera captures larger resolution frames than required. This paper presents an implementation of a downscaler capable of real-time scaling of color video. The scaler can be configured to support nearly arbitrary scaling ratios. The scaling method is based on evenly divisible image sizes, which is, in practice, the case in most video and image standards. Bilinear interpolation is utilized as the scaling algorithm. Fine-grained parallel processing is utilized to increase performance and parallel memories are used to attain the required bandwidth. The results show that an FPGA implementation can downscale 16VGA and HDTV video in real-time with a complexity of less than half of the reference implementations.

signal processing systems | 2008

Configurable data memory for multimedia processing

Eero Aho; Jarno Vanne; Timo D. Hämäläinen

In modern multimedia applications, memory bottleneck can be alleviated with special stride data accesses. Data elements in stride access can be retrieved in parallel with parallel memories, in which the idea is to increase memory bandwidth with several memory modules working in parallel and feed the processor with only necessary data. Arbitrary stride access capability with interleaved memories is described in previous research where the skewing scheme is changed at run time according to the currently used stride. This paper presents the improved schemes which are adapted to parallel memories. The proposed novel parallel memory implementation allows conflict free accesses with all the constant strides which has not been possible in prior application specific parallel memories. Moreover, the possible access locations are unrestricted and the accessed data element count equals to the number of memory modules. Timing and area estimates are given for Altera Stratix FPGA and 0.18 micrometer CMOS process with memory module count from 2 to 32. The FPGA results show 129xa0MHz clock frequency for a system with 16 memory modules when read and write latencies are 3 and 2 clock cycles, respectively. The complexity of the proposed system is shown to be a trade-off between application specific and highly configurable parallel memory system.

Explore More