
Jarno Vanne

Tampere University of Technology

Publication

Featured research published by Jarno Vanne.

IEEE Transactions on Circuits and Systems for Video Technology | 2012

Comparative Rate-Distortion-Complexity Analysis of HEVC and AVC Video Codecs

Jarno Vanne; Marko Viitanen; Timo D. Hämäläinen; Antti Hallapuro

This paper analyzes the rate-distortion-complexity of the High Efficiency Video Coding (HEVC) reference video codec (HM) and compares the results with the AVC reference codec (JM). The examined software codecs are HM 6.0 using Main Profile (MP) and JM 18.0 using High Profile (HiP). These codecs are benchmarked under the all-intra (AI), random access (RA), low-delay B (LB), and low-delay P (LP) coding configurations. In order to obtain a fair comparison, the JM HiP anchor codec has been configured to conform to HM MP settings and coding configurations. The rate-distortion comparisons rely on objective quality assessments, i.e., bit rate differences for equal PSNR. The complexities of HM and JM have been profiled at the cycle level with Intel VTune on an Intel Core 2 Duo processor. The coding efficiency of HEVC is drastically better than that of AVC. According to our experiments, the average bit rate decrements of HM MP over JM HiP are 23%, 35%, 40%, and 35% under the AI, RA, LB, and LP configurations, respectively. However, HM achieves its coding gain with a realistic overhead in complexity. Our profiling results show that the average software complexity ratios of HM MP and JM HiP encoders are 3.2× in the AI case, 1.2× in the RA case, 1.5× in the LB case, and 1.3× in the LP case. The respective ratios with HM MP and JM HiP decoders are 2.0×, 1.6×, 1.5×, and 1.4×. This paper also reveals the bottlenecks of the HM codec and provides implementation guidelines for future real-time HEVC codecs.

IEEE Transactions on Circuits and Systems for Video Technology | 2006

A High-Performance Sum of Absolute Difference Implementation for Motion Estimation

Jarno Vanne; Eero Aho; Timo D. Hämäläinen; Kimmo Kuusilinna

This paper presents a high-performance sum of absolute difference (SAD) architecture for motion estimation, which is the most time-consuming and compute-intensive part of video coding. The proposed architecture contains novel and efficient optimizations to overcome bottlenecks discovered in existing approaches. In addition, sophisticated control logic with multiple early termination mechanisms further enhances execution speed and makes the architecture suitable for general-purpose usage. Hence, the proposed architecture is not restricted to a single block-matching algorithm in motion estimation, but supports a wide range of algorithms. The proposed SAD architecture outperforms contemporary architectures in terms of execution speed and area efficiency. The proposed architecture with three pipeline stages, synthesized to a 0.18-μm CMOS technology, can attain 770-MHz operating frequency at a cost of less than 5600 gates. Correspondingly, performance metrics for the proposed low-latency 2-stage architecture are 730 MHz and 7500 gates.
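The idea of SAD with early termination can be illustrated in software. The following is a minimal Python sketch, not the hardware architecture described in the abstract: the function names (`sad_early_exit`, `block_at`, `best_match`) and the row-granular termination check are assumptions chosen for illustration only.

```python
def sad_early_exit(cur, ref, best_so_far):
    """Sum of absolute differences between two equally sized blocks.

    Returns None as soon as the partial sum reaches best_so_far,
    because the candidate can no longer beat the current best match.
    The check is done once per row (an illustrative choice).
    """
    total = 0
    for row_c, row_r in zip(cur, ref):
        for a, b in zip(row_c, row_r):
            total += abs(a - b)
        if total >= best_so_far:
            return None  # early termination
    return total


def block_at(frame, top, left, size):
    """Extract a size x size block from a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_match(cur, frame, size, candidates):
    """Test each candidate position and keep the one with the lowest SAD."""
    best_cost, best_pos = float("inf"), None
    for top, left in candidates:
        cost = sad_early_exit(cur, block_at(frame, top, left, size), best_cost)
        if cost is not None:
            best_cost, best_pos = cost, (top, left)
    return best_pos, best_cost


# Usage: search a 4x4 frame for the 2x2 block taken from position (1, 1).
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
cur = block_at(frame, 1, 1, 2)
candidates = [(t, l) for t in range(3) for l in range(3)]
print(best_match(cur, frame, 2, candidates))
```

Because `best_match` only receives a list of candidate positions, the same SAD routine can serve any block-matching search pattern, which mirrors the abstract's point that the SAD unit is not tied to a single algorithm.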

IEEE Transactions on Circuits and Systems for Video Technology | 2014

Efficient Mode Decision Schemes for HEVC Inter Prediction

Jarno Vanne; Marko Viitanen; Timo D. Hämäläinen

The emerging High Efficiency Video Coding (HEVC) standard reduces the bit rate by almost 40% over the preceding state-of-the-art Advanced Video Coding (AVC) standard with the same objective quality but at about 40% encoding complexity overhead. The main reason for HEVC complexity is inter prediction, which accounts for 60%-70% of the whole encoding time. This paper analyzes the rate-distortion-complexity characteristics of the HEVC inter prediction as a function of different block partition structures and puts the analysis results into practice by developing optimized mode decision schemes for the HEVC encoder. The HEVC inter prediction involves three different partition modes: square motion partition, symmetric motion partition (SMP), and asymmetric motion partition (AMP), out of which the decisions on SMPs and AMPs are optimized in this paper. The key optimization techniques behind the proposed schemes are: 1) a conditional evaluation of the SMP modes; 2) range limitations primarily in the SMP sizes and secondarily in the AMP sizes; and 3) a selection of the SMP and AMP ranges as a function of the quantization parameter. These three techniques can be seamlessly incorporated in the existing control structures of the HEVC reference encoder without limiting its potential parallelization, hardware acceleration, or speed-up with other existing encoder optimizations. Our experiments show that the proposed schemes are able to cut the average complexity of the HEVC reference encoder by 31%-51% at a cost of 0.2%-1.3% bit rate increase under the random access coding configuration. The respective values under the low-delay B coding configuration are 32%-50% and 0.3%-1.3%.

International Symposium on Circuits and Systems | 2012

Complexity analysis of next-generation HEVC decoder

Marko Viitanen; Jarno Vanne; Timo D. Hämäläinen; Moncef Gabbouj; Jani Lainema

This paper analyzes the complexity of the HEVC video decoder being developed by the JCT-VC community. The HEVC reference decoder HM 3.1 is profiled with Intel VTune on an Intel Core 2 Duo processor. The analysis covers both Low Complexity (LC) and High Efficiency (HE) settings for resolutions varying from WQVGA (416 × 240 pixels) up to 1600p (2560 × 1600 pixels). The yielded cycle-accurate results are compared with the respective results of H.264/AVC Baseline Profile (BP) and High Profile (HiP) reference decoders. HEVC offers significant improvement in compression efficiency over H.264/AVC: the average BD-rate saving of LC is around 51% over BP whereas the BD-rate gain of HE is around 45% over HiP. However, the average decoding complexities of LC and HE are increased by 61% and 87% over BP and HiP, respectively. In LC, the most complex functions are motion compensation (MC) and loop filtering (LF), which account on average for 50% and 14% of the decoder complexity. The decoding complexity of the HE configuration is on average 42% higher than that of the LC configuration. The majority of the difference is caused by extra LF stages. In HE, the complexities of MC and LF are 37% and 32%, respectively. In practice, a standard 3 GHz dual core processor is expected to be able to decode 1080p HEVC content in real-time.

IEEE Transactions on Circuits and Systems for Video Technology | 2009

A Configurable Motion Estimation Architecture for Block-Matching Algorithms

Jarno Vanne; Eero Aho; Kimmo Kuusilinna; Timo D. Hämäläinen

This paper introduces a configurable motion estimation architecture for a wide range of fast block-matching algorithms (BMAs). Contemporary motion estimation architectures are either too rigid for multiple BMAs or the flexibility in them is implemented at the cost of reduced performance. The proposed architecture overcomes both of these limitations. The configurability of the proposed architecture is based on a new BMA framework that can be adjusted to support the desired set of BMAs. The chosen framework configuration is implemented by an intelligent control logic which is integrated with an efficient parallel memory system and distortion computation unit. The flexibility of the framework is demonstrated by mapping five different BMAs (BBGDS, DS, CDS, HEXBS, and TSS) to the architecture. The total execution time of the mapped BMAs is shown to be almost directly proportional to the number of tested checking points in the search area, so the architecture is very tolerant of different BMA-specific search strategies and search patterns. In addition, a run-time switching between supported BMAs can be done without performance compromises. With a 0.13-μm CMOS technology, the proposed architecture configured for HEXBS, BBGDS, and TSS requires only 14.2 kgates and 2.5 KB of memory at 200 MHz operating frequency. A performance comparison to the reference programmable architectures reveals that only the proposed implementation is able to process real-time (30 fps) fixed block-size motion estimation (1 reference frame) at full HDTV resolution (1920 × 1080).

International Symposium on Circuits and Systems | 2015

Kvazaar HEVC encoder for efficient intra coding

Marko Viitanen; Ari Koivula; Ari Lemmetti; Jarno Vanne; Timo D. Hämäläinen

This paper presents the open-source Kvazaar encoder for HEVC intra coding. This academic software encoder has been developed from scratch using C as an implementation language by prioritizing modularity, portability, and readability of the source code. Kvazaar implements almost the same intra coding functionality as the HEVC reference encoder (HM) but its rewritten source code makes it significantly faster. In all-intra (AI) coding, a single-threaded C implementation of Kvazaar is 2.3 times faster than HM at a cost of 1.7% bit rate increase. The respective values with a high-speed preset of Kvazaar are 10.6 times and 8.8%. Compared to a single-threaded C++ implementation of x265, Kvazaar improves rate-distortion performance and increases encoding speed in both high-quality and high-speed test cases. Kvazaar has a particular edge in the high-speed test case where it almost halves the BD-rate loss and more than doubles the performance.

IEEE Transactions on Circuits and Systems for Video Technology | 2008

A Parallel Memory System for Variable Block-Size Motion Estimation Algorithms

Jarno Vanne; Eero Aho; Timo D. Hämäläinen; Kimmo Kuusilinna

This paper proposes an efficient parallel memory system for algorithms applied in fixed and variable block-size motion estimation (VBSME). The proposed system is implemented by a novel combination of two parallel memory architectures. The distribution of data among the memory modules is modified over contemporary approaches and the optimized address computation unit enables a rapid address generation for accessed memory locations. Furthermore, the introduced data permutation scheme organizes data efficiently for storage and retrieval. The proposed system enables up to 4× speedup in data storage and retrieves data up to 55% faster for VBSME compared with the reference implementations. With a 0.18-μm CMOS technology, the proposed memory addressing and data permutation scheme can be clocked at 980 MHz operating frequency with a cost of less than 6 kgates. On FPGA, the system can operate at 200 MHz with less than 700 logic elements. The results show that the proposed system is applicable to real-time VBSME at HDTV resolution.

Signal Processing Systems | 2015

Parallelization of Kvazaar HEVC intra encoder for multi-core processors

Ari Koivula; Marko Viitanen; Jarno Vanne; Timo D. Hämäläinen; Laurent Fasnacht

This paper introduces key parallelization strategies of our Kvazaar HEVC intra encoder for multicore processors. The schemes implemented in Kvazaar are 1) tiles; 2) Wavefront Parallel Processing (WPP); and 3) picture-level parallel processing. Kvazaar is the only practical open-source HEVC encoder that supports all these schemes. In addition, its rate-distortion-complexity characteristics are superior to other public implementations in all-intra (AI) coding. Our experiments with high-quality encoder presets show that a C implementation of Kvazaar is 19% faster than the corresponding implementation of x265 for the same coding efficiency with 8 threads and 38% faster with 16 threads. With the high-speed presets, Kvazaar improves coding efficiency by 4.5% while being twice as fast as x265. The high-speed preset of Kvazaar obtains almost the same coding efficiency as the high-quality preset of f265 while being 24 times faster when 16 threads are used.

IEEE Transactions on Circuits and Systems | 2005

Block-level parallel processing for scaling evenly divisible images

Eero Aho; Jarno Vanne; Timo D. Hämäläinen; Kimmo Kuusilinna

Image scaling is a frequent operation in medical image processing. This paper presents how two-dimensional (2-D) image scaling can be accelerated with a new coarse-grained parallel processing method. The method is based on evenly divisible image sizes, which is, in practice, the case with most medical images. In the proposed method, the image is divided into slices and all the slices are scaled in parallel. The complexity of the method is examined with two parallel architectures while considering memory consumption and data throughput. Several scaling functions can be handled with these generic architectures including linear, cubic B-spline, cubic, Lagrange, Gaussian, and sinc interpolations. Parallelism can be adjusted independently of the complexity of the computational units. The most promising architecture is implemented as a simulation model and the hardware resources as well as the performance are evaluated. All the significant resources are shown to be linearly proportional to the parallelization factor. With contemporary programmable logic, real-time scaling is achievable with large resolution 2-D images and a good quality interpolation. The proposed block-level scaling is also shown to increase software scaling performance by more than a factor of four.
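The slice-based idea can be sketched in software as follows. This is only an illustration under stated assumptions: it swaps in a simple box-average downscale by an integer factor (not one of the interpolations listed in the abstract), uses a thread pool in place of parallel hardware, and the names `downscale_rows` and `downscale_in_slices` are hypothetical. It shows why evenly divisible sizes matter: when the image height divides evenly into slices and blocks, each slice can be scaled without touching its neighbours.

```python
from concurrent.futures import ThreadPoolExecutor


def downscale_rows(rows, factor):
    """Box-average downscale of a list of pixel rows by an integer factor."""
    out = []
    for r in range(0, len(rows), factor):
        out_row = []
        for c in range(0, len(rows[0]), factor):
            block = [rows[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def downscale_in_slices(image, factor, n_slices):
    """Split the image into horizontal slices and scale them concurrently."""
    height, width = len(image), len(image[0])
    # Evenly divisible sizes guarantee that no averaging block spans two slices.
    if height % (factor * n_slices) != 0 or width % factor != 0:
        raise ValueError("image size must divide evenly into slices and blocks")
    slice_h = height // n_slices
    slices = [image[i * slice_h:(i + 1) * slice_h] for i in range(n_slices)]
    with ThreadPoolExecutor(max_workers=n_slices) as pool:
        parts = list(pool.map(lambda s: downscale_rows(s, factor), slices))
    # Concatenate the scaled slices in their original order.
    return [row for part in parts for row in part]


# Usage: an 8x4 image scaled by 2 in two slices matches the unsliced result.
image = [[r * 4 + c for c in range(4)] for r in range(8)]
print(downscale_in_slices(image, 2, 2) == downscale_rows(image, 2))
```

In CPython the thread pool mainly demonstrates the decomposition rather than a real speedup; interpolation kernels with wider support would additionally need overlapping border pixels per slice.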

International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2006

Parallel Memory Implementation for Arbitrary Stride Accesses

Eero Aho; Jarno Vanne; Timo D. Hämäläinen

Parallel memory modules can be used to increase memory bandwidth and feed a processor with only necessary data. Arbitrary stride access capability with interleaved memories is described in previous research where the skewing scheme is changed at run time according to the currently used stride. This paper presents the improved schemes which are adapted to parallel memories. The proposed novel parallel memory implementation allows conflict-free accesses with all the constant strides, which has not been possible in prior application-specific parallel memories. Moreover, the possible access locations are unrestricted and the data patterns contain as many accessed data elements as there are memory modules. Timing and area estimates are given for Altera Stratix FPGA and 0.18 micrometer CMOS process with memory module count from 2 to 32. The FPGA results show 129 MHz clock frequency for a system with 16 memory modules when read and write latencies are 3 and 2 clock cycles, respectively.
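To illustrate why the module mapping has to depend on the stride, the sketch below compares plain low-order interleaving with a simplified stride-dependent mapping. It is not the scheme proposed in the paper: it assumes a power-of-two module count and a positive constant stride, and the function names are hypothetical. Writing the stride as an odd number times 2^x, the mapping selects the module from the address bits above bit x, which makes any N consecutive strided accesses land in N distinct modules.

```python
def module_plain(addr, n_modules):
    """Low-order interleaving: the module is the address modulo N."""
    return addr % n_modules


def module_for_stride(addr, n_modules, stride):
    """Simplified stride-dependent mapping (assumes N is a power of two).

    x is the number of trailing zero bits of the stride; dropping those
    bits before taking the modulo spreads strided accesses over all modules.
    """
    x = (stride & -stride).bit_length() - 1
    return (addr >> x) % n_modules


def is_conflict_free(start, stride, n_modules, module_fn):
    """True if N accesses start, start+stride, ... hit N distinct modules."""
    modules = {module_fn(start + i * stride) for i in range(n_modules)}
    return len(modules) == n_modules


# Usage: stride 4 over 8 modules conflicts with plain interleaving,
# but not with the stride-dependent mapping.
n = 8
print(is_conflict_free(0, 4, n, lambda a: module_plain(a, n)))
print(is_conflict_free(0, 4, n, lambda a: module_for_stride(a, n, 4)))
```

A complete memory system would also need a matching in-module row address and a way to handle data stored under a different stride, which this sketch leaves out.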
