Publication


Featured research published by Ari Kulmala.


digital systems design | 2007

On network-on-chip comparison

Erno Salminen; Ari Kulmala; Timo D. Hämäläinen

This paper presents the state of the art in network-on-chip (NoC) benchmarking and comparison. The study identifies the mainstream approaches and how NoCs are currently evaluated, and shows which aspects have been covered and which need more research effort. No single article can cover all aspects; therefore, the ability to compare results from various sources must be ensured by proper scientific reporting. Basic guidelines for achieving that are given.


international symposium on circuits and systems | 2005

HIBI-based multiprocessor SoC on FPGA

Erno Salminen; Ari Kulmala; Timo D. Hämäläinen

An FPGA offers an excellent platform for a system-on-chip consisting of intellectual property (IP) blocks. The problem is that IP blocks and their interconnections are often FPGA vendor dependent. Our HIBI (heterogeneous IP block interconnection) network-on-chip (NoC) scheme solves the problem by providing a flexible interconnection network and IP block integration with an open core protocol (OCP) interface. Therefore, IP components can be of any type: processors, hardware accelerators, communication interfaces, or memories. As a proof of concept, a multiprocessor system with eight soft processor cores and HIBI is prototyped on FPGA. The whole system uses 36,402 logic elements and 2.9 Mbits of RAM, and operates at 78 MHz on the Altera Stratix 1S40, which is comparable to other FPGA multiprocessors. The most important benefit is a significant reduction in design effort compared to system-specific interconnection networks. HIBI also presents the first OCP-compliant IP block integration on FPGA.


field-programmable logic and applications | 2005

A parallel MPEG-4 encoder for FPGA based multiprocessor SoC

Olli Lehtoranta; Erno Salminen; Ari Kulmala; Marko Hännikäinen; Timo D. Hämäläinen

A parallel MPEG-4 simple profile encoder for an FPGA-based multiprocessor system-on-chip (SoC) is presented. The goal is a computationally scalable framework that is independent of the platform. Scalability is achieved by spatial parallelization, where images are divided into horizontal slices. Slice coding tasks are mapped to a multiprocessor consisting of four soft cores arranged in a master-slave configuration. A shared memory model is also adopted: large images are stored in shared external memory, while small on-chip buffers are used for processing. The interconnections between memories and processors are realized with our HIBI network. Our main contributions are the scalable encoder framework and methods for coping with the limited memory of the FPGA. The current software-only implementation processes 6 QCIF frames/s with three encoding slaves. In practice, speed-ups of 1.7 and 2.3 have been measured with two and three slaves, respectively. The FPGA utilization of the current implementation is 59%, requiring 24 207 logic elements on an Altera Stratix EP1S40.
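The spatial parallelization described in this abstract can be illustrated with a minimal sketch. It assumes that slice boundaries fall on macroblock rows (16 luma lines each, so a 176x144 QCIF frame has 9 rows) and that rows are spread as evenly as possible over the encoding slaves. The function name `split_into_slices` and the balancing rule are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

QCIF_HEIGHT = 144   # luma lines in a QCIF frame (176x144)
MB_SIZE = 16        # macroblock height in luma lines


def split_into_slices(mb_rows: int, n_slices: int) -> List[Tuple[int, int]]:
    """Split mb_rows macroblock rows into n_slices contiguous horizontal slices.

    Returns half-open (start_row, end_row) ranges. Slice heights differ by at
    most one row; the earlier slices receive the extra rows.
    """
    if n_slices < 1 or n_slices > mb_rows:
        raise ValueError("need between 1 and mb_rows slices")
    base, extra = divmod(mb_rows, n_slices)
    slices = []
    start = 0
    for i in range(n_slices):
        height = base + (1 if i < extra else 0)
        slices.append((start, start + height))
        start += height
    return slices


if __name__ == "__main__":
    rows = QCIF_HEIGHT // MB_SIZE  # 9 macroblock rows
    for n in (1, 2, 3):
        print(n, "slave(s):", split_into_slices(rows, n))
```

With two slaves the nine rows cannot be split evenly, so one slave gets five rows and the other four. Uneven splits of this kind are one possible source of sub-linear speed-up, although the abstract does not attribute its measured speed-ups to any specific cause.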


Eurasip Journal on Embedded Systems | 2006

Scalable MPEG-4 encoder on FPGA multiprocessor SOC

Ari Kulmala; Olli Lehtoranta; Timo D. Hämäläinen; Marko Hännikäinen

High computational requirements combined with rapidly evolving video coding algorithms and standards are a great challenge for contemporary encoder implementations. Rapid specification changes favor full programmability and configurability for both software and hardware. This paper presents a novel scalable MPEG-4 video encoder on an FPGA-based multiprocessor system-on-chip (MPSOC). The MPSOC architecture is truly scalable and is based on a vendor-independent intellectual property (IP) block interconnection network. The scalability in video encoding is achieved by spatial parallelization, where images are divided into horizontal slices. A case design is presented with up to four synthesized processors on an Altera Stratix 1S40 device. A truly portable ANSI-C implementation that supports an arbitrary number of processors gives 11 QCIF frames/s at 50 MHz without processor-specific optimizations. The parallelization efficiency is 97% for two processors and 93% with three. The FPGA utilization is 70%, requiring 28 797 logic elements. The implementation effort is significantly lower compared to traditional multiprocessor implementations.
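To relate the efficiency figures above to speed-up, the following sketch assumes the conventional definition, efficiency = speed-up / number of processors. The abstract does not state which definition the authors used, so the conversion is illustrative only, and both function names are hypothetical.

```python
def speedup_from_efficiency(efficiency: float, n_processors: int) -> float:
    """Speed-up implied by a parallelization efficiency, assuming E = S / N."""
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    return efficiency * n_processors


def efficiency_from_speedup(speedup: float, n_processors: int) -> float:
    """Parallelization efficiency implied by a speed-up, assuming E = S / N."""
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    return speedup / n_processors


if __name__ == "__main__":
    # Efficiency figures quoted in the abstract: 97% (two), 93% (three).
    for n, eff in ((2, 0.97), (3, 0.93)):
        s = speedup_from_efficiency(eff, n)
        print(f"{n} processors: efficiency {eff:.0%} -> speed-up {s:.2f}")
```

Under that assumption, the quoted efficiencies correspond to speed-ups of about 1.94 with two processors and 2.79 with three.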


international conference on embedded computer systems architectures modeling and simulation | 2007

Evaluating large system-on-chip on multi-FPGA platform

Ari Kulmala; Erno Salminen; Timo D. Hämäläinen

This paper presents a configurable base architecture that can be tailored for different applications. It offers a simple and rapid way to evaluate and prototype large Multi-Processor System-on-Chip architectures on multiple FPGAs, with support for the Globally Asynchronous Locally Synchronous scheme. It allows early hardware/software co-verification and optimization. The architecture abstracts the underlying hardware details from the processors so that knowledge about the exact locations of individual components is not required for communication. The implemented example architecture contains 58 IP blocks, including 35 Nios II soft processors. As a proof of concept, an MPEG-4 video encoder is run on the example architecture.


design and diagnostics of electronic circuits and systems | 2007

Instruction Memory Architecture Evaluation on Multiprocessor FPGA MPEG-4 Encoder

Ari Kulmala; Erno Salminen; Timo D. Hämäläinen

Memory is a significant performance-limiting factor in multiprocessor systems, especially when shared. In FPGAs, the amount of memory on the device is fixed, and thus optimal memory usage is essential. This paper analyses how the fixed amount of memory should be divided between instruction memories and instruction caches in multiprocessor systems, and how that division should be traded off against the number of processors. The measurements are done with an SPMD (Single Program Multiple Data) multiprocessor system of up to 14 soft-core processors running an MPEG-4 video encoder on FPGA. The instruction memory count ranges from one to seven. It is shown that the traditional distributed memory architecture is outperformed by shared instruction memories with sufficient cache sizes. The number of processors is in general the most significant single factor once a sufficient cache size is reached. The best performance was obtained with only one shared instruction memory, 8 KB cache and 13 processors.


field-programmable logic and applications | 2006

Reliable GALS Implementation of MPEG-4 Encoder with Mixed Clock FIFO on Standard FPGA

Ari Kulmala; Timo D. Hämäläinen; Marko Hännikäinen

Globally asynchronous locally synchronous (GALS) is a paradigm for complexity management and re-use of large system-on-chip (SoC) architectures. GALS is most often based on specific ASIC design components or special FPGA platforms with custom development tools. In this paper we present a multiprocessor GALS implementation on a standard commercial FPGA with standard development tools. The key building block is a novel, reliable RTL mixed clock FIFO. A complete MPEG-4 video encoder with four processors is implemented as a proof of concept. The area overhead compared to a fully synchronous design is shown to be only 2% and the performance overhead is 3%. This is negligible compared to the benefits, which are much better flexibility, ASIC or FPGA vendor independence, and reduced design time. Furthermore, the mixed-clock interfaces allow easy re-usability, since the RTL-level blocks do not need to be re-verified in design iterations.


norchip | 2006

Distributed Bus Arbitration Algorithm Comparison on FPGA Based MPEG-4 Multiprocessor SoC

Ari Kulmala; Erno Salminen; Timo D. Hämäläinen

Communication is predicted to overtake computation as the limiting factor of performance in complex digital circuits. The most common communication medium is a shared bus. Contemporary buses have evolved as the requirements for communication have increased. The new properties of the buses also affect the arbitration schemes. In this paper, the authors present a study on distributed arbitration with an advanced on-chip bus, HIBI. An MPEG-4 video encoder is used as a test case. The compared arbitration algorithms are round-robin, priority, their combination, and random, all with varying parameters. They are compared at bus utilizations ranging from 3% to 75% and with limited transfer length. Results show that the arbitration algorithm may account for up to a 60% increase in performance and that different transfer lengths may increase the performance by 350%.
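Two of the arbitration policies named in this abstract, fixed priority and round-robin, can be contrasted with a small behavioral sketch. This is a generic, centralized textbook model and not HIBI's distributed arbitration logic. The function names, and the choice that a lower index means higher priority, are assumptions made for illustration.

```python
from typing import List, Optional


def priority_grant(requests: List[bool]) -> Optional[int]:
    """Fixed priority: grant the requesting agent with the lowest index."""
    for agent, requesting in enumerate(requests):
        if requesting:
            return agent
    return None


def round_robin_grant(requests: List[bool], last_granted: int) -> Optional[int]:
    """Round-robin: search for a requester starting after the last grantee."""
    n = len(requests)
    for offset in range(1, n + 1):
        agent = (last_granted + offset) % n
        if requests[agent]:
            return agent
    return None


if __name__ == "__main__":
    # Agents 0, 2 and 3 request the bus on every cycle; agent 1 stays idle.
    requests = [True, False, True, True]

    fixed = [priority_grant(requests) for _ in range(4)]

    rotating = []
    last = len(requests) - 1  # so that the first search starts at agent 0
    for _ in range(4):
        last = round_robin_grant(requests, last)
        rotating.append(last)

    print("priority   :", fixed)     # agent 0 wins every cycle
    print("round-robin:", rotating)  # grants rotate over agents 0, 2, 3
```

Under persistent requests, the fixed-priority model starves agents 2 and 3, whereas round-robin shares the bus among all requesters. This trade-off is one reason to consider combined schemes such as the one the abstract lists.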


digital systems design | 2006

Comparison of GALS and Synchronous Architectures with MPEG-4 Video Encoder on Multiprocessor System-on-Chip FPGA

Ari Kulmala; Timo D. Hämäläinen; Marko Hännikäinen

In large system-on-chip (SoC) architectures, balancing the clock network is increasingly difficult. Globally asynchronous locally synchronous (GALS) removes the need for a global clock net, and also provides efficient means for managing complexity and re-use in large architectures. However, quantitative comparisons of GALS against similar synchronous structures are rare for full SoC architectures. In this paper, we compare our SoC GALS architectures to a synchronous architecture with a fully functional MPEG-4 video encoder on FPGA. The results show that the area and performance overhead of GALS is only 1%. That is negligible compared to the benefits of the GALS architecture, such as multiple clock frequencies for intellectual property (IP) blocks, dynamic frequency/voltage scaling, clock tree removal, and re-usability. Our architecture does not require modifications to the IP blocks already used with synchronous architectures, providing an ideal solution for a rapid switch to a GALS architecture.


design and diagnostics of electronic circuits and systems | 2006

Impact of Shared Instruction Memory on Performance of FPGA-based MP-SoC Video Encoder

Ari Kulmala; Erno Salminen; Olli Lehtoranta; Timo D. Hämäläinen; Marko Hännikäinen

The impact of shared instruction memory on performance is measured and analyzed for an FPGA-based multiprocessor system-on-chip (MP-SoC) with an MPEG-4 video encoding application. Our MP-SoC architecture allows arbitrary scaling of the number of synthesized processors and includes a monitoring unit for memory transfers. Based on the measurements with up to four processors on an Altera Stratix 1S40, an estimate of the effect of the shared memory for larger configurations is presented. The shared instruction memory is shown to be area-efficient and sufficient in performance for configurations of up to five processors, as the drop in encoded video frame rate stays below one compared to a distributed instruction memory organization.

Collaboration


Top Co-Authors

Timo D. Hämäläinen (Tampere University of Technology)
Erno Salminen (Tampere University of Technology)
Marko Hännikäinen (Tampere University of Technology)
Jarno Vanne (Tampere University of Technology)
Olli Lehtoranta (Tampere University of Technology)
Vili Viitamaki (Tampere University of Technology)
Antti Rasmus (Tampere University of Technology)
Arto Oinonen (Tampere University of Technology)
Panu Sjovall (Tampere University of Technology)
Marko Viitanen (Tampere University of Technology)