Ari Kulmala
Tampere University of Technology
Publication
Featured research published by Ari Kulmala.
digital systems design | 2007
Erno Salminen; Ari Kulmala; Timo D. Hämäläinen
This paper presents the state of the art in network-on-chip (NoC) benchmarking and comparison. The study identifies the mainstream approaches and how NoCs are currently evaluated, and shows which aspects have been covered and which need more research effort. No single article can cover all the aspects; therefore, the ability to compare results from various sources must be ensured by proper scientific reporting. Basic guidelines for achieving that are given.
international symposium on circuits and systems | 2005
Erno Salminen; Ari Kulmala; Timo D. Hämäläinen
An FPGA offers an excellent platform for a system-on-chip consisting of intellectual property (IP) blocks. The problem is that IP blocks and their interconnections are often FPGA-vendor dependent. Our HIBI (heterogeneous IP block interconnection) network-on-chip (NoC) scheme solves the problem by providing a flexible interconnection network and IP block integration with an open core protocol (OCP) interface. Therefore, IP components can be of any type: processors, hardware accelerators, communication interfaces, or memories. As a proof of concept, a multiprocessor system with eight soft processor cores and HIBI is prototyped on an FPGA. The whole system uses 36,402 logic elements and 2.9 Mbits of RAM, and operates at 78 MHz on the Altera Stratix 1S40, which is comparable to other FPGA multiprocessors. The most important benefit is a significant reduction of the design effort compared to system-specific interconnection networks. HIBI also presents the first OCP-compliant IP block integration on an FPGA.
field-programmable logic and applications | 2005
Olli Lehtoranta; Erno Salminen; Ari Kulmala; Marko Hännikäinen; Timo D. Hämäläinen
A parallel MPEG-4 simple profile encoder for an FPGA-based multiprocessor system-on-chip (SoC) is presented. The goal is a computationally scalable framework that is independent of the platform. The scalability is achieved by spatial parallelization, where images are divided into horizontal slices. Slice coding tasks are mapped to a multiprocessor consisting of four soft cores arranged in a master-slave configuration. A shared memory model is also adopted, in which large images are stored in shared external memory while small on-chip buffers are used for processing. The interconnections between memories and processors are realized with our HIBI network. Our main contributions are the scalable encoder framework and methods for coping with the limited memory of an FPGA. The current software-only implementation processes 6 QCIF frames/s with three encoding slaves. In practice, speed-ups of 1.7 and 2.3 have been measured with two and three slaves, respectively. The FPGA utilization of the current implementation is 59%, requiring 24 207 logic elements on an Altera Stratix EP1S40.
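As a rough illustration of the spatial parallelization described above, the following Python sketch divides the macroblock rows of a QCIF frame (176×144 pixels, which gives 9 rows of 16×16 macroblocks) into contiguous horizontal slices, one per encoding slave. The function name `split_into_slices` and the balancing rule are assumptions made for illustration; the abstract does not state how the encoder actually sizes its slices.

```python
# Hypothetical sketch: balanced horizontal slicing of macroblock rows.
# Not the authors' implementation; names and balancing rule are assumptions.

QCIF_WIDTH, QCIF_HEIGHT = 176, 144   # QCIF luma resolution in pixels
MB_SIZE = 16                         # MPEG-4 macroblock is 16x16 pixels


def split_into_slices(mb_rows: int, workers: int) -> list[range]:
    """Assign contiguous macroblock rows to workers as evenly as possible."""
    if workers < 1 or workers > mb_rows:
        raise ValueError("need between 1 and mb_rows workers")
    base, extra = divmod(mb_rows, workers)
    slices = []
    start = 0
    for w in range(workers):
        # The first `extra` workers take one additional row each.
        size = base + (1 if w < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices


mb_rows = QCIF_HEIGHT // MB_SIZE     # 9 macroblock rows per QCIF frame
for n_slaves in (1, 2, 3):
    rows = [list(s) for s in split_into_slices(mb_rows, n_slaves)]
    print(n_slaves, "slave(s):", rows)
```

With nine macroblock rows, three slaves receive three rows each, while two slaves receive five and four rows. Uneven slices of this kind are one possible reason a measured speed-up falls short of the number of slaves.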
Eurasip Journal on Embedded Systems | 2006
Ari Kulmala; Olli Lehtoranta; Timo D. Hämäläinen; Marko Hännikäinen
High computational requirements combined with rapidly evolving video coding algorithms and standards are a great challenge for contemporary encoder implementations. Rapid specification changes favor full programmability and configurability for both software and hardware. This paper presents a novel scalable MPEG-4 video encoder on an FPGA-based multiprocessor system-on-chip (MPSOC). The MPSOC architecture is truly scalable and is based on a vendor-independent intellectual property (IP) block interconnection network. The scalability in video encoding is achieved by spatial parallelization, where images are divided into horizontal slices. A case design is presented with up to four synthesized processors on an Altera Stratix 1S40 device. A truly portable ANSI-C implementation that supports an arbitrary number of processors gives 11 QCIF frames/s at 50 MHz without processor-specific optimizations. The parallelization efficiency is 97% for two processors and 93% for three. The FPGA utilization is 70%, requiring 28 797 logic elements. The implementation effort is significantly lower than for traditional multiprocessor implementations.
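The efficiency figures above can be related to speed-up with a short calculation. The sketch below assumes the common definition of parallelization efficiency as speed-up divided by the number of processors; the abstract does not spell out its definition, so treat this as an assumption. The helper names are hypothetical.

```python
# Hypothetical helpers; assumes efficiency = speed-up / number of processors.

def efficiency(speedup: float, processors: int) -> float:
    """Parallelization efficiency for a measured speed-up."""
    return speedup / processors


def implied_speedup(eff: float, processors: int) -> float:
    """Speed-up implied by a reported efficiency."""
    return eff * processors


# Efficiencies reported in the abstract: 97% (two processors), 93% (three).
for eff, n in ((0.97, 2), (0.93, 3)):
    print(f"{n} processors: implied speed-up ~ {implied_speedup(eff, n):.2f}")
```

Under that assumption, the reported efficiencies correspond to speed-ups of roughly 1.94 with two processors and 2.79 with three.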
international conference on embedded computer systems architectures modeling and simulation | 2007
Ari Kulmala; Erno Salminen; Timo D. Hämäläinen
This paper presents a configurable base architecture that can be tailored to different applications. It offers a simple and rapid way to evaluate and prototype large Multi-Processor System-on-Chip architectures on multiple FPGAs, with support for the Globally Asynchronous Locally Synchronous scheme. It allows early hardware/software co-verification and optimization. The architecture abstracts the underlying hardware details from the processors so that knowledge of the exact locations of individual components is not required for communication. The implemented example architecture contains 58 IP blocks, including 35 Nios II soft processors. As a proof of concept, an MPEG-4 video encoder is run on the example architecture.
design and diagnostics of electronic circuits and systems | 2007
Ari Kulmala; Erno Salminen; Timo D. Hämäläinen
Memory is a significant performance-limiting factor in multiprocessor systems, especially when shared. In FPGAs, the amount of memory on the device is fixed, and thus optimal memory usage is essential. This paper analyses how the fixed amount of memory should be divided between instruction memories and instruction caches in multiprocessor systems, and how that division trades off against the number of processors. The measurements are done with an SPMD (Single Program Multiple Data) multiprocessor system of up to 14 soft-core processors running an MPEG-4 video encoder on an FPGA. The instruction memory count ranges from one to seven. It is shown that the traditional distributed memory architecture is outperformed by shared instruction memories with sufficient cache sizes. The number of processors is in general the most significant single factor once a sufficient cache size is reached. The best performance was obtained with only one shared instruction memory, an 8 KB cache, and 13 processors.
field-programmable logic and applications | 2006
Ari Kulmala; Timo D. Hämäläinen; Marko Hännikäinen
Globally asynchronous locally synchronous (GALS) is a paradigm for complexity management and re-use of large system-on-chip (SoC) architectures. GALS is most often based on specific ASIC design components or special FPGA platforms with custom development tools. In this paper we present a multiprocessor GALS implementation on a standard commercial FPGA with standard development tools. The key building block is a novel, reliable RTL mixed-clock FIFO. A complete MPEG-4 video encoder with four processors is implemented as a proof of concept. The area overhead compared to a fully synchronous design is shown to be only 2%, and the performance overhead is 3%. This is negligible compared to the benefits, which are much better flexibility, ASIC or FPGA vendor independence, and reduced design time. Furthermore, the mixed-clock interfaces allow easy re-usability, since the RTL-level blocks do not need to be re-verified in design iterations.
norchip | 2006
Ari Kulmala; Erno Salminen; Timo D. Hämäläinen
Communication is predicted to overtake computation as the limiting factor in the performance of complex digital circuits. The most common communication medium is a shared bus. Contemporary buses have evolved as the requirements for communication have increased. The new properties of the buses also affect the arbitration schemes. In this paper, the authors present a study on distributed arbitration with an advanced on-chip bus, HIBI. An MPEG-4 video encoder is used as a test case. The compared arbitration algorithms are round-robin, priority, their combination, and random, all with varying parameters. They are compared at bus utilizations ranging from 3% to 75% and with limited transfer lengths. Results show that the arbitration algorithm may account for up to a 60% increase in performance, and that different transfer lengths may increase the performance by 350%.
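To make two of the compared policies concrete, the following Python sketch shows generic textbook forms of priority and round-robin arbitration over a set of bus agents. It is a centralized, simplified model written for illustration only; it is not HIBI's distributed arbitration logic, and the function names and the convention that a lower index means higher priority are assumptions.

```python
# Generic arbitration sketch; not HIBI's actual (distributed) implementation.
from typing import Optional, Sequence


def priority_grant(requests: Sequence[bool]) -> Optional[int]:
    """Grant the requesting agent with the lowest index (highest priority)."""
    for agent, requesting in enumerate(requests):
        if requesting:
            return agent
    return None  # nobody is requesting the bus


def round_robin_grant(requests: Sequence[bool], last_granted: int) -> Optional[int]:
    """Grant the next requesting agent after the one served most recently."""
    n = len(requests)
    for offset in range(1, n + 1):
        agent = (last_granted + offset) % n
        if requests[agent]:
            return agent
    return None  # nobody is requesting the bus


# Agents 0, 2 and 3 request the bus; agent 0 was served last.
requests = [True, False, True, True]
print("priority:   ", priority_grant(requests))
print("round-robin:", round_robin_grant(requests, last_granted=0))
```

In this example the priority policy keeps granting agent 0 for as long as it requests, while round-robin moves on to agent 2. That difference in fairness is the kind of behavior such a comparison exercises.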
digital systems design | 2006
Ari Kulmala; Timo D. Hämäläinen; Marko Hännikäinen
In large system-on-chip (SoC) architectures, balancing the clock network is increasingly difficult. Globally asynchronous locally synchronous (GALS) removes the need for a global clock net, and also provides efficient means for managing complexity and re-use in large architectures. However, quantitative comparisons of GALS against similar synchronous structures are rare for full SoC architectures. In this paper, we compare our SoC GALS architectures to a synchronous architecture with a fully functional MPEG-4 video encoder on an FPGA. The results show that the area and performance overhead of GALS is only 1%. That is negligible compared to the benefits of the GALS architecture, such as multiple clock frequencies for intellectual property (IP) blocks, dynamic frequency/voltage scaling, clock tree removal, and re-usability. Our architecture does not require modifications to the IP blocks already used with synchronous architectures, providing an ideal solution for a rapid switch to a GALS architecture.
design and diagnostics of electronic circuits and systems | 2006
Ari Kulmala; Erno Salminen; Olli Lehtoranta; Timo D. Hämäläinen; Marko Hännikäinen
The impact of shared instruction memory on performance is measured and analyzed for an FPGA-based multiprocessor system-on-chip (MP-SoC) with an MPEG-4 video encoding application. Our MP-SoC architecture allows arbitrary scaling of the number of synthesized processors and includes a monitoring unit for memory transfers. Based on measurements with up to four processors on an Altera Stratix 1S40, an estimate of the effect of the shared memory for larger configurations is presented. The shared instruction memory is shown to be area-efficient and sufficient in performance for configurations of up to five processors, as the drop in encoded video frame rate stays below one compared to a distributed instruction memory organization.