Kimmo Kuusilinna | Researchain

Publication

Featured research published by Kimmo Kuusilinna.

ACM Transactions on Embedded Computing Systems | 2006

UML-based multiprocessor SoC design framework

Tero Kangas; Petri Kukkala; Heikki Orsila; Erno Salminen; Marko Hännikäinen; Timo D. Hämäläinen; Jouni Riihimäki; Kimmo Kuusilinna

This paper describes a complete design flow for multiprocessor systems-on-chips (SoCs) covering the design phases from system-level modeling to FPGA prototyping. The design of complex heterogeneous systems is enabled by raising the abstraction level and providing several system-level design automation tools. The system is modeled in a UML design environment following a new UML profile that specifies the practices for orthogonal application and architecture modeling. The design flow tools are governed in a single framework that combines the subtools into a seamless flow and visualizes the design process. Novel features also include an automated architecture exploration based on the system models in UML, as well as the automatic back and forward annotation of information in the design flow. The architecture exploration is based on the global optimization of systems that are composed of subsystems, which are then locally optimized for their particular purposes. As a result, the design flow produces an optimized component allocation, task mapping, and scheduling for the described application. In addition, it implements the entire system on an FPGA prototyping board. As a case study, the design flow is utilized in the integration of state-of-the-art technology approaches, including a wireless terminal architecture, a network-on-chip, and multiprocessing utilizing an RTOS in an SoC. In this study, a central part of a WLAN terminal is modeled, verified, optimized, and prototyped with the presented framework.

IEEE Transactions on Circuits and Systems for Video Technology | 2006

A High-Performance Sum of Absolute Difference Implementation for Motion Estimation

Jarno Vanne; Eero Aho; Timo D. Hämäläinen; Kimmo Kuusilinna

This paper presents a high-performance sum of absolute difference (SAD) architecture for motion estimation, which is the most time-consuming and compute-intensive part of video coding. The proposed architecture contains novel and efficient optimizations to overcome bottlenecks discovered in existing approaches. In addition, sophisticated control logic with multiple early termination mechanisms further enhances execution speed and makes the architecture suitable for general-purpose usage. Hence, the proposed architecture is not restricted to a single block-matching algorithm in motion estimation, but a wide range of algorithms is supported. The proposed SAD architecture outperforms contemporary architectures in terms of execution speed and area efficiency. The proposed architecture with three pipeline stages, synthesized to a 0.18-μm CMOS technology, can attain 770-MHz operating frequency at a cost of less than 5600 gates. Correspondingly, performance metrics for the proposed low-latency 2-stage architecture are 730 MHz and 7500 gates.
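As a rough software illustration of the SAD measure and of early termination, the sketch below accumulates the block SAD row by row and gives up once the partial sum can no longer beat the best candidate seen so far. The function name `sad_early_exit` and the row-wise check are assumptions made for illustration; the abstract does not describe the paper's actual hardware termination mechanisms.

```python
def sad_early_exit(cur, ref, best_so_far=float("inf")):
    """Sum of absolute differences between two equally sized blocks.

    cur, ref: lists of rows (lists of ints).
    Returns the SAD, or None when the partial sum has already reached
    best_so_far, meaning this candidate cannot become the new best match.
    """
    total = 0
    for cur_row, ref_row in zip(cur, ref):
        # Accumulate one row of absolute pixel differences.
        total += sum(abs(c - r) for c, r in zip(cur_row, ref_row))
        # Hypothetical early-termination rule: stop as soon as we cannot win.
        if total >= best_so_far:
            return None
    return total


cur_block = [[1, 2], [3, 4]]
ref_block = [[1, 0], [0, 4]]
print(sad_early_exit(cur_block, ref_block))     # 5
print(sad_early_exit(cur_block, ref_block, 3))  # None
```

A block-matching search would call this once per candidate position, passing the lowest SAD found so far as `best_so_far`.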

International Symposium on Circuits and Systems | 2002

Overview of bus-based system-on-chip interconnections

Erno Salminen; Vesa Lahtinen; Kimmo Kuusilinna; Timo D. Hämäläinen

This paper introduces the basic properties of bus-based interconnections for System-on-Chip (SoC) designs, such as structure, transfer properties, and arbitration. The overview shows that contemporary SoC buses differ only in minor details. As a result, practically every studied interconnection method could rather easily conform to a common interface. Such an interface would enhance design re-use and make system design easier. However, due to their similarity, the choice between buses is not a straightforward task.

Rapid System Prototyping | 2003

Rapid design and analysis of communication systems using the BEE hardware emulation environment

Chen Chang; Kimmo Kuusilinna; Brian C. Richards; Allen Chen; Nathan Chan; Robert W. Brodersen; Borivoje Nikolic

This paper describes the early analysis and estimation features currently implemented in the Berkeley Emulation Engine (BEE) system. BEE is an integrated rapid prototyping and design environment for communication and digital signal processing (DSP) systems, consisting of four multi-FPGA based processing units, each capable of emulating 10 million ASIC (application specific integrated circuits) equivalent gates at an overall system clock rate up to 60 MHz. This translates to over 600 billion 16-bit additions (operations) per second on one unit. An integrated software design flow enables the users to specify the design using a data-flow diagram, then automatically generates both the FPGA implementation for real-time rapid prototyping and a cycle-accurate, bit-true, and functionally equivalent ASIC implementation. For system-level design, the BEE hardware and software support rapid design turn-around and early performance analysis, without full synthesis or hardware mapping, from the high-level design entry. A case study detailing a turbo-decoder explains how the processing capability of the emulator can be utilized to verify a design using one billion input vectors with a speed-up factor exceeding 10^6 over equivalent software simulation methods.
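The throughput figure in this abstract can be sanity-checked with simple arithmetic. The sketch below treats the quoted numbers as exact, which is an assumption (the abstract says "up to 60 MHz" and "over 600 billion"), and the per-addition gate budget is only a derived ratio, not a figure from the paper.

```python
# Figures quoted in the abstract, treated as exact for illustration only.
CLOCK_HZ = 60_000_000              # "up to 60 MHz" system clock
ADDS_PER_SECOND = 600_000_000_000  # "over 600 billion" 16-bit additions/s
ASIC_GATES = 10_000_000            # "10 million" ASIC-equivalent gates

# Additions that must complete in parallel on every clock cycle.
adds_per_cycle = ADDS_PER_SECOND // CLOCK_HZ

# Derived ratio: gate budget available per concurrent 16-bit addition.
gates_per_adder = ASIC_GATES // adds_per_cycle

print(adds_per_cycle)   # 10000
print(gates_per_adder)  # 1000
```

Under these assumptions, the quoted rate corresponds to roughly ten thousand 16-bit additions in flight per cycle on one processing unit.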

Field Programmable Gate Arrays | 2003

Implementation of BEE: a real-time large-scale hardware emulation engine

Chen Chang; Kimmo Kuusilinna; Brian C. Richards; Robert W. Brodersen

This paper describes the hardware implementation of a real-time, large-scale, multi-chip FPGA (Field Programmable Gate Array) based emulation engine with a capacity of 10 million ASIC (Application Specific Integrated Circuits) equivalent gates. Attainable system operation frequency can exceed 60 MHz, and the system throughput has been empirically verified to achieve 600 billion 16-bit additions per second. The emulator is custom designed to maximize the performance and resource utilization for a range of telecommunication and digital signal processing applications. With its high-speed interconnect architecture and large external I/O bandwidth, the emulator excels in prototyping real-time systems that have strict timing, logic capacity, and data rate requirements. Our development efforts are guided by such ongoing projects as ultra-wide band (UWB) and multi-channel-multi-antenna (MCMA) radio systems research.

Signal Processing Systems | 2006

HIBI Communication Network for System-on-Chip

Erno Salminen; Tero Kangas; Timo D. Hämäläinen; Jouni Riihimäki; Vesa Lahtinen; Kimmo Kuusilinna

This paper presents a communication network targeted for complex system-on-chip (SoC) and network-on-chip (NoC) designs. The Heterogeneous IP Block Interconnection (HIBI) aims at maximum efficiency and minimum energy per transmitted bit combined with quality-of-service (QoS) in transfers. Other features include support for hierarchical topologies with several clock domains, flexible scalability, and runtime reconfiguration of network parameters. HIBI is intended for integrating coarse-grain components such as intellectual property (IP) blocks that have a size of thousands of gates. HIBI has been implemented in VHDL and SystemC and synthesized on several CMOS technologies and on FPGA. A 32-bit wrapper requires 5400 gates and runs at 315 MHz on 0.18 μm technology, which shows that only minimal area overhead is paid for the advanced features. The area and frequency results are well comparable to other NoC proposals. Furthermore, data transfers are shown to approach the maximum theoretical performance for protocol efficiency. The HIBI network is accompanied by a design framework with tools for optimizing the system through automated design space exploration.

IEEE Transactions on Circuits and Systems for Video Technology | 2009

A Configurable Motion Estimation Architecture for Block-Matching Algorithms

Jarno Vanne; Eero Aho; Kimmo Kuusilinna; Timo D. Hämäläinen

This paper introduces a configurable motion estimation architecture for a wide range of fast block-matching algorithms (BMAs). Contemporary motion estimation architectures are either too rigid for multiple BMAs or the flexibility in them is implemented at the cost of reduced performance. The proposed architecture overcomes both of these limitations. The configurability of the proposed architecture is based on a new BMA framework that can be adjusted to support the desired set of BMAs. The chosen framework configuration is implemented by an intelligent control logic which is integrated to an efficient parallel memory system and distortion computation unit. The flexibility of the framework is demonstrated by mapping five different BMAs (BBGDS, DS, CDS, HEXBS, and TSS) to the architecture. The total execution time of the mapped BMAs is shown to be almost directly proportional to the number of tested checking points in the search area, so the architecture is very tolerant of different BMA-specific search strategies and search patterns. In addition, a run-time switching between supported BMAs can be done without performance compromises. With a 0.13-μm CMOS technology, the proposed architecture configured for HEXBS, BBGDS, and TSS requires only 14.2 kgates and 2.5 KB of memory at 200 MHz operating frequency. A performance comparison to the reference programmable architectures reveals that only the proposed implementation is able to process real-time (30 fps) fixed block-size motion estimation (1 reference frame) at full HDTV resolution (1920 × 1080).
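To make the notion of "checking points" concrete, here is a minimal software sketch of the three-step search (TSS), one of the BMAs named in the abstract. It is a textbook-style formulation, not the paper's hardware mapping; the helper names `block_sad` and `three_step_search`, the initial step of 4, and the boundary handling are assumptions for illustration.

```python
def block_sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block at (bx, by) in cur and the block
    displaced by (dx, dy) in ref."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def three_step_search(cur, ref, bx, by, n, step=4):
    """Return (motion_vector, checked_points) for one block using TSS."""
    h, w = len(ref), len(ref[0])
    best_mv = (0, 0)
    best_cost = block_sad(cur, ref, bx, by, 0, 0, n)
    checked = 1  # the search centre counts as the first checking point
    while step >= 1:
        cx, cy = best_mv  # centre of this step's 3 x 3 pattern
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue  # centre was evaluated in an earlier step
                mx, my = cx + dx, cy + dy
                # Skip candidates that fall outside the reference frame.
                if bx + mx < 0 or by + my < 0:
                    continue
                if bx + mx + n > w or by + my + n > h:
                    continue
                cost = block_sad(cur, ref, bx, by, mx, my, n)
                checked += 1
                if cost < best_cost:
                    best_cost, best_mv = cost, (mx, my)
        step //= 2  # halve the pattern size: 4, 2, 1
    return best_mv, checked


# Synthetic frames: the current frame is the reference shifted by 4 pixels.
ref_frame = [[(7 * x + 13 * y) % 31 for x in range(24)] for y in range(24)]
cur_frame = [[(7 * (x + 4) + 13 * y) % 31 for x in range(24)] for y in range(24)]
print(three_step_search(cur_frame, ref_frame, 8, 8, 4))  # ((4, 0), 25)
```

The second return value is the number of checking points visited, which is the quantity the abstract relates to total execution time.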

International Conference / Workshop on Embedded Computer Systems: Architectures, Modeling and Simulation | 2004

HIBI v.2 Communication Network for System-on-Chip

Erno Salminen; Vesa Lahtinen; Tero Kangas; Jouni Riihimäki; Kimmo Kuusilinna; Timo D. Hämäläinen

This paper presents a communication network targeted for complex system-on-chip (SoC) and network-on-chip (NoC) designs. The Heterogeneous IP Block Interconnection v.2 (HIBI) aims at maximum efficiency and energy saving per transmitted bit combined with guaranteed quality-of-service (QoS) in transfers. Other features include support for arbitrary topologies with several clock domains, flexible scalability in signalling, and run-time reconfiguration of network parameters. HIBI has been implemented in VHDL and SystemC and synthesized in 0.18 μm CMOS technology with area comparable to other NoC wrappers. HIBI data transfers are shown to approach the maximum theoretical performance for protocol efficiency.

IEEE Transactions on Circuits and Systems for Video Technology | 2008

A Parallel Memory System for Variable Block-Size Motion Estimation Algorithms

Jarno Vanne; Eero Aho; Timo D. Hämäläinen; Kimmo Kuusilinna

This paper proposes an efficient parallel memory system for algorithms applied in fixed and variable block-size motion estimation (VBSME). The proposed system is implemented by a novel combination of two parallel memory architectures. The distribution of data among the memory modules is modified over contemporary approaches and the optimized address computation unit enables a rapid address generation for accessed memory locations. Furthermore, the introduced data permutation scheme organizes data efficiently for storage and retrieval. The proposed system enables up to 4× speedup in data storage and retrieves data up to 55% faster for VBSME compared with the reference implementations. With a 0.18-μm CMOS technology, the proposed memory addressing and data permutation scheme can be clocked at 980 MHz operating frequency with a cost of less than 6 kgates. On FPGA, the system can operate at 200 MHz with less than 700 logic elements. The results show that the proposed system is applicable to real-time VBSME at HDTV resolution.

Journal of Systems Architecture | 2007

Benchmarking mesh and hierarchical bus networks in system-on-chip context

Erno Salminen; Tero Kangas; Vesa Lahtinen; Jouni Riihimäki; Kimmo Kuusilinna; Timo D. Hämäläinen

The performance and area of a System-on-Chip depend on the utilized communication method. This paper presents a simulation-based comparison of generic, synthesizable single bus, hierarchical bus, and 2-dimensional mesh on-chip networks. Performance of the network depends heavily on the application and therefore six test cases with multiple parameter values are used. Furthermore, two versions of each network topology are compared. The results show that the hierarchical bus scales well to a large number of agents and offers a good performance and area trade-off, although it has smaller aggregate bandwidth and area than the mesh. The hierarchical HIBI bus achieves runtimes comparable to a 2-dimensional cut-through mesh with about 50% smaller network logic. However, depending on the test case, the runtime can be reduced by 20-50% when wider bus links are utilized.