Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Pekka Jääskeläinen is active.

Publication


Featured research published by Pekka Jääskeläinen.


Field-Programmable Logic and Applications | 2010

Customized Exposed Datapath Soft-Core Design Flow with Compiler Support

Otto Esko; Pekka Jääskeläinen; Pablo Huerta; Carlos Sánchez de la Lama; Jarmo Takala; José Ignacio Martínez

A popular way to exploit high-level programming languages in FPGA designs is to use a soft-core with accompanying software development tools. However, a common shortcoming of current soft-core offerings is their limited software execution capability: the required performance can often be reached only with instruction set extensions. In this paper, we propose and evaluate an application-specific processor design toolset that uses a multi-issue exposed datapath processor architecture template. The main benefit of the architecture is scalability with respect to instruction-level parallelism (ILP). The design flow allows the designer to freely customize the datapath resources in the core to exploit the ILP available in computation-intensive kernels. The design toolset includes a retargetable C compiler and an architecture simulator, making design space exploration feasible. The experiments show that a relatively small soft-core tailored with the toolset provides significant speedups in software execution without using any instruction set extensions. The best measured speedup in comparison to the major commercial soft-cores was fourfold in applications from the CHStone benchmark suite, while the amount of consumed FPGA resources remained moderate.


International Journal of Parallel Programming | 2015

pocl: A Performance-Portable OpenCL Implementation

Pekka Jääskeläinen; Carlos S. de La Lama; Kalle Raiskila; Jarmo Takala; Heikki Berg

OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear: multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions, enabling support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths, and static multi-issue. Unlike previous similar techniques that work at the source level, the parallel region formation retains the data-parallelism information using the LLVM IR and its metadata infrastructure, so that later generic compiler passes can exploit it for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and those still under research. The paper describes how the portability of the implementation is achieved. We test the two aspects of portability by using the kernel compiler and the OpenCL implementation to run OpenCL applications on various platforms with different styles of parallel resources. The results show that most of the benchmarked applications, when compiled using pocl, were as fast as or close to the best proprietary OpenCL implementation for the platform at hand.
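
As a minimal illustration of the platform portability aspect (not code from the paper), the standard OpenCL host API can be used to enumerate whatever platforms and devices an implementation such as pocl exposes; the sketch below assumes OpenCL headers and some runtime are installed.

```c
/* Minimal sketch: enumerate OpenCL platforms and devices with the standard
 * host API. Illustrative only; assumes OpenCL headers and a runtime
 * (for example pocl) are installed. Compile with: cc list.c -lOpenCL */
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;

    if (clGetPlatformIDs(8, platforms, &num_platforms) != CL_SUCCESS)
        return 1;

    for (cl_uint p = 0; p < num_platforms && p < 8; ++p) {
        char name[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        printf("Platform %u: %s\n", p, name);

        cl_device_id devices[16];
        cl_uint num_devices = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                           16, devices, &num_devices) != CL_SUCCESS)
            continue;

        for (cl_uint d = 0; d < num_devices && d < 16; ++d) {
            char dev_name[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                            sizeof(dev_name), dev_name, NULL);
            printf("  Device %u: %s\n", d, dev_name);
        }
    }
    return 0;
}
```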


Signal Processing Systems | 2015

Embedded Multi-Core Systems Dedicated to Dynamic Dataflow Programs

Hervé Yviquel; Alexandre Sanchez; Pekka Jääskeläinen; Jarmo Takala; Mickaël Raulet; Emmanuel Casseau

Multimedia applications and embedded platforms are both becoming very complex in order to improve user experience. Thus, multimedia developers need high-level methods to automate time-consuming and error-prone tasks. Dynamic dataflow modeling is attractive for describing complex applications, such as video codecs, at a high level of abstraction. This paper presents a dataflow-based design approach to implement video codecs on embedded multi-core platforms. First, we introduce a custom architecture model to design low-power multi-core chips based on distributed memory and Transport-Triggered Architecture processor cores. Then, we describe software synthesis techniques to improve dynamic dataflow implementations. This methodology has been implemented in open-source tools and demonstrated on video decoders based on the MPEG-4 Visual standard and the new High Efficiency Video Coding standard. The simulations achieve real-time decoding (40 fps) of high-definition (720p) MPEG-4 Visual video sequences on a custom multi-core platform clocked at 1 GHz, an improvement of more than 100% over previously proposed implementations.
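
For readers unfamiliar with the model, a dynamic dataflow actor fires only when its firing rule is satisfied, i.e., when enough tokens are available on its input FIFOs. The toy sketch below illustrates this firing discipline; the actor, FIFO sizes and scheduler loop are hypothetical and unrelated to the open-source tools used in the paper.

```c
/* Toy sketch of a dynamic dataflow firing rule: an actor fires only when
 * enough tokens are available on its input FIFO. Purely illustrative;
 * names and structure are hypothetical, not taken from the paper's tools. */
#include <stdio.h>

#define FIFO_CAP 64

typedef struct {
    int data[FIFO_CAP];
    int head, tail, count;
} fifo_t;

static int fifo_push(fifo_t *f, int v)
{
    if (f->count == FIFO_CAP) return 0;
    f->data[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
    return 1;
}

static int fifo_pop(fifo_t *f)
{
    int v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return v;
}

/* "sum3" actor: consumes 3 tokens, produces their sum.
 * Returns 1 if it fired, 0 if the firing rule was not satisfied. */
static int sum3_fire(fifo_t *in, fifo_t *out)
{
    if (in->count < 3 || out->count == FIFO_CAP)
        return 0;                    /* not enough tokens or no output space */
    int s = fifo_pop(in) + fifo_pop(in) + fifo_pop(in);
    fifo_push(out, s);
    return 1;
}

int main(void)
{
    fifo_t in = {0}, out = {0};
    for (int i = 1; i <= 7; ++i)
        fifo_push(&in, i);

    /* A run-time scheduler would repeatedly test firing rules; here we loop. */
    while (sum3_fire(&in, &out))
        ;
    while (out.count)
        printf("token: %d\n", fifo_pop(&out));
    return 0;
}
```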


International Symposium on Parallel and Distributed Processing and Applications | 2013

Towards run-time actor mapping of dynamic dataflow programs onto multi-core platforms

Hervé Yviquel; Emmanuel Casseau; Mickaël Raulet; Pekka Jääskeläinen; Jarmo Takala

The emergence of massively parallel architectures, along with the need for new parallel programming models, has revived interest in dataflow programming due to its ability to express concurrency. Although dynamic dataflow programming can be considered a flexible approach for the development of scalable applications, there are still open problems concerning its execution. In this paper, we propose a low-cost methodology for mapping dynamic dataflow programs onto any multi-core platform. By translating the mapping into an equivalent graph partitioning problem, our approach finds good mapping solutions in a few milliseconds, which makes it feasible to run regularly at run time. Consequently, good load balancing over the targeted platform can be maintained even for such unpredictable applications. We conduct experiments on three MPEG video decoders, including one based on the new High Efficiency Video Coding standard. These dataflow-based video decoders are executed on two different platforms: a desktop multi-core processor, and an embedded platform composed of interconnected tiny Very Long Instruction Word (VLIW)-style processors. Our entire design flow is based on open-source tools. We study the influence of the number of processors on performance and show that our method reaches its maximum decoding rate with 16 processors.
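
To make the graph partitioning view concrete, the sketch below shows a simple greedy load-balancing heuristic that assigns actors, in decreasing order of workload, to the currently least-loaded core. It illustrates the general idea only; the actor names and workloads are invented, and this is not the partitioning algorithm evaluated in the paper.

```c
/* Rough sketch of a greedy load-balancing heuristic for mapping actors to
 * processors: sort actors by workload and place each on the least-loaded
 * core. Illustrative only; the workloads are made up and this is not the
 * partitioning algorithm used in the paper. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { const char *name; double load; } actor_t;

static int by_load_desc(const void *a, const void *b)
{
    double d = ((const actor_t *)b)->load - ((const actor_t *)a)->load;
    return (d > 0) - (d < 0);
}

int main(void)
{
    actor_t actors[] = {
        { "parser",  3.0 }, { "idct",    7.5 }, { "motion", 6.0 },
        { "dequant", 2.5 }, { "deblock", 5.0 }, { "merge",  1.5 },
    };
    const int n_actors = sizeof(actors) / sizeof(actors[0]);
    const int n_procs = 2;
    double proc_load[2] = { 0.0, 0.0 };

    qsort(actors, n_actors, sizeof(actor_t), by_load_desc);

    for (int i = 0; i < n_actors; ++i) {
        /* Pick the processor with the smallest accumulated load so far. */
        int best = 0;
        for (int p = 1; p < n_procs; ++p)
            if (proc_load[p] < proc_load[best])
                best = p;
        proc_load[best] += actors[i].load;
        printf("%-8s -> core %d\n", actors[i].name, best);
    }
    for (int p = 0; p < n_procs; ++p)
        printf("core %d load: %.1f\n", p, proc_load[p]);
    return 0;
}
```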


International Conference/Workshop on Embedded Computer Systems: Architectures, Modeling and Simulation | 2008

Impact of Software Bypassing on Instruction Level Parallelism and Register File Traffic

Vladimír Guzma; Pekka Jääskeläinen; Pertti Kellomäki; Jarmo Takala

Software bypassing is a technique that allows programmer-controlled direct transfer of computation results to the operands of data-dependent operations, possibly removing the need to store some values in general-purpose registers while reducing the number of reads from the register file. Software bypassing also improves instruction-level parallelism by reducing the number of false dependencies between operations caused by the reuse of registers. In this work, we show how software bypassing affects cycle count and reduces register file reads and writes. We analyze previous register file bypassing methods and compare them with our improved software bypassing implementation. In addition, we propose heuristics for when not to apply software bypassing, to retain scheduling freedom when selecting function units for operations. The results show that bypassing yields up to a 27% improvement in cycle count, as well as up to 48% fewer register reads and 45% fewer register writes.
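
Conceptually, a def-use pair is a bypass candidate when the consumer is scheduled close enough to the producer that the result can be moved directly between function units; if that was the value's only use, the register file write can be dropped as well. The sketch below walks through such a check; the data structures, bypass window and counts are hypothetical and do not reproduce the scheduler described here.

```c
/* Simplified sketch of deciding which def-use pairs can be software-
 * bypassed: the use must be scheduled close enough to the def, and if the
 * value has no other uses the register-file write can be eliminated too.
 * Hypothetical structures and window size; not the paper's scheduler. */
#include <stdio.h>

typedef struct {
    int def_cycle;   /* cycle in which the result is produced */
    int use_cycle;   /* cycle in which the consumer reads it  */
    int num_uses;    /* total uses of the value               */
} def_use_t;

#define BYPASS_WINDOW 2  /* assumed max distance for a direct FU-to-FU move */

int main(void)
{
    def_use_t pairs[] = {
        { 3, 4, 1 },   /* close, single use: bypass and drop RF write  */
        { 5, 6, 3 },   /* close, multiple uses: bypass, keep RF write  */
        { 7, 15, 1 },  /* too far apart: read from the register file   */
    };
    int saved_reads = 0, saved_writes = 0;

    for (unsigned i = 0; i < sizeof(pairs) / sizeof(pairs[0]); ++i) {
        def_use_t *p = &pairs[i];
        int bypass = (p->use_cycle - p->def_cycle) <= BYPASS_WINDOW;
        if (!bypass) {
            printf("pair %u: no bypass\n", i);
            continue;
        }
        saved_reads++;                        /* one RF read avoided       */
        if (p->num_uses == 1) saved_writes++; /* dead result eliminated    */
        printf("pair %u: bypass%s\n", i,
               p->num_uses == 1 ? " + RF write removed" : "");
    }
    printf("saved RF reads: %d, saved RF writes: %d\n",
           saved_reads, saved_writes);
    return 0;
}
```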


Compilers, Architecture, and Synthesis for Embedded Systems | 2014

Heuristics for greedy transport triggered architecture interconnect exploration

Timo Viitanen; Heikki Kultala; Pekka Jääskeläinen; Jarmo Takala

Most power dissipation in Very Long Instruction Word (VLIW) processors occurs in their large, multi-port register files. The Transport Triggered Architecture (TTA) is a VLIW variant whose exposed datapath reduces the need for RF accesses and ports. However, the comparative advantage of TTAs suffers in practice from a wide instruction word and a complex interconnection network (IC). We argue that these issues are at least partly due to suboptimal design choices. The design space of possible TTA architectures is very large, and previous automated and ad hoc design methods often produce inefficient architectures. We propose a reduced design space in which efficient TTAs can be generated in a short time using execution trace-driven greedy exploration. The proposed approach is evaluated by optimizing the equivalent of a 4-issue VLIW architecture. The algorithm finishes quickly and produces a processor with a 10% reduced core energy product compared to a fully connected TTA. Since the generated processor has low IC power and a shorter instruction word than a typical 4-issue VLIW, the results support the hypothesis that these drawbacks of TTAs can be worked around with efficient IC design.
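
The exploration can be pictured as a greedy pruning loop: start from a fully connected interconnect and repeatedly remove the connection whose removal improves an estimated cost the most, stopping when no removal helps. The skeleton below shows that loop shape; its cost function is a crude placeholder, whereas the paper drives the estimate with compiled execution traces.

```c
/* Skeleton of a greedy interconnect-pruning loop: repeatedly remove the
 * connection whose removal lowers an estimated cost the most. The cost
 * model below is a placeholder; the paper's trace-driven estimator is not
 * reproduced here. */
#include <stdio.h>

#define N_CONN 6

/* Placeholder cost: each enabled connection adds "wiring" cost, but an
 * assumed minimum number of connections is needed for schedulability. */
static double estimate_cost(const int enabled[N_CONN])
{
    int count = 0;
    for (int i = 0; i < N_CONN; ++i)
        count += enabled[i];
    if (count < 3)
        return 1e9;            /* too sparse: programs no longer schedule */
    return 10.0 * count;       /* crude proxy for IC energy               */
}

int main(void)
{
    int enabled[N_CONN] = { 1, 1, 1, 1, 1, 1 };  /* fully connected start */
    double best = estimate_cost(enabled);

    for (;;) {
        int best_removal = -1;
        double best_cost = best;
        for (int i = 0; i < N_CONN; ++i) {
            if (!enabled[i]) continue;
            enabled[i] = 0;                       /* tentatively remove   */
            double c = estimate_cost(enabled);
            if (c < best_cost) { best_cost = c; best_removal = i; }
            enabled[i] = 1;                       /* restore              */
        }
        if (best_removal < 0)
            break;                                /* no improving removal */
        enabled[best_removal] = 0;
        best = best_cost;
        printf("removed connection %d, cost now %.1f\n", best_removal, best);
    }
    return 0;
}
```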


International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation | 2013

Low-power application-specific FFT processor for LTE applications

Tomasz Patyk; David Guevorkian; Teemu Pitkänen; Pekka Jääskeläinen; Jarmo Takala

In this paper, we describe a processor architecture tailored to a mixed-radix 4/2/3 FFT algorithm. The proposed design supports all FFT sizes required by LTE applications, namely 128-2048 and 1536. The processor is based on the Transport Triggered Architecture and is customized with a set of function units designed specifically for the application at hand. The processor has been synthesized on an ASIC technology, and both energy efficiency and performance have been evaluated. The developed processor is programmable, yet shows energy efficiency comparable to fixed-function ASIC implementations.
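
The LTE FFT lengths are exactly those that decompose into factors of 4, 2 and 3, which is why a mixed-radix 4/2/3 datapath covers them all; the short check below illustrates the arithmetic (it is not derived from the processor implementation).

```c
/* Illustrative check that the LTE FFT sizes (128..2048 and 1536) decompose
 * into factors of 4, 3 and 2, which is what a mixed-radix 4/2/3 pipeline
 * relies on. Not derived from the paper's implementation. */
#include <stdio.h>

static void factorize(int n)
{
    printf("%4d =", n);
    while (n % 4 == 0) { printf(" 4"); n /= 4; }
    while (n % 3 == 0) { printf(" 3"); n /= 3; }
    while (n % 2 == 0) { printf(" 2"); n /= 2; }
    printf("%s\n", n == 1 ? "" : "  (leftover factor!)");
}

int main(void)
{
    const int lte_sizes[] = { 128, 256, 512, 1024, 1536, 2048 };
    for (unsigned i = 0; i < sizeof(lte_sizes) / sizeof(lte_sizes[0]); ++i)
        factorize(lte_sizes[i]);
    return 0;
}
```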


International Conference on Wireless Communications and Mobile Computing | 2013

A 122Mb/s Turbo decoder using a mid-range GPU

Jiao Xianjun; Chen Canfeng; Pekka Jääskeläinen; Vladimír Guzma; Heikki Berg

Parallel implementations of Turbo decoding have been studied extensively. Traditionally, the number of parallel sub-decoders is limited in order to keep the code block error rate degradation caused by the edge effect of code block division acceptable. In addition, the sub-decoders require synchronization to exchange information in the iterative process. In this paper, we propose loosening the synchronization between the sub-decoders to achieve higher utilization of parallel processor resources. Our method allows a high degree of parallel processor utilization in the decoding of a single code block, providing a scalable software-based implementation. The proposed implementation is demonstrated using a graphics processing unit. We achieve 122.8 Mb/s decoding throughput using a mid-range GPU, the Nvidia GTX 480. This is, to the best of our knowledge, the fastest Turbo decoding throughput achieved with a GPU-based implementation.
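
The edge effect mentioned above arises because a code block is split into sub-blocks decoded in parallel; a common mitigation is to extend each sub-block with a small overlap window on each side so that boundary metrics can warm up. The sketch below only computes such sub-block boundaries with assumed sizes; it is not the GPU kernel from the paper.

```c
/* Sketch of dividing a turbo code block into P sub-blocks with a guard
 * (overlap) window on each side, one common way to limit the edge effect
 * of parallel decoding. The guard length and sub-block count are assumed
 * values; this is not the paper's GPU implementation. */
#include <stdio.h>

int main(void)
{
    const int block_len = 6144;   /* largest LTE turbo code block size */
    const int n_sub = 8;          /* parallel sub-decoders             */
    const int guard = 32;         /* warm-up window on each side       */
    const int sub_len = block_len / n_sub;

    for (int i = 0; i < n_sub; ++i) {
        int core_start = i * sub_len;
        int core_end   = core_start + sub_len;          /* exclusive */
        int win_start  = core_start - guard;
        int win_end    = core_end + guard;
        if (win_start < 0) win_start = 0;
        if (win_end > block_len) win_end = block_len;
        printf("sub-decoder %d: owns [%4d, %4d), processes [%4d, %4d)\n",
               i, core_start, core_end, win_start, win_end);
    }
    return 0;
}
```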


International Conference on Acoustics, Speech, and Signal Processing | 2013

Simplified floating-point division and square root

Timo Viitanen; Pekka Jääskeläinen; Otto Esko; Jarmo Takala

Digital Signal Processing (DSP) algorithms on low-power embedded platforms are often implemented using fixed-point arithmetic due to expected power and area savings over floating-point computation. However, recent research shows that floating-point arithmetic can be made competitive by using a reduced-precision format instead of, e.g., IEEE standard single precision, thereby avoiding the algorithm design and implementation difficulties associated with fixed-point arithmetic. This paper investigates the effects of simplified floating-point arithmetic applied to an FMA-based floating-point unit and the associated software division and square root operations. Software operations are proposed which attain near-exact precision with twice the performance of exact algorithms and resolve overflow-related errors with inexpensive exponent-manipulation special instructions.
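
As background on the general technique, software division and square root on an FMA-capable unit are commonly built from Newton-Raphson refinement of a reciprocal or reciprocal square root estimate, with each step expressed as fused multiply-adds. The sketch below shows that generic recipe using the C standard fmaf(); the seeds and iteration counts are arbitrary, and it does not reflect the paper's reduced-precision format or its special instructions.

```c
/* Generic sketch of FMA-based software division and square root via
 * Newton-Raphson refinement. Uses the C standard fmaf(); seeds and
 * iteration counts are arbitrary, inputs are assumed positive and finite.
 * Illustrates the general technique only, not the paper's design.
 * Compile with: cc fma_sketch.c -lm */
#include <math.h>
#include <stdio.h>

/* Refine r ~= 1/b with r' = r + r*(1 - b*r); each step is two FMAs. */
static float fma_recip(float b)
{
    int e;
    float m = frexpf(b, &e);          /* b = m * 2^e, m in [0.5, 1)      */
    float r = 48.0f / 17.0f - (32.0f / 17.0f) * m;   /* linear seed 1/m  */
    for (int i = 0; i < 3; ++i) {
        float err = fmaf(-m, r, 1.0f);     /* err = 1 - m*r              */
        r = fmaf(r, err, r);               /* r   = r + r*err            */
    }
    return ldexpf(r, -e);             /* scale back: 1/b = (1/m) * 2^-e  */
}

static float fma_div(float a, float b)
{
    return a * fma_recip(b);
}

/* Reciprocal square root seed from the classic bit manipulation, then
 * Newton-Raphson: y' = y * (1.5 - 0.5 * x * y * y); finally sqrt = x*y. */
static float fma_sqrt(float x)
{
    union { float f; unsigned u; } v = { x };
    v.u = 0x5f3759dfu - (v.u >> 1);
    float y = v.f;
    for (int i = 0; i < 3; ++i) {
        float t = fmaf(-0.5f * x * y, y, 1.5f);
        y *= t;
    }
    return x * y;
}

int main(void)
{
    printf("fma_div(355, 113) = %.7f  (ref %.7f)\n",
           fma_div(355.0f, 113.0f), 355.0f / 113.0f);
    printf("fma_sqrt(2)       = %.7f  (ref %.7f)\n",
           fma_sqrt(2.0f), sqrtf(2.0f));
    return 0;
}
```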


Signal Processing Systems | 2015

Code Density and Energy Efficiency of Exposed Datapath Architectures

Pekka Jääskeläinen; Heikki Kultala; Timo Viitanen; Jarmo Takala

Exposing details of the processor datapath to the programmer is motivated by improvements in energy efficiency and by the simplification of the microarchitecture. However, an instruction format that controls the datapath in a more explicit manner requires more expressiveness than one that implements more of the control logic in the processor hardware and presents conventional general-purpose-register-based instructions to the programmer. That is, programs for exposed datapath processors may require additional instruction memory bits to be fetched, which consumes additional energy. With interest in energy and power efficiency rising over the past decade, exposed datapath architectures have received renewed attention. Several variations of the details to expose to the programmer have been proposed in academia, and some exposed datapath features have also appeared in commercial architectures. The different proposed exposed datapath variations and their effects on energy efficiency, however, have not yet been analyzed publicly in a systematic manner. This article provides a review of exposed datapath approaches and highlights their differences. In addition, a set of interesting exposed datapath design choices is evaluated in a closer study. Because memories constitute a major component of power consumption in contemporary processors, we analyze instruction encodings for different exposed datapath variations and compare the energy required to fetch the additional instruction bits against the register file access savings achieved with the exposed datapath.
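
The trade-off can be framed as simple per-instruction arithmetic: extra instruction bits fetched cost energy, avoided register file accesses save energy, and the sign of the difference decides which encoding wins. The sketch below spells out that bookkeeping with placeholder numbers; none of the values are measurements from the article.

```c
/* Back-of-envelope model of the trade-off discussed above: extra
 * instruction bits fetched vs. register-file accesses avoided. All
 * parameter values are placeholders, not measurements from the article. */
#include <stdio.h>

int main(void)
{
    /* Placeholder technology parameters (energy per event, arbitrary units). */
    const double e_fetch_per_bit = 0.05;  /* instruction memory, per bit  */
    const double e_rf_access     = 1.00;  /* register file read or write  */

    /* Placeholder per-instruction differences for an exposed datapath. */
    const double extra_bits      = 16.0;  /* wider instruction word       */
    const double avoided_rf_acc  = 1.2;   /* bypassed reads and writes    */

    double fetch_overhead = extra_bits * e_fetch_per_bit;
    double rf_savings     = avoided_rf_acc * e_rf_access;

    printf("fetch overhead per instruction: %.2f\n", fetch_overhead);
    printf("RF savings per instruction:     %.2f\n", rf_savings);
    printf("net per instruction:            %+.2f (%s)\n",
           rf_savings - fetch_overhead,
           rf_savings > fetch_overhead ? "exposed datapath wins"
                                       : "conventional encoding wins");
    return 0;
}
```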

Collaboration


Dive into Pekka Jääskeläinen's collaborations.

Top Co-Authors

Jarmo Takala, Tampere University of Technology

Timo Viitanen, Tampere University of Technology

Heikki Kultala, Tampere University of Technology

Joonas Multanen, Tampere University of Technology

Matias Koskela, Tampere University of Technology

Teemu Pitkänen, Tampere University of Technology

Vladimír Guzma, Tampere University of Technology

Janne Helkala, Tampere University of Technology