
Publication


Featured research published by Heikki Kultala.


compilers, architecture, and synthesis for embedded systems | 2014

Heuristics for greedy transport triggered architecture interconnect exploration

Timo Viitanen; Heikki Kultala; Pekka Jääskeläinen; Jarmo Takala

Most power dissipation in Very Long Instruction Word (VLIW) processors occurs in their large, multi-port register files. Transport Triggered Architecture (TTA) is a VLIW variant whose exposed datapath reduces the need for RF accesses and ports. However, the comparative advantage of TTAs suffers in practice from a wide instruction word and a complex interconnection network (IC). We argue that these issues are at least partly due to suboptimal design choices. The design space of possible TTA architectures is very large, and previous automated and ad-hoc design methods often produce inefficient architectures. We propose a reduced design space in which efficient TTAs can be generated in a short time using execution trace-driven greedy exploration. The proposed approach is evaluated by optimizing the equivalent of a 4-issue VLIW architecture. The algorithm finishes quickly and produces a processor with a 10% lower core energy product compared to a fully-connected TTA. Since the generated processor has low IC power and a shorter instruction word than a typical 4-issue VLIW, the results support the hypothesis that these drawbacks of TTA can be worked around with efficient IC design.
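
The greedy exploration itself is not specified in the abstract; the sketch below is a hypothetical illustration of trace-driven greedy pruning, where interconnect connections are removed in order of least usage in the execution trace until an estimated cycle-count penalty budget is exhausted. The cost model and all names and numbers are invented.

```python
# Hypothetical sketch: prune interconnect connections greedily by
# trace usage. The cost model (usage count as a cycle-penalty proxy)
# and all data below are invented for illustration.

def greedy_prune(connections, usage, max_penalty):
    """Drop the least-used connections until the penalty budget runs out."""
    kept = set(connections)
    penalty = 0
    for conn in sorted(connections, key=lambda c: usage.get(c, 0)):
        cost = usage.get(conn, 0)
        if penalty + cost > max_penalty:
            break
        kept.discard(conn)
        penalty += cost
    return kept, penalty

# Toy bus-to-port connections annotated with execution-trace usage counts.
conns = ["b0->fu0", "b0->fu1", "b1->fu0", "b1->fu1"]
usage = {"b0->fu0": 120, "b0->fu1": 3, "b1->fu0": 0, "b1->fu1": 45}
kept, penalty = greedy_prune(conns, usage, max_penalty=10)
```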


signal processing systems | 2015

Code Density and Energy Efficiency of Exposed Datapath Architectures

Pekka Jääskeläinen; Heikki Kultala; Timo Viitanen; Jarmo Takala

Exposing details of the processor datapath to the programmer is motivated by improvements in energy efficiency and by the simplification of the microarchitecture. However, an instruction format that controls the datapath in a more explicit manner requires more expressiveness than one that implements more of the control logic in the processor hardware and presents conventional general-purpose register based instructions to the programmer. That is, programs for exposed datapath processors might require additional instruction memory bits to be fetched, which consumes additional energy. With interest in energy and power efficiency rising in the past decade, exposed datapath architectures have received renewed attention. Several variations of the additional details to expose to the programmer have been proposed in academia, and some exposed datapath features have also appeared in commercial architectures. The different variations of proposed exposed datapath architectures and their effects on energy efficiency, however, have not yet been analyzed in a systematic manner in the open literature. This article provides a review of exposed datapath approaches and highlights their differences. In addition, a set of interesting exposed datapath design choices is evaluated in a closer study. Because memories constitute a major component of power consumption in contemporary processors, we analyze instruction encodings for different exposed datapath variations and weigh the energy required to fetch the additional instruction bits against the register file access savings achieved with the exposed datapath.
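
The trade-off discussed above can be phrased as a back-of-envelope model: extra instruction bits cost fetch energy, while avoided register file accesses save access energy. The per-bit and per-access figures below are made-up placeholders, not measurements from the article.

```python
# Back-of-envelope model of the fetch-vs-register-file trade-off.
# All energy figures are invented placeholders, not measurements.

def net_energy_pj(extra_bits, fetch_pj_per_bit,
                  saved_rf_accesses, rf_pj_per_access):
    """Positive result: the exposed-datapath encoding saves energy here."""
    return saved_rf_accesses * rf_pj_per_access - extra_bits * fetch_pj_per_bit

# Example: 16 extra instruction bits vs. two avoided RF accesses.
gain = net_energy_pj(extra_bits=16, fetch_pj_per_bit=0.05,
                     saved_rf_accesses=2, rf_pj_per_access=1.2)
```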


international conference on embedded computer systems architectures modeling and simulation | 2014

Variable length instruction compression on Transport Triggered Architectures

Janne Helkala; Timo Viitanen; Heikki Kultala; Pekka Jääskeläinen; Jarmo Takala; Tommi Zetterman; Heikki Berg

The memories used in embedded microprocessor devices consume a large portion of the system’s power. The power dissipation of the instruction memory can be reduced by using code compression methods, which may require the use of variable-length instruction formats in the processor. Power-efficient design of variable-length instruction fetch and decode is challenging for static multiple-issue processors, which aim for low power consumption on embedded platforms: the memory-side power savings from compression are easily lost to an inefficient fetch unit design. We propose an implementation of instruction template-based compression and two instruction fetch alternatives for variable-length instruction encoding on a transport triggered architecture, a static multiple-issue exposed datapath architecture. With applications from the CHStone benchmark suite, the compression approach reaches, at best, an average compression ratio of 44%. We show that the variable-length fetch designs reduce the number of memory accesses and often allow the use of a smaller memory component. The proposed compression scheme reduced the energy consumption of synthesized benchmark processors by 15% and area by 33% on average.
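
As a rough illustration of template-based NOP compression (the template set, slot width, and field sizes here are invented, not the paper's encoding): each instruction is encoded with the smallest template whose explicit slots cover its used slots, and the remaining slots become implicit NOPs.

```python
# Invented encoding for a 4-slot instruction word; real TTA templates
# and field widths differ.
SLOT_BITS = 16         # assumed bits per explicitly encoded slot
TEMPLATE_ID_BITS = 2

TEMPLATES = {          # template id -> slots that are explicitly encoded
    0: (0,),           # slots 1-3 implicit NOP
    1: (0, 1),
    2: (0, 1, 2),
    3: (0, 1, 2, 3),   # full instruction word
}

def encode_size(instr):
    """Bits for one instruction (tuple of 4 booleans: slot used?)."""
    used = {i for i, u in enumerate(instr) if u}
    for tid, slots in TEMPLATES.items():
        if used <= set(slots):
            return TEMPLATE_ID_BITS + len(slots) * SLOT_BITS
    raise ValueError("no template fits")

program = [(True, False, False, False),
           (True, True, False, False),
           (True, True, True, True)]
compressed = sum(encode_size(i) for i in program)
uncompressed = len(program) * 4 * SLOT_BITS
ratio = compressed / uncompressed
```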


international conference on acoustics, speech, and signal processing | 2014

A high throughput LDPC decoder using a mid-range GPU

Xie Wen; Jiao Xianjun; Pekka Jääskeläinen; Heikki Kultala; Chen Canfeng; Heikki Berg; Bie Zhisong

This paper implements an LDPC decoder approaching the standard's peak throughput on a mid-range GPU. The Turbo-Decoding Message-Passing (TDMP) algorithm is applied to achieve high throughput. Unlike the traditional approach of host-managed multi-streaming to hide host-device transfer delay, we use a kernel-maintained data transfer scheme to achieve implicit data transfer between host memory and device shared memory, which eliminates an intermediate stage in global memory. Data type optimization, memory access optimization, and a low-complexity Soft-In Soft-Out (SISO) algorithm are also used to improve efficiency. With these optimizations, the 802.11n LDPC decoder on an NVIDIA GTX480 GPU (a Fermi-architecture part released in 2010) achieves a throughput of 295 Mb/s when decoding 512 codewords simultaneously, close to the highest bit rate of 300 Mb/s with 20 MHz bandwidth in the 802.11n standard. Decoding 1024 and 4096 codewords achieves 330 and 365 Mb/s, respectively. An 802.16e LDPC decoder is also implemented, achieving throughputs of 374 Mb/s (512 codewords), 435 Mb/s (1024 codewords), and 507 Mb/s (4096 codewords).
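
The abstract does not detail the per-node arithmetic; a common building block in such decoders is the (simplified) min-sum check-node update, shown below in plain Python purely for clarity rather than as the paper's GPU kernel.

```python
# Standard (simplified) min-sum check-node update for LDPC decoding,
# shown in plain Python; not the paper's GPU implementation.

def check_node_min_sum(llrs):
    """Extrinsic message per edge: product of the *other* inputs' signs
    times the minimum magnitude of the other inputs."""
    sign_prod = 1
    for v in llrs:
        sign_prod *= -1 if v < 0 else 1
    out = []
    for i, v in enumerate(llrs):
        others = llrs[:i] + llrs[i + 1:]
        mag = min(abs(x) for x in others)
        own_sign = -1 if v < 0 else 1
        out.append(sign_prod * own_sign * mag)  # divide out own sign
    return out

msgs = check_node_min_sum([2.0, -1.5, 0.5])
```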


international conference on wireless communications and mobile computing | 2013

Turbo decoding on tailored OpenCL processor

Heikki Kultala; Otto Esko; Pekka Jääskeläinen; Vladimír Guzma; Jarmo Takala; Jiao Xianjun; Tommi Zetterman; Heikki Berg

Turbo coding is commonly used in current wireless standards such as 3G and 4G. However, due to the high computational requirements, its software-defined implementation is challenging. This paper proposes a static multi-issue exposed datapath processor design tailored for turbo decoding. In order to utilize the parallel processor datapath efficiently without resorting to low-level assembly programming, the turbo decoder is implemented using OpenCL, a parallel programming standard for heterogeneous devices. The proposed implementation includes only a small set of Turbo-specific custom operations to accelerate the most critical parts of the algorithm. Most of the computation is performed using general-purpose integer operations; thus, the processor design can also be used as a general-purpose OpenCL accelerator for arbitrary integer workloads. The proposed processor design was evaluated both by implementing it on a Xilinx Virtex 6 FPGA and by ASIC synthesis using 130 nm and 40 nm technology libraries. The implementation achieves over 63 Mbps turbo decoding throughput on a single low-power core. According to the ASIC synthesis, the maximum operating clock frequency is 344 MHz (130 nm) / 1050 MHz (40 nm).
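
The Turbo-specific custom operations are not enumerated in the abstract; one typical candidate in (max-)log-MAP turbo decoders is the max* operator, sketched here as an assumed example for illustration, not as an operation confirmed by the paper.

```python
# The max* operator: max*(a, b) = log(exp(a) + exp(b)).
# Max-log-MAP decoders drop the correction term and keep only max(a, b).
import math

def max_star(a, b, correction=True):
    """Jacobian logarithm; an assumed example of a turbo-decoder primitive."""
    m = max(a, b)
    if correction:
        return m + math.log1p(math.exp(-abs(a - b)))
    return m

exact = max_star(1.0, 0.0)                     # log(e + 1)
approx = max_star(1.0, 0.0, correction=False)  # plain max
```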


embedded and ubiquitous computing | 2008

Reducing Context Switch Overhead with Compiler-Assisted Threading

Pekka Jääskeläinen; Pertti Kellomäki; Jarmo Takala; Heikki Kultala; Mikael Lepistö

Multithreading is an important software modularization technique. However, it can incur substantial overheads, especially in processors where the amount of architecturally visible state is large. We propose an implementation technique for co-operative multithreading, where context switches occur in places that minimize the amount of state that needs to be saved. The subset of processor state saved during each context switch is based on where the switch occurs. We have validated the approach by an empirical study of resource usage in basic blocks, and by implementing the co-operative threading in our compiler. Performance figures are given for an MP3 player utilizing the threading implementation.
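
A minimal sketch of the underlying idea, with invented liveness data: among candidate co-operative switch points, pick the one where the fewest registers are live, so the context switch saves and restores the least state.

```python
# Invented liveness data: choose the co-operative switch point with the
# smallest live register set, minimizing saved/restored context.

def best_switch_point(live_regs):
    """Return the candidate point whose live register set is smallest."""
    return min(live_regs, key=lambda p: len(live_regs[p]))

live = {
    "end_of_block_1": {"r1", "r2", "r5", "r7"},
    "end_of_block_2": {"r1"},
    "end_of_block_3": {"r1", "r2"},
}
point = best_switch_point(live)
context_size = len(live[point])  # registers to save at the chosen point
```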


ACM Transactions on Architecture and Code Optimization | 2016

Integer Linear Programming-Based Scheduling for Transport Triggered Architectures

Tomi Äijö; Pekka Jääskeläinen; Tapio Elomaa; Heikki Kultala; Jarmo Takala

Static multi-issue machines, such as traditional Very Long Instruction Word (VLIW) architectures, move complexity from the hardware to the compiler. This is motivated by the ability to support high degrees of instruction-level parallelism without requiring complicated scheduling logic in the processor hardware. The simpler control hardware results in reduced area and power consumption, but leads to the challenge of engineering a compiler with good code-generation quality. Transport triggered architectures (TTA), and other so-called exposed datapath architectures, take the compiler-oriented philosophy even further by pushing more details of the datapath under software control. The main benefit of this is reduced register file pressure, at the cost of adding even more complexity to the compiler side. In this article, we propose an Integer Linear Programming (ILP)-based instruction scheduling model for TTAs. The model describes the architecture characteristics, the particular processor resource constraints, and the operation dependencies of the scheduled program. The model is validated and measured by compiling application kernels to various TTAs with different numbers of datapath components and connectivity. In the best case, the cycle count is reduced to 52% of that of a heuristic scheduler. In addition to producing shorter schedules, the number of register accesses in the compiled programs is generally notably lower than with the heuristic scheduler; in the best case, the ILP scheduler reduced the number of register file reads to 33% of the heuristic results and register file writes to 18%. On the other hand, as expected, the ILP-based scheduler takes distinctly longer to produce a schedule than the heuristic scheduler, but the compilation time remains within tolerable limits for production-code generation.
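
The actual ILP model requires a solver; as a solver-free toy analogue, the snippet below exhaustively searches schedules for a tiny invented dependence graph and compares the optimum against one poorly chosen heuristic priority order, mirroring the article's optimal-vs-heuristic comparison. The DAG, latencies, and issue width are all invented.

```python
# Toy optimal-vs-heuristic scheduling comparison on an invented DAG.
from itertools import permutations

DEPS = {"a": [], "b": [], "c": ["a"], "d": ["b"], "e": ["c", "d"]}
LAT = {"a": 2, "b": 1, "c": 1, "d": 1, "e": 1}   # result-ready latencies
WIDTH = 1                                         # issue slots per cycle

def schedule_length(order):
    """Place ops in priority order at the earliest legal, free cycle."""
    cycle_of, busy = {}, {}
    for op in order:
        c = max((cycle_of[d] + LAT[d] for d in DEPS[op]), default=0)
        while busy.get(c, 0) >= WIDTH:
            c += 1
        cycle_of[op] = c
        busy[c] = busy.get(c, 0) + 1
    return max(cycle_of[o] + LAT[o] for o in cycle_of)

def is_topo(order):
    seen = set()
    for op in order:
        if any(d not in seen for d in DEPS[op]):
            return False
        seen.add(op)
    return True

# Exhaustive search over valid priority orders stands in for the ILP optimum.
optimal = min(schedule_length(o) for o in permutations(DEPS) if is_topo(o))
heuristic = schedule_length(("b", "d", "a", "c", "e"))  # a poor priority order
```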


international conference on computer graphics and interactive techniques | 2015

MergeTree: a HLBVH constructor for mobile systems

Timo Viitanen; Matias Koskela; Pekka Jääskeläinen; Heikki Kultala; Jarmo Takala

Powerful hardware accelerators have recently been developed that bring interactive ray tracing within reach even of mobile devices. However, supplying the rendering unit with up-to-date acceleration trees remains difficult, so the rendered scenes are mostly static. The restricted memory bandwidth of a mobile device makes it challenging to apply GPU-based tree construction algorithms. This paper describes MergeTree, a BVH tree constructor architecture based on the HLBVH algorithm, whose main features of interest are a streaming hierarchy emitter, an external sorting algorithm with provably minimal memory usage, and a hardware priority queue used to accelerate the external sort. In simulations, the resulting unit is faster by a factor of three than the state-of-the-art hardware builder based on the binned SAH sweep algorithm.
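
The hardware priority queue accelerates the k-way merge at the heart of the external sort; a minimal software analogue using Python's heapq is shown below, merging pre-sorted runs of (toy) Morton codes. This illustrates the general technique, not the MergeTree pipeline itself.

```python
# k-way merge of pre-sorted runs driven by a priority queue (heapq),
# the software analogue of a hardware priority queue in an external sort.
import heapq

def kway_merge(runs):
    """Merge pre-sorted runs (e.g. lists of Morton codes) into one stream."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)   # smallest head among all runs
        out.append(val)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

merged = kway_merge([[1, 4, 9], [2, 3, 8], [5, 7]])
```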


signal processing systems | 2014

Compiler optimizations for code density of variable length instructions

Heikki Kultala; Timo Viitanen; Pekka Jääskeläinen; Janne Helkala; Jarmo Takala

Variable-length encoding can considerably decrease code size in VLIW processors by reducing the number of bits wasted on encoding No Operations (NOPs). A processor may have different instruction templates in which different execution slots are implicitly NOPs, but not all combinations of NOPs may be supported by the instruction templates. The efficiency of the NOP encoding can be improved by having the compiler place NOPs in such a way that the usage of implicit NOPs is maximized. Two methods of optimizing the use of the implicit NOP slots are evaluated: prioritizing function units that have fewer implicit NOPs associated with them, and a post-pass to the instruction scheduler which exploits the slack of the schedule by rescheduling operations with slack into different instruction words so that the available instruction templates are better utilized. The post-pass optimizer saved 2.5 % of instruction memory on average and 9.1 % at best, without performance loss. Prioritizing function units gave best-case instruction memory savings of 12.7 %, but the average savings were only 1.0 % and there was an average slowdown of 5.7 %.


asilomar conference on signals, systems and computers | 2011

Operation set customization in retargetable compilers

Heikki Kultala; Pekka Jääskeläinen; Jarmo Takala

The core tool in Application-Specific Instruction Set Processor (ASIP) design toolsets is a retargetable compiler, which can generate efficient code for any processor developed with the toolset. Such a compiler must automatically adapt itself to the operation set supported by the designed processor, emulating missing instructions with other instructions and selecting custom instructions automatically whenever possible. This paper proposes a simplified Directed Acyclic Graph-based recursive mechanism to support operation set customization. The proposed mechanism is capable of generating instruction selectors and architecture simulation models automatically, and is thus suitable for fast design space exploration of ASIP operation sets.
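
The recursive expansion can be sketched as follows. The rewrite rules and operation names are invented for illustration (based on the two's-complement identity sub(a, b) = add(a, inc(not(b)))), not the toolset's actual rule set, and only operation names are modeled, not operands.

```python
# Invented rewrite rules: an unsupported op expands into the ops needed
# to emulate it; expansion recurses until everything is supported.
RULES = {"sub": ["add", "neg"], "neg": ["not", "inc"]}

def legalize(op, supported, rules):
    """Recursively expand `op` into supported ops; None if impossible."""
    if op in supported:
        return [op]
    if op not in rules:
        return None
    out = []
    for sub_op in rules[op]:
        expansion = legalize(sub_op, supported, rules)
        if expansion is None:
            return None
        out.extend(expansion)
    return out

supported = {"add", "not", "inc"}
ops_needed = legalize("sub", supported, RULES)
```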

Collaboration


Dive into Heikki Kultala's collaboration.

Top Co-Authors:

- Pekka Jääskeläinen, Tampere University of Technology
- Jarmo Takala, Tampere University of Technology
- Timo Viitanen, Tampere University of Technology
- Joonas Multanen, Tampere University of Technology
- Janne Helkala, Tampere University of Technology
- Matias Koskela, Tampere University of Technology