Hotspot


Dive into the research topics where Timo Viitanen is active.

Publication


Featured research published by Timo Viitanen.


Compilers, Architecture, and Synthesis for Embedded Systems | 2014

Heuristics for greedy transport triggered architecture interconnect exploration

Timo Viitanen; Heikki Kultala; Pekka Jääskeläinen; Jarmo Takala

Most power dissipation in Very Long Instruction Word (VLIW) processors occurs in their large, multi-port register files (RFs). Transport Triggered Architecture (TTA) is a VLIW variant whose exposed datapath reduces the need for RF accesses and ports. However, the comparative advantage of TTAs suffers in practice from a wide instruction word and a complex interconnection network (IC). We argue that these issues are at least partly due to suboptimal design choices. The design space of possible TTA architectures is very large, and previous automated and ad hoc design methods often produce inefficient architectures. We propose a reduced design space in which efficient TTAs can be generated in a short time using execution trace-driven greedy exploration. The proposed approach is evaluated by optimizing the equivalent of a 4-issue VLIW architecture. The algorithm finishes quickly and produces a processor with a 10% reduced core energy product compared to a fully connected TTA. Since the generated processor has low IC power and a shorter instruction word than a typical 4-issue VLIW, the results support the hypothesis that these drawbacks of TTA can be worked around with efficient IC design.
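A minimal sketch of the trace-driven greedy idea: start from a fully connected interconnect and repeatedly drop the connection whose removal hurts a trace-derived cost estimate the least, stopping when a budget would be exceeded. The cost model and connection names below are invented for illustration; the paper's exploration operates on real architecture descriptions and execution traces.

```python
# Hypothetical sketch of trace-driven greedy interconnect pruning.
# Connection names and the cost model are illustrative, not the paper's.

def greedy_prune(connections, cost, max_cost):
    """Remove connections one by one, always taking the removal that
    increases the trace-driven cost estimate the least; stop before
    the budget is exceeded."""
    current = set(connections)
    while len(current) > 1:
        best = min(current, key=lambda c: cost(current - {c}))
        if cost(current - {best}) > max_cost:
            break
        current.remove(best)
    return current

# Toy cost: each removed connection adds its trace usage count.
usage = {"bus0->alu": 90, "bus1->alu": 5, "bus0->lsu": 40, "bus1->lsu": 2}
cost = lambda conns: sum(u for c, u in usage.items() if c not in conns)
kept = greedy_prune(usage, cost, max_cost=10)
```

The rarely used connections are pruned first, which is why a trace-driven heuristic can find small interconnects without exhaustively searching the design space.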


International Conference on Acoustics, Speech, and Signal Processing | 2013

Simplified floating-point division and square root

Timo Viitanen; Pekka Jääskeläinen; Otto Esko; Jarmo Takala

Digital Signal Processing (DSP) algorithms on low-power embedded platforms are often implemented using fixed-point arithmetic due to expected power and area savings over floating-point computation. However, recent research shows that floating-point arithmetic can be made competitive by using a reduced-precision format instead of, e.g., IEEE standard single precision, thereby avoiding the algorithm design and implementation difficulties associated with fixed-point arithmetic. This paper investigates the effects of simplified floating-point arithmetic applied to an FMA-based floating-point unit and the associated software division and square root operations. Software operations are proposed which attain near-exact precision with twice the performance of exact algorithms and resolve overflow-related errors with inexpensive exponent-manipulation special instructions.
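FMA-based software division is conventionally built on the Newton-Raphson reciprocal recurrence, which the sketch below illustrates. The seed constants and iteration count are the textbook choices, not necessarily the paper's, and Python's plain floats stand in for a hardware FMA (each step here is rounded twice rather than once).

```python
import math

def recip_newton(b, iterations=4):
    # Scale b = m * 2**e with m in [0.5, 1) so the iteration works on a
    # bounded operand and cannot overflow.
    m, e = math.frexp(b)
    # Textbook linear seed: relative error at most 1/17 on [0.5, 1).
    x = 48.0 / 17.0 - 32.0 / 17.0 * m
    # Each Newton step squares the relative error; in hardware both
    # multiplies would map onto fused multiply-add operations.
    for _ in range(iterations):
        x = x * (2.0 - m * x)
    # Undo the input scaling: 1/b = (1/m) * 2**-e.
    return math.ldexp(x, -e)

def div_newton(a, b):
    return a * recip_newton(b)
```

Four iterations bring the seed's 1/17 relative error below double-precision epsilon; a correctly rounded result additionally needs a final FMA-based remainder check, which is where the precision/performance trade-off in the abstract comes from.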


Signal Processing Systems | 2015

Code Density and Energy Efficiency of Exposed Datapath Architectures

Pekka Jääskeläinen; Heikki Kultala; Timo Viitanen; Jarmo Takala

Exposing details of the processor datapath to the programmer is motivated by improved energy efficiency and a simpler microarchitecture. However, an instruction format that controls the datapath more explicitly requires more expressiveness than one that implements more of the control logic in processor hardware and presents conventional general-purpose-register-based instructions to the programmer. That is, programs for exposed datapath processors might require additional instruction memory bits to be fetched, which consumes additional energy. With interest in energy and power efficiency rising over the past decade, exposed datapath architectures have received renewed attention. Several variations of the additional details to expose to the programmer have been proposed in academia, and some exposed datapath features have also appeared in commercial architectures. The different proposed exposed datapath architectures and their effects on energy efficiency, however, have not previously been analyzed in a systematic manner in public. This article reviews exposed datapath approaches and highlights their differences. In addition, a set of interesting exposed datapath design choices is evaluated in a closer study. Since memories constitute a major component of power consumption in contemporary processors, we analyze instruction encodings for different exposed datapath variations and compare the energy required to fetch the additional instruction bits against the register file access savings achieved with the exposed datapath.


Signal Processing Systems | 2013

Inexpensive correctly rounded floating-point division and square root with input scaling

Timo Viitanen; Pekka Jääskeläinen; Jarmo Takala

Recent embedded DSPs incorporate IEEE-compliant floating-point arithmetic to ease the development of, e.g., multiple-antenna MIMO in software-defined radio. An obvious choice of FPU architecture for a DSP is to include a fused multiply-add (FMA) operation, which accelerates most DSP applications. Another advantage of FMA is that it enables fast software algorithms for, e.g., division and square root without much additional hardware. However, it is nontrivial for these algorithms to produce the correctly rounded result at the target accuracy without danger of overflow. Previous FMA-based systems either rely on a power-hungry wide intermediate format or forego correct rounding. A wide format is unattractive in a power-sensitive embedded environment since it requires enlarged register files, wider data buses, and possibly a larger multiplier. We present provably correct algorithms for efficient IEEE-compliant division and square root with only a 32-bit format, using hardware prescaling and postscaling steps. The required hardware has approximately 8% of the area and power footprint of a single FMA unit.
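The prescale/postscale idea can be illustrated with exponent manipulation alone. This is a loose sketch, not the paper's algorithm: Python's `/` stands in for the FMA-based division core, and `frexp`/`ldexp` stand in for the inexpensive exponent-manipulation instructions.

```python
import math

def scaled_divide(a, b):
    # Prescale: write b = m * 2**e with m in [0.5, 1), so the core
    # division sees a divisor with a harmless exponent.
    m, e = math.frexp(b)
    # Core division (an FMA-based software division in the paper).
    q = a / m
    # Postscale: fold the removed exponent back into the quotient.
    return math.ldexp(q, -e)
```

Because the scaling factors are exact powers of two, pre- and postscaling introduce no rounding error of their own; they only move the overflow/underflow hazard out of the iterative core.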


International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2014

Variable length instruction compression on Transport Triggered Architectures

Janne Helkala; Timo Viitanen; Heikki Kultala; Pekka Jääskeläinen; Jarmo Takala; Tommi Zetterman; Heikki Berg

The memories used in embedded microprocessor devices consume a large portion of the system's power. The power dissipation of the instruction memory can be reduced with code compression methods, which may require variable-length instruction formats in the processor. Power-efficient design of variable-length instruction fetch and decode is challenging for static multiple-issue processors, which aim for low power consumption on embedded platforms, and the memory-side power savings from compression are easily lost to an inefficient fetch unit design. We propose an implementation of instruction template-based compression and two instruction fetch alternatives for variable-length instruction encoding on a transport triggered architecture, a static multiple-issue exposed datapath architecture. With applications from the CHStone benchmark suite, the compression approach reaches a best-case average compression ratio of 44%. We show that the variable-length fetch designs reduce the number of memory accesses and often allow the use of a smaller memory component. The proposed compression scheme reduced the energy consumption of synthesized benchmark processors by 15% and the area by 33% on average.
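The template idea can be illustrated with a toy encoder: each instruction word picks the template whose implicit-NOP slots cover the word's actual NOPs, and only the remaining slots are stored explicitly. Slot widths, the template set, and the example program below are invented for illustration; they are not from the paper.

```python
# Toy sketch of instruction-template NOP compression (all numbers
# and templates are invented for illustration).

SLOT_BITS = 16
TEMPLATE_ID_BITS = 2
# Each template lists the execution slots that are implicit NOPs.
TEMPLATES = [set(), {1}, {2}, {1, 2}]

def encode(word):
    """word: tuple of 3 slot operations, None meaning NOP.  Pick the
    template with the most implicit NOPs that covers only actual NOP
    slots, then count the bits for the template id and explicit slots."""
    nops = {i for i, op in enumerate(word) if op is None}
    usable = [t for t in TEMPLATES if t <= nops]
    t = max(usable, key=len)
    explicit = len(word) - len(t)
    return TEMPLATE_ID_BITS + explicit * SLOT_BITS

program = [("add", None, None), ("add", "mul", None), ("add", None, "ld")]
compressed = sum(encode(w) for w in program)
uncompressed = len(program) * 3 * SLOT_BITS
```

Here the toy program shrinks from 144 to 86 bits, a ratio of about 60%; the fetch unit's job is to reassemble full-width words from these variable-length encodings without wasting the saved energy.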


International Symposium on Visual Computing | 2016

Foveated Path Tracing

Matias Koskela; Timo Viitanen; Pekka Jääskeläinen; Jarmo Takala

Virtual Reality (VR) places demanding requirements on the rendering pipeline: the rendering is stereoscopic, and the refresh rate should be as high as 95 Hz to make VR immersive. One promising technique for making the final push to meet these requirements is foveated rendering, where the rendering effort is concentrated on the areas where the user's gaze lies. This requires rapid adjustment of the level of detail based on screen-space coordinates. Path tracing allows such changes without much extra work; however, real-time path tracing is a fairly new concept. This paper is a literature review of techniques for optimizing path tracing with foveated rendering. In addition, we provide a theoretical estimate of the available performance gains and calculate that 94% of the paths could be omitted. For this reason, we predict that path tracing can soon meet the demanding rendering requirements of VR.
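The scale of such savings can be illustrated with a toy model. The falloff below is a crude 1/eccentricity stand-in for a visual acuity model (the paper derives a more careful estimate), and the field-of-view and inner-radius numbers are assumptions.

```python
import math

def retained_fraction(width, height, fov_deg=110.0, inner_deg=5.0):
    """Toy estimate of the fraction of paths kept under foveated
    rendering: full sampling within inner_deg of the gaze point (here
    the screen center), then sample density falling off as
    inner_deg / eccentricity.  Purely illustrative numbers."""
    cx, cy = width / 2, height / 2
    deg_per_px = fov_deg / width
    kept = total = 0.0
    for y in range(height):
        for x in range(width):
            ecc = math.hypot(x - cx, y - cy) * deg_per_px
            kept += 1.0 if ecc <= inner_deg else inner_deg / ecc
            total += 1.0
    return kept / total
```

Even this crude model keeps well under half of the samples for a wide field of view, which is the intuition behind the paper's 94% figure.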


International Conference on Computer Graphics and Interactive Techniques | 2017

Foveated instant preview for progressive rendering

Matias Koskela; Kalle Immonen; Timo Viitanen; Pekka Jääskeläinen; Joonas Multanen; Jarmo Takala

Progressive rendering, for example Monte Carlo rendering of 360° content for virtual reality headsets, is a time-consuming task. If the 3D artist notices an error while previewing the rendering, he or she must return to editing mode, make the required changes, and restart the rendering, because the rendering system cannot know which pixels are affected by the change. We propose the use of eye-tracking-based optimization to significantly speed up previewing of the artist's points of interest. Moreover, we derive an optimized version of the visual acuity model which follows the original model more accurately than previous work. The proposed optimization was tested with a comprehensive user study. The participants felt that previews with the proposed method converged instantly, and the recorded split times show that the preview is 10 times faster than a conventional preview. In addition, the system has no measurable drawbacks in computational performance.


International Conference on Computer Graphics and Interactive Techniques | 2016

Multi bounding volume hierarchies for ray tracing pipelines

Timo Viitanen; Matias Koskela; Pekka Jääskeläinen; Jarmo Takala

High-performance ray tracing on CPUs is now largely based on Multi Bounding Volume Hierarchy (MBVH) trees. We apply MBVH to a fixed-function ray tracing accelerator architecture. According to cycle-level simulations and power analysis, MBVH reduces energy per frame by an average of 24% and improves performance per area by 19% in scenes with incoherent rays, thanks to its compact memory layout, which reduces DRAM traffic. With primary rays, energy efficiency improves by 15% and performance per area by 20%.
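The core of MBVH traversal is testing one ray against all children of a wide node in a single step. Below is a scalar sketch of a 4-wide slab test; the node layout is illustrative, and a real accelerator would run the four box tests in parallel. The 1e30 inverse-direction components stand in for an unbounded inverse on axes the ray does not travel along.

```python
# Minimal sketch of a 4-wide MBVH node test: one ray against four
# child AABBs stored together.  Layout and example data are invented.

def intersect4(node, origin, inv_dir, t_max):
    """node: dict with 'lo' and 'hi', each a list of four (x, y, z)
    corners.  Returns indices of the children the ray hits, using the
    standard slab test per child."""
    hits = []
    for i in range(4):
        t_near, t_far = 0.0, t_max
        for axis in range(3):
            t0 = (node["lo"][i][axis] - origin[axis]) * inv_dir[axis]
            t1 = (node["hi"][i][axis] - origin[axis]) * inv_dir[axis]
            t_near = max(t_near, min(t0, t1))
            t_far = min(t_far, max(t0, t1))
        if t_near <= t_far:
            hits.append(i)
    return hits

# Four unit boxes; a +x ray through y=z=0.5 hits children 0 and 2.
node = {
    "lo": [(0, 0, 0), (0, 2, 0), (2, 0, 0), (0, 0, 2)],
    "hi": [(1, 1, 1), (1, 3, 1), (3, 1, 1), (1, 1, 3)],
}
hits = intersect4(node, (-1.0, 0.5, 0.5), (1.0, 1e30, 1e30), 100.0)
```

Packing four child boxes into one node is what gives MBVH its compact memory layout: one node fetch resolves four traversal decisions, reducing DRAM traffic for incoherent rays.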


International Conference on Computer Graphics and Interactive Techniques | 2015

MergeTree: a HLBVH constructor for mobile systems

Timo Viitanen; Matias Koskela; Pekka Jääskeläinen; Heikki Kultala; Jarmo Takala

Powerful hardware accelerators have recently been developed that put interactive ray tracing within the reach of mobile devices. However, supplying the rendering unit with up-to-date acceleration trees remains difficult, so the rendered scenes are mostly static. The restricted memory bandwidth of a mobile device is a challenge when applying GPU-based tree construction algorithms. This paper describes MergeTree, a BVH constructor architecture based on the HLBVH algorithm, whose main features of interest are a streaming hierarchy emitter, an external sorting algorithm with provably minimal memory usage, and a hardware priority queue used to accelerate the external sort. In simulations, the resulting unit is three times faster than the state-of-the-art hardware builder based on the binned SAH sweep algorithm.
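HLBVH-style builders first assign each primitive a Morton code, so that sorting by the code yields a spatially coherent ordering for hierarchy emission. The sketch below is the well-known 30-bit bit-interleaving trick, not MergeTree's hardware implementation.

```python
def expand_bits(v):
    """Spread the low 10 bits of v so there are two zero bits between
    each (the standard step for 30-bit Morton codes)."""
    v = (v * 0x00010001) & 0xFF0000FF
    v = (v * 0x00000101) & 0x0F00F00F
    v = (v * 0x00000011) & 0xC30C30C3
    v = (v * 0x00000005) & 0x49249249
    return v

def morton3d(x, y, z):
    """30-bit Morton code for a point with coordinates in [0, 1).
    HLBVH builders sort primitive centroids by this key; sorted order
    then maps directly onto a spatial hierarchy."""
    xi = min(max(int(x * 1024.0), 0), 1023)
    yi = min(max(int(y * 1024.0), 0), 1023)
    zi = min(max(int(z * 1024.0), 0), 1023)
    return (expand_bits(xi) << 2) | (expand_bits(yi) << 1) | expand_bits(zi)
```

Because the sort is the dominant cost, MergeTree's external sorting algorithm and hardware priority queue target exactly this step under mobile memory-bandwidth constraints.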


Signal Processing Systems | 2014

Compiler optimizations for code density of variable length instructions

Heikki Kultala; Timo Viitanen; Pekka Jääskeläinen; Janne Helkala; Jarmo Takala

Variable-length encoding can considerably decrease code size in VLIW processors by decreasing the number of bits wasted on encoding No Operations (NOPs). A processor may have different instruction templates in which different execution slots are implicitly NOPs, but not all combinations of NOPs may be supported by the instruction templates. The efficiency of the NOP encoding can be improved by having the compiler place NOPs so that the usage of implicit NOPs is maximized. Two methods of optimizing the use of the implicit NOP slots are evaluated: prioritizing function units that have fewer implicit NOPs associated with them, and a post-pass to the instruction scheduler that exploits the slack of the schedule by rescheduling operations with slack into different instruction words so that the available instruction templates are better utilized. The post-pass optimizer saved an average of 2.5% and at best 9.1% of instruction memory, without performance loss. Prioritizing function units gave best-case instruction memory savings of 12.7%, but the average savings were only 1.0% and there was an average slowdown of 5.7%.
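The post-pass can be sketched as a legality check on moving an operation within its slack: a move is accepted only if both affected instruction words end up with a NOP pattern that some template encodes implicitly. The slot count, template set, and example below are invented for illustration.

```python
# Toy post-pass: move an operation within its slack so the NOP pattern
# of each instruction word matches an available template (invented data).

TEMPLATES = [frozenset(), frozenset({1}), frozenset({0, 1})]

def covered(word):
    """True if some template implicitly encodes exactly this word's NOPs."""
    nops = frozenset(i for i, op in enumerate(word) if op is None)
    return any(t == nops for t in TEMPLATES)

def reschedule(words, op, src, dst, slot):
    """Move op from words[src][slot] to words[dst][slot], assuming dst
    lies within the op's slack; accept the move only if both affected
    words end up with a template-covered NOP pattern."""
    trial = [list(w) for w in words]
    trial[src][slot] = None
    trial[dst][slot] = op
    if covered(trial[src]) and covered(trial[dst]):
        return [tuple(w) for w in trial]
    return words

words = [("add", "mul"), ("sub", None)]
moved = reschedule(words, "mul", src=0, dst=1, slot=1)
```

Moving "mul" down one cycle lets word 0 use the implicit-NOP template for slot 1, so its NOP no longer costs any instruction bits; this is the mechanism behind the reported memory savings.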

Collaboration


Timo Viitanen's most frequent collaborators.

Top Co-Authors

Jarmo Takala (Tampere University of Technology)
Pekka Jääskeläinen (Tampere University of Technology)
Heikki Kultala (Tampere University of Technology)
Matias Koskela (Tampere University of Technology)
Joonas Multanen (Tampere University of Technology)
Janne Helkala (Tampere University of Technology)
Kalle Immonen (Tampere University of Technology)
Aleksi Tervo (Tampere University of Technology)