Heikki Berg | Researchain

Publication

Featured research published by Heikki Berg.

International Journal of Parallel Programming | 2015

pocl: A Performance-Portable OpenCL Implementation

Pekka Jääskeläinen; Carlos S. de La Lama; Kalle Raiskila; Jarmo Takala; Heikki Berg

OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear: multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions, which enables support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both those already commercialized and those still under research. The paper describes how the portability of the implementation is achieved. We test the two aspects of portability by using the kernel compiler and the OpenCL implementation to run OpenCL applications on various platforms with different styles of parallel resources. The results show that most of the benchmarked applications, when compiled using pocl, were faster than or close to as fast as the best proprietary OpenCL implementation for the platform at hand.
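One way to picture parallel region formation is as splitting a kernel at its barriers and running each resulting region for every work-item before moving to the next region. The Python sketch below is only a conceptual illustration under that assumption: the kernel, the function name `run_work_group` and the doubling operation are hypothetical, and the actual implementation works on LLVM IR rather than on source-level loops.

```python
def run_work_group(src, local_size):
    """Execute a hypothetical kernel for one work-group.

    Hypothetical kernel body, per work-item `lid`:
        scratch[lid] = 2 * src[lid]
        barrier()
        dst[lid] = scratch[local_size - 1 - lid]
    """
    scratch = [0] * local_size  # stands in for work-group local memory
    dst = [0] * local_size

    # Parallel region 1: everything before the barrier. The iterations are
    # independent, so a target could map them to SIMD lanes or issue slots.
    for lid in range(local_size):
        scratch[lid] = 2 * src[lid]

    # The barrier is satisfied here: region 1 has finished for all work-items.

    # Parallel region 2: everything after the barrier.
    for lid in range(local_size):
        dst[lid] = scratch[local_size - 1 - lid]

    return dst
```

Because each loop has no dependencies between iterations, the same region structure can in principle be handed to different target-specific mappings without changing the kernel.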

Consumer Communications and Networking Conference | 2006

Advanced coding schemes for a multiband OFDM ultrawideband system towards 1 Gbps

Timo Erkki Lunttila; Sassan Iraji; Heikki Berg

Advanced coding schemes such as parallel concatenated zigzag (PCZZ) and LDPC codes are shown to provide performance close to turbo codes. In this paper we propose PCZZ and structured LDPC coding for an MB-OFDM UWB system with 16-QAM modulation aiming at increasing the data rate of the current system up to 1 Gbps. We evaluate the performance of the PCZZ and LDPC coding schemes in such a system. In particular, it is shown that the PCZZ and LDPC codes provide 3-4 dB gain in packet error rate compared to convolutional codes. We also address the link budget and achievable ranges for such a system with different coding rates. The proposed PCZZ and structured LDPC codes may be considered as a potential channel coding scheme for the future Multiband OFDM UWB systems providing data rates up to 1 Gbps or higher.

EURASIP Journal on Wireless Communications and Networking | 2011

State of the art baseband DSP platforms for Software Defined Radio: A survey

Omer Anjum; Tapani Ahonen; Fabio Garzia; Jari Nurmi; Claudio Brunelli; Heikki Berg

Software Defined Radio (SDR) is an innovative approach that is becoming an increasingly promising technology for future mobile handsets. Several proposals in the field of embedded systems have been introduced by different universities and industries to support SDR applications. This article presents an overview of current platforms and analyzes the related architectural choices, the current issues in SDR, as well as potential future trends.

International Symposium on System-on-Chip | 2008

Analyzing models of computation for software defined radio applications

Heikki Berg; Claudio Brunelli; Ulf Lücking

Applying design principles and methodologies established in the software domain, and adapting them to the complete execution environment, provides new perspectives for future multi-radio computers. In order to share the underlying hardware resources efficiently, the overall system architecture and related programming model have to support dynamic behavior and extensive changes in the configuration at run time. The requirements for such a multi-radio computer are demanding, as there will be various radio access stacks with inhomogeneous characteristics executing in parallel. This implies a configuration and control framework, besides the different protocol stacks, that is aware of the managed system in every state and is capable of dynamically scheduling different dataflow graphs corresponding to the applications running on the underlying system. This paper presents the main concepts behind such a reactive system, focusing in particular on the proposed model of computation, and gives an overview of the software architecture and the related problems to be solved.

Signal Processing Systems | 2009

Approximating sine functions using variable-precision Taylor polynomials

Claudio Brunelli; Heikki Berg; David Guevorkian

Sine is one of the fundamental mathematical functions and is widely used in a number of application fields. In particular, signal processing and telecommunications need to calculate the sine and cosine of numerical values for several different purposes. One of the challenges in implementing sine calculation in Digital Signal Processing (DSP) has been the method used to calculate it by means of rational functions, which would allow the implementation of sine calculation in a digital computer system. One possibility is to exploit Taylor polynomials, even though their main drawback is a relatively high degree (and thus computational load) already for relatively low-precision approximations. This paper proposes a variable-precision method that allows approximating sine and cosine functions with Taylor polynomials while significantly reducing the computational load required. Our analysis shows that the method achieves the same accuracy as other approximation methods at a lower computational cost.
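The abstract does not spell out the method, so the following Python sketch only illustrates the general idea of tying polynomial degree to the requested precision; it is not the authors' algorithm. It assumes the argument is first reduced to [-π/2, π/2] and picks the number of terms from the standard Taylor remainder bound. The names `terms_needed`, `taylor_sin` and `sin_approx` are hypothetical.

```python
import math


def terms_needed(x_max, tol):
    """Smallest number of nonzero Taylor terms whose remainder bound
    |x|^(2n+1) / (2n+1)! is at most `tol` for all |x| <= x_max."""
    n = 1
    while x_max ** (2 * n + 1) / math.factorial(2 * n + 1) > tol:
        n += 1
    return n


def taylor_sin(x, terms):
    """Evaluate the first `terms` nonzero terms of the sine Taylor series
    using Horner's scheme in x^2."""
    x2 = x * x
    acc = 0.0
    for k in range(terms - 1, -1, -1):
        coeff = (-1) ** k / math.factorial(2 * k + 1)
        acc = coeff + x2 * acc
    return x * acc


def sin_approx(x, tol):
    """Approximate sin(x) with a polynomial degree chosen from `tol`."""
    # Reduce to [-pi, pi], then fold into [-pi/2, pi/2] using
    # sin(pi - r) = sin(r) and sin(-pi - r) = sin(r).
    r = math.remainder(x, 2 * math.pi)
    if r > math.pi / 2:
        r = math.pi - r
    elif r < -math.pi / 2:
        r = -math.pi - r
    return taylor_sin(r, terms_needed(math.pi / 2, tol))
```

A looser tolerance yields a lower-degree polynomial and fewer multiplications, which is the trade-off the abstract refers to as variable precision.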

International Conference on Wireless Communications and Mobile Computing | 2013

A 122Mb/s Turbo decoder using a mid-range GPU

Jiao Xianjun; Chen Canfeng; Pekka Jääskeläinen; Vladimír Guzma; Heikki Berg

Parallel implementations of Turbo decoding have been studied extensively. Traditionally, the number of parallel sub-decoders is limited to maintain an acceptable code block error rate performance loss caused by the edge effect of code block division. In addition, the sub-decoders require synchronization to exchange information in the iterative process. In this paper, we propose loosening the synchronization between the sub-decoders to achieve higher utilization of parallel processor resources. Our method allows a high degree of parallel processor utilization in decoding a single code block, providing a scalable software-based implementation. The proposed implementation is demonstrated using a graphics processing unit. We achieve 122.8 Mbps decoding throughput using a mid-range GPU, the Nvidia GTX480. This is, to the best of our knowledge, the fastest Turbo decoding throughput achieved with a GPU-based implementation.

Microprocessors and Microsystems | 2002

Component-based development of DSP software for mobile communication terminals

Kari Jyrkkä; Olli Silven; Olli Ali-Yrkko; Ryan Heidari; Heikki Berg

DSP software development has been tied down by extreme computational requirements. Furthermore, the DSP development tools available today are less advanced than in other embedded software design. This has led to DSP software architectures that have not taken into account future expansion needs. Therefore, DSP software architectures have been inherently closed. Now, as system complexity increases, this design methodology becomes more of a burden, since it does not support component-based DSP software development that requires open interfaces. In this paper, mobile-communications DSP software architectures are studied as cases, and key areas for improvements towards more open DSP software development are identified. Proposed solutions are judged against the limited resources of mobile communication terminals and the characteristics of communication DSPs.

International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2014

Variable length instruction compression on Transport Triggered Architectures

Janne Helkala; Timo Viitanen; Heikki Kultala; Pekka Jääskeläinen; Jarmo Takala; Tommi Zetterman; Heikki Berg

The memories used for embedded microprocessor devices consume a large portion of the system’s power. The power dissipation of the instruction memory can be reduced by using code compression methods, which may require the use of variable length instruction formats in the processor. The power-efficient design of variable length instruction fetch and decode is challenging for static multiple-issue processors, which aim for low power consumption on embedded platforms. The memory-side power savings using compression are easily lost on inefficient fetch unit design. We propose an implementation for instruction template-based compression and two instruction fetch alternatives for variable length instruction encoding on transport triggered architecture, a static multiple-issue exposed data path architecture. With applications from the CHStone benchmark suite, the compression approach reaches an average compression ratio of 44% at best. We show that the variable length fetch designs reduce the number of memory accesses and often allow the use of a smaller memory component. The proposed compression scheme reduced the energy consumption of synthesized benchmark processors by 15% and area by 33% on average.

International Conference on Acoustics, Speech, and Signal Processing | 2014

A high throughput LDPC decoder using a mid-range GPU

Xie Wen; Jiao Xianjun; Pekka Jääskeläinen; Heikki Kultala; Chen Canfeng; Heikki Berg; Bie Zhisong

This paper implements an LDPC decoder on a mid-range GPU with throughput approaching the maximum rate of the standard. The Turbo-Decoding Message-Passing algorithm is applied to achieve high throughput. Unlike the traditional host-managed multi-stream approach to hiding host-device transfer delay, we use a kernel-maintained data transfer scheme to achieve implicit data transfer between host memory and device shared memory, which eliminates an intermediate stage of global memory. Data type optimization, memory access optimization, and a low-complexity Soft-In Soft-Out algorithm are also used to improve efficiency. With these optimizations, the 802.11n LDPC decoder on the NVIDIA GTX480 GPU, released in 2010 with the Fermi architecture, achieves a throughput of 295 Mb/s when decoding 512 codewords simultaneously, which is close to the highest bit rate of 300 Mb/s with 20 MHz bandwidth in the 802.11n standard. Decoding 1024 and 4096 codewords achieves 330 and 365 Mb/s, respectively. An 802.16e LDPC decoder is also implemented; it achieves throughputs of 374 Mb/s (512 codewords), 435 Mb/s (1024 codewords) and 507 Mb/s (4096 codewords).

International Conference on Wireless Communications and Mobile Computing | 2013

Turbo decoding on tailored OpenCL processor

Heikki Kultala; Otto Esko; Pekka Jääskeläinen; Vladimír Guzma; Jarmo Takala; Jiao Xianjun; Tommi Zetterman; Heikki Berg

Turbo coding is commonly used in current wireless standards such as 3G and 4G. However, due to the high computational requirements, its software-defined implementation is challenging. This paper proposes a static multi-issue exposed datapath processor design tailored for turbo decoding. In order to utilize the parallel processor datapath efficiently without resorting to low-level assembly programming, the turbo decoder is implemented using OpenCL, a parallel programming standard for heterogeneous devices. The proposed implementation includes only a small set of Turbo-specific custom operations to accelerate the most critical parts of the algorithm. Most of the computation is performed using general-purpose integer operations. Thus, the processor design can be used as a general-purpose OpenCL accelerator for arbitrary integer workloads as well. The proposed processor design was evaluated both by implementing it using a Xilinx Virtex 6 FPGA and by ASIC synthesis using 130 nm and 40 nm technology libraries. The implementation achieves over 63 Mbps Turbo decoding throughput on a single low-power core. According to the ASIC synthesis, the maximum operating clock frequency is 344 MHz with the 130 nm library and 1050 MHz with the 40 nm library.
