Vladimír Guzma
Tampere University of Technology
Publications
Featured research published by Vladimír Guzma.
international conference / workshop on embedded computer systems: architectures, modeling and simulation | 2008
Vladimír Guzma; Pekka Jääskeläinen; Pertti Kellomäki; Jarmo Takala
Software bypassing is a technique that allows programmer-controlled direct transfer of the results of computations to the operands of data-dependent operations, possibly removing the need to store some values in general-purpose registers and reducing the number of reads from the register file. Software bypassing also improves instruction-level parallelism by reducing the number of false dependencies between operations caused by the reuse of registers. In this work we show how software bypassing affects cycle count and reduces register file reads and writes. We analyze previous register file bypassing methods and compare them with our improved software bypassing implementation. In addition, we propose heuristics for deciding when not to apply software bypassing, so as to retain scheduling freedom when selecting function units for operations. The results show up to 27% improvement in cycle count, as well as up to 48% fewer register reads and 45% fewer register writes with the use of bypassing.
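As an illustration of the idea, here is a minimal Python sketch of a bypassing pass over a toy move-based intermediate representation. All names (Move, bypass) are hypothetical, and the pass ignores register redefinitions and result-lifetime constraints that a real scheduler must honor:

```python
from dataclasses import dataclass

@dataclass
class Move:
    cycle: int
    src: str   # producer: an FU output port such as "ALU.out", or a register "R3"
    dst: str   # consumer: an FU input port such as "MUL.in1", or a register

def bypass(moves, max_distance=2):
    """Rewire single-use register values: if a value written to a register
    is read exactly once, soon after the write, route it directly from the
    producing FU port and drop the now-dead register write."""
    dead = set()
    for i, wr in enumerate(moves):
        if not wr.dst.startswith("R"):
            continue                        # only writes to GP registers
        reads = [m for m in moves if m.src == wr.dst and m.cycle > wr.cycle]
        if len(reads) == 1 and reads[0].cycle - wr.cycle <= max_distance:
            reads[0].src = wr.src           # bypass: read the FU port directly
            dead.add(i)                     # the register write becomes dead code
    return [m for i, m in enumerate(moves) if i not in dead]

before = [Move(0, "ALU.out", "R3"), Move(1, "R3", "MUL.in1")]
print(bypass(before))   # [Move(cycle=1, src='ALU.out', dst='MUL.in1')]
```

Both a register write and a register read disappear, which is exactly the reduction in register file traffic the abstract reports.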
signal processing systems | 2009
Vladimír Guzma; Teemu Pitkänen; Pertti Kellomäki; Jarmo Takala
The purpose of embedded computing is to transform input data into an output format. The functionality required to achieve this goal is therefore a combination of operation executions on computing units and data transfers between those units. To avoid memory bottlenecks, processors use register files to store data during computation.
international conference on wireless communications and mobile computing | 2013
Jiao Xianjun; Chen Canfeng; Pekka Jääskeläinen; Vladimír Guzma; Heikki Berg
Parallel implementations of Turbo decoding have been studied extensively. Traditionally, the number of parallel sub-decoders is limited in order to keep acceptable the code block error rate loss caused by the edge effect of code block division. In addition, the sub-decoders require synchronization to exchange information in the iterative process. In this paper, we propose loosening the synchronization between the sub-decoders to achieve higher utilization of parallel processor resources. Our method allows a high degree of parallel processor utilization when decoding a single code block, providing a scalable software-based implementation. The proposed implementation is demonstrated on a graphics processing unit. We achieve 122.8 Mbps decoding throughput on a mid-range GPU, the NVIDIA GTX 480. This is, to the best of our knowledge, the fastest Turbo decoding throughput achieved with a GPU-based implementation.
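The loop structure of the approach can be sketched as follows (a serial Python stand-in, not the paper's GPU code; toy_sweep is a placeholder for a real max-log-MAP sweep). The key point is that each sub-decoder consumes whatever boundary metrics its neighbors have produced so far, instead of waiting at a per-iteration barrier:

```python
import numpy as np

def toy_sweep(llr_sub, left, right):
    """Stand-in for one max-log-MAP sweep over a sub-block; a real
    decoder computes forward/backward state metrics here."""
    ext = 0.5 * llr_sub + 0.1 * (left + right)
    return ext, float(ext[0]), float(ext[-1])

def decode_block(llr, P=8, iters=6, sweep=toy_sweep):
    n = len(llr) // P
    boundary = np.zeros(P + 1)          # metrics shared at sub-block edges
    extrinsic = np.zeros_like(llr)
    for _ in range(iters):
        for p in range(P):              # concurrent on the GPU; serial here
            lo, hi = p * n, (p + 1) * n
            # each sub-decoder reads whatever boundary values are currently
            # available -- possibly from an older iteration -- rather than
            # synchronizing with its neighbours first
            extrinsic[lo:hi], boundary[p], boundary[p + 1] = sweep(
                llr[lo:hi], boundary[p], boundary[p + 1])
    return extrinsic

print(decode_block(np.random.randn(64))[:4])
```

Tolerating slightly stale boundary information is what lets all P sub-decoders stay busy, at the cost of the convergence behavior the paper evaluates.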
international conference on wireless communications and mobile computing | 2013
Heikki Kultala; Otto Esko; Pekka Jääskeläinen; Vladimír Guzma; Jarmo Takala; Jiao Xianjun; Tommi Zetterman; Heikki Berg
Turbo coding is commonly used in current wireless standards such as 3G and 4G. However, due to its high computational requirements, a software-defined implementation is challenging. This paper proposes a static multi-issue exposed-datapath processor design tailored for Turbo decoding. In order to utilize the parallel processor datapath efficiently without resorting to low-level assembly programming, the Turbo decoder is implemented in OpenCL, a parallel programming standard for heterogeneous devices. The proposed implementation includes only a small set of Turbo-specific custom operations to accelerate the most critical parts of the algorithm; most of the computation is performed using general-purpose integer operations. Thus, the processor design can also be used as a general-purpose OpenCL accelerator for arbitrary integer workloads. The proposed processor design was evaluated both by implementing it on a Xilinx Virtex-6 FPGA and by ASIC synthesis using 130 nm and 40 nm technology libraries. The implementation achieves over 63 Mbps Turbo decoding throughput on a single low-power core. According to the ASIC synthesis, the maximum operating clock frequency is 344 MHz with the 130 nm library and 1050 MHz with the 40 nm library.
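The abstract does not enumerate the custom operations; a typical candidate in Turbo decoders is the max* (Jacobian logarithm) primitive and its max-log approximation, sketched below in Python purely for illustration:

```python
import math

def max_star(a, b):
    """Exact Jacobian logarithm: ln(e^a + e^b), the core recurrence of
    log-MAP state metric updates."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_star_maxlog(a, b):
    """Max-log approximation: drops the correction term, leaving the
    kind of cheap compare-select a custom FU can do in one cycle."""
    return max(a, b)

print(max_star(1.0, 1.2), max_star_maxlog(1.0, 1.2))
```

Primitives of this shape dominate the inner loops of the decoder, which is why hardening a handful of them pays off while the rest of the code remains general-purpose integer arithmetic.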
EURASIP Journal on Embedded Systems | 2013
Vladimír Guzma; Teemu Pitkänen; Jarmo Takala
In the design of embedded systems, hardware and software need to be co-explored to meet performance and energy targets. With application-specific instruction-set processors, whether as stand-alone solutions or as parts of a system on chip, customizing the processor for a particular application is a known method of reducing energy requirements while providing performance. In particular, processor designs with exposed datapaths trade compile-time complexity for simplified control hardware and lower running costs. An exposed datapath also allows unused components of the interconnection network to be removed once the application is compiled.

In this paper, we propose the use of a compiler technique for processors with exposed datapaths called software bypassing. Software bypassing allows the compiler to schedule data transfers between execution units directly, bypassing the general-purpose register file. This increases scheduling freedom by reducing the dependencies induced by register reuse, decreases the number of read and write accesses to register files, and allows register files with fewer read and write ports, while maintaining or improving performance and preserving reprogrammability. We compare our proposal against an architecture exploration technique, connectivity reduction, which identifies the interconnection network components actually used by the compiled application and removes the rest, yielding an energy-efficient application-specific instruction-set processor.

We observe that software bypassing improves application speed, with the architectures having the fewest register file ports consistently outperforming those with more ports, and reduces energy consumption. In contrast, connectivity reduction maintains the same application speed, reduces energy consumption, and allows an increase in processor clock frequency; however, once the clock frequency is raised to match the performance of software bypassing, energy consumption grows. We also observe that when reprogrammability is not a concern, the most energy-efficient solution is a combination of software bypassing and connectivity reduction.
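Connectivity reduction can be illustrated with a small sketch (hypothetical data model, not the actual exploration tool): scan the scheduled program for the bus-to-port connections it uses and prune the rest of the interconnection network.

```python
# Toy sketch of connectivity reduction: keep only the interconnect
# connections the compiled application actually exercises.

def reduce_connectivity(machine_connections, scheduled_moves):
    """machine_connections: set of (bus, port) pairs the processor has.
    scheduled_moves: iterable of (bus, src_port, dst_port) used by the
    compiled application. Returns the pruned connection set."""
    used = set()
    for bus, src, dst in scheduled_moves:
        used.add((bus, src))
        used.add((bus, dst))
    return {c for c in machine_connections if c in used}

full = {("B0", "ALU.out"), ("B0", "RF.in"), ("B1", "ALU.out"), ("B1", "LSU.in")}
program = [("B0", "ALU.out", "RF.in")]
print(reduce_connectivity(full, program))  # only B0's two connections survive
```

This also makes the trade-off visible: the pruned machine is cheaper, but it can only run programs scheduled for the surviving connections, which is why the paper contrasts it with the reprogrammability-preserving software bypassing.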
international symposium on system-on-chip | 2011
Vladimír Guzma; Teemu Pitkänen; Jarmo Takala
In embedded systems, instruction memories are among the critical components, consuming significant amounts of energy. The relation between compiler optimization flags and the size of the compiled program, and consequently the required size of the instruction memory, is well known. In particular, loop transformations such as loop unrolling, while having the potential to increase performance dramatically, often cause unreasonable growth in the required instruction memory, so that the benefit of a lower cycle count is lost from the overall system energy point of view.
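A first-order model makes the trade-off concrete (illustrative Python with made-up coefficients, not figures from the paper): unrolling reduces the cycle count, but the larger program needs a bigger instruction memory whose per-fetch energy is higher.

```python
# Hypothetical model: instruction memory energy ~ fetches x per-fetch
# cost, where per-fetch cost grows with the memory size that must hold
# the program. All numbers below are invented for illustration.

def imem_energy(cycles, code_words, e_fetch_per_kword=1.0):
    size_kwords = code_words / 1024
    # floor stands in for a minimum practical memory size
    return cycles * e_fetch_per_kword * max(size_kwords, 0.25)

rolled   = imem_energy(cycles=100_000, code_words=512)
unrolled = imem_energy(cycles=70_000,  code_words=4096)   # 4x unroll
print(rolled, unrolled)   # 50000.0 vs 280000.0: fewer fetches, but each
                          # from a bigger, costlier memory
```

Under such a model, a 30% cycle-count win is easily swamped by the energy cost of the enlarged instruction memory, which is the effect the abstract describes.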
international symposium on system-on-chip | 2010
Vladimír Guzma; Teemu Pitkänen; Jarmo Takala
The use of instruction buffers (also called repeat buffers) and caches is a common way to avoid the memory speed bottleneck in the presence of memory hierarchies. Once an instruction resides in a cache or a buffer, repeated execution of the same instruction does not require a separate memory access and possible cache miss. Instruction buffers also offer an advantage when low energy consumption is a concern: reading an instruction from the buffer requires an order of magnitude less energy than a fetch from instruction memory, and keeping the memory in deselect mode while fetching from the buffer takes roughly half the power of reading from the memory. In this work, we analyze the effects of adding an instruction buffer to an existing ASIP architecture. We analyze the already generated code of an application to find the most frequently executed loops, and augment instructions with instruction buffer control information. We show that for many embedded applications, storing execution kernels in the instruction buffer saves between 60% and 87% of instruction memory fetches, even with the most trivial loops. These savings can translate into up to a 47% reduction in memory energy.
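A back-of-the-envelope Python sketch, using only the ratios quoted above (a buffer hit with the memory deselected costs roughly half of a memory read; absolute units are arbitrary), shows how an 87% buffer hit rate approaches the reported energy saving:

```python
def fetch_energy(total_instr, buffered_fraction,
                 e_mem=1.0, e_buf_plus_deselect=0.5):
    """Instructions served from the buffer cost roughly half of a memory
    read (buffer read + memory held in deselect mode), per the abstract."""
    from_buf = total_instr * buffered_fraction
    from_mem = total_instr - from_buf
    return from_mem * e_mem + from_buf * e_buf_plus_deselect

print(fetch_energy(1_000_000, 0.87))  # 565000.0 -> ~43.5% saved
print(fetch_energy(1_000_000, 0.0))   # 1000000.0 baseline, no buffer
```

With the best-case 87% of fetches served from the buffer, this simple model lands close to the up-to-47% memory energy reduction the abstract reports.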
applied sciences on biomedical and communication technologies | 2008
Vladimír Guzma; Shuvra S. Bhattacharyya; Pertti Kellomäki; Jarmo Takala
Application-specific instruction-set processors (ASIPs) allow designers to optimize the architecture of an embedded processor to meet the specific demands of a particular application. A complementary form of customization is provided by domain-specific models of computation (MoCs), which can expose high-level application structure useful for various kinds of optimizing design transformations. One such MoC is synchronous dataflow (SDF), which is used increasingly in the design and implementation of signal processing applications. In this paper, we develop an integration of SDF- and ASIP-oriented design flows, and use this integrated design flow to explore trade-offs in the space of hardware/software implementations. We also explore an approach to ASIP implementation in terms of “critical” and “non-critical” applications, which allows designers to tune the degree of specialization for a targeted ASIP. Our results show that a single ASIP tuned for a pair of critical applications saves 26% to 50% of the area required to implement the two applications on separate ASIPs, while non-critical applications run on such a processor with at worst 4.5% overhead for our selection of benchmarks.
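The SDF side of such a flow rests on the standard balance equations: for every edge, the tokens produced per graph iteration must equal the tokens consumed. A small Python sketch (standard dataflow theory, not the paper's toolchain) computes the minimal repetitions vector:

```python
from fractions import Fraction
from math import lcm

def repetitions(actors, edges):
    """edges: (src, prod_rate, dst, cons_rate). Returns the minimal
    integer firing counts for one graph iteration; assumes a connected,
    consistent SDF graph (no consistency check for brevity)."""
    rate = {actors[0]: Fraction(1)}
    frontier = [actors[0]]
    while frontier:
        a = frontier.pop()
        for src, p, dst, c in edges:
            if src == a and dst not in rate:
                rate[dst] = rate[a] * p / c     # r[dst]*c == r[src]*p
                frontier.append(dst)
            elif dst == a and src not in rate:
                rate[src] = rate[a] * c / p
                frontier.append(src)
    scale = lcm(*(f.denominator for f in rate.values()))
    return {a: int(f * scale) for a, f in rate.items()}

# A produces 2 tokens per firing, B consumes 3: A fires 3x, B fires 2x.
print(repetitions(["A", "B"], [("A", 2, "B", 3)]))  # {'A': 3, 'B': 2}
```

The repetitions vector is what exposes the static schedule structure that an ASIP-oriented flow can then exploit when specializing the processor.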
international conference on embedded computer systems: architectures, modeling, and simulation | 2011
Vladimír Guzma; Teemu Pitkänen; Jarmo Takala
In this work, we present a minimalistic, energy-efficient implementation of an instruction buffer. We use loop detection and execution trace analysis to find the most frequently executed loops in an already scheduled application and tailor the instruction buffer size to the size of the most frequently executed loop(s). In addition to our previous work, we allow buffering of loops with limited control flow (early exit from the loop or early return to its beginning). We also show how analysis of loop nests can decrease the number of times a loop body is copied from memory into the buffer. Our results show that, given a favorable loop nest, all but the initial loop iterations can be executed from the instruction buffer, keeping the instruction memory in deselect mode.
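Trace-based loop detection can be sketched simply (hypothetical trace format, not the actual analysis tool): a taken backward branch in the executed-address stream marks a loop back-edge, and counting back-edge firings identifies the kernels worth buffering.

```python
from collections import Counter

def hot_loops(trace):
    """trace: sequence of executed instruction addresses. Returns
    (loop_start, loop_end) back-edges ordered by execution count;
    loop_end - loop_start + 1 bounds the buffer size the loop needs."""
    edges = Counter()
    for prev, cur in zip(trace, trace[1:]):
        if cur < prev:                    # backward jump: loop back-edge
            edges[(cur, prev)] += 1
    return edges.most_common()

# a 3-instruction loop at addresses 4..6, iterated 4 times, inside
# otherwise straight-line code
trace = [0, 1, 2, 3] + [4, 5, 6] * 4 + [7, 8]
print(hot_loops(trace))   # [((4, 6), 3)] -- the back-edge fires 3 times
```

Sizing the buffer to the hottest back-edge's span is the "tailoring" step; the paper's loop-nest analysis then decides when an outer loop lets the body stay resident instead of being re-copied.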
Journal of Systems Architecture | 2008
Pekka Jääskeläinen; Vladimír Guzma; Viljami Korhonen
Processor simulators are important parts of processor design toolsets, where they are used to verify and evaluate the properties of the designed processors. When simulating architectures with independent function unit pipelines using techniques that avoid the overhead of instruction bit-string interpretation, such as compiled simulation, the simulation of function unit pipelines can become one of the new bottlenecks for simulation speed. This paper evaluates several resource conflict detection models, commonly used in compiler instruction scheduling, in the context of function unit pipeline simulation. The evaluated models include the conventional reservation-table-based model, the dynamic collision matrix model, and a finite state automaton (FSA)-based model. In addition, an improvement to simulation initialization time by means of lazy initialization of states in the FSA-based approach is proposed. The resulting model is faster to initialize and provides simulation speed comparable to the actively initialized FSA.
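The lazy-initialization idea can be sketched in Python (illustrative, not the simulator's code): a state records the pending resource reservations, and its successor is materialized only the first time the state is visited, rather than enumerating the whole automaton up front.

```python
class LazyConflictFSA:
    def __init__(self, table):
        # table[i]: bitmask of resources an operation uses i cycles after
        # issue, e.g. (0b01, 0b10, 0b10) = stage A, then stage B twice
        self.table = tuple(table)
        self.cache = {}            # state -> successor state after issue

    def can_issue(self, state):
        """No structural hazard iff the new reservations do not overlap
        any reservation still pending in the current state."""
        return all((s & t) == 0 for s, t in zip(state, self.table))

    def after_issue(self, state):
        if state not in self.cache:          # lazy: build on first visit
            n = max(len(state), len(self.table))
            s = state + (0,) * (n - len(state))
            t = self.table + (0,) * (n - len(self.table))
            merged = tuple(a | b for a, b in zip(s, t))
            self.cache[state] = merged[1:]   # merge, then advance a cycle
        return self.cache[state]

fsa = LazyConflictFSA((0b01, 0b10, 0b10))
state = fsa.after_issue(())          # issue one operation from the idle state
print(fsa.can_issue(state))          # False: stage B is still busy next cycle
```

Only the states a particular program actually reaches are ever constructed, which is what trades the FSA's long eager initialization for a near-identical steady-state simulation speed.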