Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Nikolaos Bellas is active.

Publication


Featured research published by Nikolaos Bellas.


field-programmable custom computing machines | 2011

Synthesis of Platform Architectures from OpenCL Programs

Muhsen Owaida; Nikolaos Bellas; Konstantis Daloukas; Christos D. Antonopoulos

The problem of automatically generating hardware modules from a high-level representation of an application has been at the research forefront in the last few years. In this paper, we use OpenCL, an industry-supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our architectural synthesis tool, SOpenCL (Silicon-OpenCL), adapts OpenCL into a novel hardware design flow which efficiently maps the coarse- and fine-grained parallelism of an application onto an FPGA reconfigurable fabric. SOpenCL is based on a source-to-source code transformation step that coarsens the OpenCL fine-grained parallelism into a series of nested loops, and on a template-based hardware generation back-end that configures the accelerator based on the functionality and the application's performance and area requirements. Our experimentation with a variety of OpenCL and C kernel benchmarks reveals that area-, throughput- and frequency-optimized hardware implementations are attainable using SOpenCL.
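The coarsening step described above can be sketched in a few lines. This is a hypothetical illustration of work-item coarsening in Python, not SOpenCL's actual transformation: the body of a fine-grained, per-work-item kernel is wrapped in an explicit loop over the NDRange, producing the serial loop nest a hardware back-end can pipeline. All names are illustrative.

```python
def vec_add_kernel(gid, a, b, c):
    """Per-work-item OpenCL-style kernel body: c[gid] = a[gid] + b[gid]."""
    c[gid] = a[gid] + b[gid]

def coarsened_vec_add(a, b, c, global_size):
    """Coarsened form: the NDRange becomes an explicit loop,
    exposing a serial schedule a template back-end can pipeline."""
    for gid in range(global_size):
        vec_add_kernel(gid, a, b, c)

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
c = [0] * 4
coarsened_vec_add(a, b, c, 4)
```

In a real flow the loop nest would also cover work-group dimensions, but the shape of the transformation is the same.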


international symposium on low power electronics and design | 1999

Using dynamic cache management techniques to reduce energy in a high-performance processor

Nikolaos Bellas; Ibrahim N. Hajj; Constantine D. Polychronopoulos

In this paper, we propose a technique that uses an additional mini cache, the L0-Cache, located between the instruction cache (I-Cache) and the CPU core. This mechanism can provide the instruction stream to the data path and, when managed properly, can effectively eliminate the need for high utilization of the more expensive I-Cache. In this work, we propose, implement, and evaluate a series of run-time techniques for dynamic analysis of the program's instruction access behavior, which are then used to proactively guide the accesses to the L0-Cache. The basic idea is that only the most frequently executed portions of the code should be stored in the L0-Cache, since this is where the program spends most of its time. We present experimental results to evaluate the effectiveness of our scheme in terms of performance and energy dissipation for a series of SPEC95 benchmarks. We also discuss the performance and energy tradeoffs that are involved in these dynamic schemes.
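The dynamic-selection idea can be made concrete with a toy model (this is an illustrative Python sketch, not the paper's exact mechanism): execution counts identify hot basic blocks at run time, and only blocks that have proven themselves hot are served from the small L0-Cache, so cold code never displaces them.

```python
from collections import Counter

def simulate_l0(trace, hot_threshold):
    """Count block executions; serve 'hot' blocks from the L0-Cache."""
    counts = Counter()
    l0_hits = l1_accesses = 0
    for block in trace:
        counts[block] += 1
        if counts[block] > hot_threshold:   # block has proven itself hot
            l0_hits += 1                    # served by the L0-Cache
        else:
            l1_accesses += 1                # falls through to the I-Cache
    return l0_hits, l1_accesses

# A loopy trace: block "A" dominates execution time.
trace = ["A"] * 90 + ["B", "C"] * 5
hits, misses = simulate_l0(trace, hot_threshold=3)
```

Because instruction streams are dominated by a few hot loops, even this crude policy captures most fetches in the small cache.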


international conference on computer design | 1999

Energy and performance improvements in microprocessor design using a loop cache

Nikolaos Bellas; Ibrahim N. Hajj; Constantine D. Polychronopoulos; George Stamoulis

Energy dissipated in on-chip caches represents a substantial portion of the energy budget of today's processors. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly larger percentage of the total area of the chip. We extend the work proposed by J. Kin et al. (1997), in which an extra, small cache (called a filter cache) is inserted between the CPU data path and the L1 cache and serves to filter most of the references initiated from the CPU. In our scheme, the compiler is used to generate code that exploits the new memory hierarchy and reduces the possibility of a miss in the extra cache. Experimental results across a wide range of SPEC95 benchmarks show that this cache, which we call the L-Cache, has a small performance overhead with respect to the scheme without any extra caches, and provides substantial energy savings. The L-Cache is placed between the CPU and the I-Cache; the D-Cache subsystem is not modified. Since the L-Cache is much smaller, and thus has a smaller access time, than the I-Cache, this scheme can also be used for performance improvements provided that the hit rate in the L-Cache is very high. In our experimental results, we show that the L-Cache does indeed improve performance in some cases.
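The compiler-driven side of this scheme can be sketched as a placement problem (an illustrative Python example, not the paper's algorithm): given profile counts, pick the most frequently executed blocks that fit in the small L-Cache, so that run-time misses in the extra cache become unlikely. All block names and numbers below are made up.

```python
def select_for_lcache(blocks, capacity):
    """blocks: dict name -> (size_in_words, exec_count).
    Greedy selection by executions per word of capacity used."""
    ranked = sorted(blocks.items(),
                    key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
    chosen, used = [], 0
    for name, (size, _count) in ranked:
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen

# Hypothetical profile: the loop body dominates execution.
profile = {"loop_body": (16, 100000), "init": (32, 1), "error_path": (8, 3)}
resident = select_for_lcache(profile, capacity=24)
```

Only the hot loop body and a tiny cold block fit; the large, rarely executed initialization code is deliberately left to the I-Cache.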


international parallel and distributed processing symposium | 2006

FPGA implementation of a license plate recognition SoC using automatically generated streaming accelerators

Nikolaos Bellas; Sek M. Chai; Malcolm Dwyer; Dan Linzmeier

Modern FPGA platforms provide the hardware and software infrastructure for building a bus-based system on chip (SoC) that meets the application's requirements. The designer can customize the hardware by selecting from a large number of pre-defined peripherals and fixed IP functions and by providing new hardware, typically expressed in RTL. Hardware accelerators that provide application-specific extensions to the computational capabilities of a system are an efficient mechanism to enhance performance and reduce power dissipation. What is missing is an integrated approach to identify the computationally critical parts of the application and to create accelerators starting from a high-level representation with minimal design effort. In this paper, we present an automation methodology and a tool that generates such accelerators. We apply the methodology to an FPGA-based license plate recognition (LPR) system used in law enforcement. The accelerators process streaming data and support a programming model which can naturally express a large number of embedded applications, resulting in efficient hardware implementations. We show that we can achieve an overall LPR application speedup of 1.2× to 2.6×, thus enabling real-time functionality under realistic road scenes.
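The streaming programming model mentioned above can be sketched with Python generators (purely illustrative, not the paper's language or its LPR pipeline): each stage consumes an input stream and produces an output stream, which is exactly the shape of computation that generated accelerators can pipeline in hardware.

```python
def grayscale(pixels):          # stage 1: RGB -> integer luma approximation
    for r, g, b in pixels:
        yield (r * 30 + g * 59 + b * 11) // 100

def threshold(stream, cutoff):  # stage 2: binarize, e.g. before plate OCR
    for v in stream:
        yield 1 if v >= cutoff else 0

# A tiny hypothetical "frame" of three RGB pixels, streamed end to end.
frame = [(200, 200, 200), (10, 10, 10), (255, 255, 255)]
binary = list(threshold(grayscale(frame), cutoff=128))
```

Because each stage touches each element once and in order, a hardware back-end can map the stages to pipelined units with FIFOs between them.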


system level interconnect prediction | 2000

Using dynamic cache management techniques to reduce energy in general purpose processors

Nikolaos Bellas; Ibrahim N. Hajj; Constantine D. Polychronopoulos

The memory hierarchy of high-performance and embedded processors has been shown to be one of the major energy consumers. For example, the Level-1 (L1) instruction cache (I-Cache) of the StrongARM processor accounts for 27% of the power dissipation of the whole chip, whereas the instruction fetch unit (IFU) and the I-Cache of Intel's Pentium Pro processor are the single most important power-consuming modules, with 14% of the total power dissipation [2]. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly larger percentage of the total area of the chip. In this paper, we propose a technique that uses an additional mini cache, the L0-Cache, located between the I-Cache and the CPU core. This mechanism can provide the instruction stream to the data path and, when managed properly, can effectively eliminate the need for high utilization of the more expensive I-Cache. We propose, implement, and evaluate five techniques for dynamic analysis of the program's instruction access behavior, which is then used to proactively guide the accesses to the L0-Cache. The basic idea is that only the most frequently executed portions of the code should be stored in the L0-Cache, since this is where the program spends most of its time. We present experimental results to evaluate the effectiveness of our scheme in terms of performance and energy dissipation for a series of SPEC95 benchmarks. We also discuss the performance and energy tradeoffs that are involved in these dynamic schemes. Results for these benchmarks indicate that more than 60% of the energy dissipated in the I-Cache subsystem can be saved.
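The energy accounting behind such savings can be shown with back-of-the-envelope arithmetic (the per-access energies below are made-up illustrative units, not measured values): every fetch probes the small L0-Cache, and only L0 misses pay the larger I-Cache's per-access energy.

```python
E_L0, E_L1 = 1.0, 10.0          # hypothetical energy units per access

def icache_subsystem_energy(fetches, l0_hit_rate):
    hits = fetches * l0_hit_rate
    misses = fetches - hits
    baseline = fetches * E_L1                 # no L0-Cache at all
    with_l0 = fetches * E_L0 + misses * E_L1  # L0 probed on every fetch
    return baseline, with_l0

base, l0 = icache_subsystem_energy(1_000_000, l0_hit_rate=0.8)
savings = 1 - l0 / base
```

With these illustrative numbers, an 80% L0 hit rate already saves 70% of the I-Cache subsystem energy, consistent in spirit with the "more than 60%" figure reported above.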


dependable systems and networks | 2014

GemFI: A Fault Injection Tool for Studying the Behavior of Applications on Unreliable Substrates

Konstantinos Parasyris; Georgios Tziantzoulis; Christos D. Antonopoulos; Nikolaos Bellas

Dependable computing on unreliable substrates is the next challenge the computing community needs to overcome, due both to manufacturing limitations at low geometries and to the necessity to aggressively minimize power consumption. System designers often need to analyze the way hardware faults manifest as errors at the architectural level and how these errors affect application correctness. This paper introduces GemFI, a fault injection tool based on the cycle-accurate full-system simulator Gem5. GemFI provides fault injection methods and is easily extensible to support future fault models. It also supports multiple processor models and ISAs and allows fault injection in both functional and cycle-accurate simulations. GemFI offers fast-forwarding of simulation campaigns via checkpointing. Moreover, it facilitates the parallel execution of campaign experiments on networks of workstations (NoWs). In order to validate and evaluate GemFI, we used it to apply fault injection on a series of real-world kernels and applications. The evaluation indicates that its overhead compared with Gem5 is minimal (up to 3.3%), whereas optimizations such as fast-forwarding via checkpointing and execution on NoWs can significantly reduce the simulation time of a fault injection campaign.
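A minimal software analogue of this fault model can convey the idea (illustrative Python only, not GemFI's mechanism, which injects into simulated hardware state): flip one random bit of an intermediate value during a computation and compare the result against a fault-free "golden" run.

```python
import random

def flip_random_bit(value, width=32, rng=random):
    """Single-bit-flip fault model: XOR one random bit position."""
    bit = rng.randrange(width)
    return value ^ (1 << bit)

def checksum(data, faulty_index=None, rng=random):
    total = 0
    for i, v in enumerate(data):
        if i == faulty_index:          # fault injected at this "cycle"
            v = flip_random_bit(v, rng=rng)
        total += v
    return total

golden = checksum(list(range(100)))
faulty = checksum(list(range(100)), faulty_index=42, rng=random.Random(0))
corrupted = faulty != golden
```

A campaign repeats this over many injection sites and seeds, which is why checkpoint-based fast-forwarding and parallel execution matter so much for simulation time.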


field-programmable custom computing machines | 2009

Real-Time Fisheye Lens Distortion Correction Using Automatically Generated Streaming Accelerators

Nikolaos Bellas; Sek M. Chai; Malcolm Dwyer; Dan Linzmeier

Fisheye lenses are often used in scientific or virtual reality applications to enlarge the field of view of a conventional camera. Fisheye lens distortion correction is an image processing application which transforms the distorted fisheye images back to the natural-looking perspective space. This application is characterized by non-linear streaming memory access patterns that make main memory bandwidth a key performance limiter. We have developed a fisheye lens distortion correction system on a custom board that includes a Xilinx Virtex-4 FPGA. We express the application in a high-level streaming language, and we utilize Proteus, an architectural synthesis tool, to quickly explore the design space and generate the streaming accelerator best suited for our cost and performance constraints. This paper shows that appropriate ESL tools enable rapid prototyping and design of real-life, performance-critical and cost-sensitive systems with complex memory access patterns and hardware-software interaction mechanisms.
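The inverse mapping at the heart of such correction can be sketched as follows, assuming the common equidistant lens model r_fish = f·θ (the paper's exact lens model and calibration parameters are not given here): for every output pixel in the corrected perspective image, compute which fisheye pixel to sample. This data-dependent lookup is exactly the non-linear memory access pattern the abstract refers to.

```python
import math

def perspective_to_fisheye(x, y, f):
    """(x, y) are offsets from the image center in the corrected image;
    returns the source offsets in the distorted fisheye image."""
    r_persp = math.hypot(x, y)
    if r_persp == 0:
        return 0.0, 0.0                 # center maps to center
    theta = math.atan2(r_persp, f)      # angle of the incoming ray
    r_fish = f * theta                  # equidistant lens model
    scale = r_fish / r_persp
    return x * scale, y * scale         # non-linear, data-dependent access

u, v = perspective_to_fisheye(100.0, 0.0, f=300.0)
```

Points far from the center are pulled inward (u < x here), which is why straight lines bowed by the lens become straight again after resampling.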


international conference on consumer electronics | 2010

Implementation of the AVS video decoder on a heterogeneous dual-core SIMD processor

Maria G. Koziri; Dimitrios Zacharis; Ioannis Katsavounidis; Nikolaos Bellas

Multi-core Application-Specific Instruction-set Processors (ASIPs) are increasingly used in multimedia applications due to their high performance and programmability. Nonetheless, their efficient use requires extensive modifications to the initial code in order to exploit the features of the underlying architecture. In this paper, through the example of implementing the Audio Video Standard (AVS) decoder on a heterogeneous dual-core SIMD processor, we present a guide for developers who wish to perform task-level decomposition of any video decoder on a multi-core SIMD system. Through the process of mapping the AVS video decoder to a dual-core SIMD processor, we aim to explore the different forms of parallelism inherent in a video application and exploit them to speed up AVS decoding in order to achieve real-time functionality. Simulation results showed that the extraction of parallelism at all levels of granularity, especially at the higher levels, can give a total speedup of more than 195× compared to a software x86-based implementation, which enables real-time, 25 fps decoding of D1 video.
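Task-level decomposition of a decoder can be illustrated in miniature (a hypothetical Python sketch, not the paper's AVS code): independent slices of a frame are decoded on separate cores, which is the coarse-grained, higher-level parallelism the abstract identifies as most profitable.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_slice(slice_bits):
    """Stand-in for per-slice work (entropy decode, IDCT, motion comp.);
    here just an invertible byte transform for illustration."""
    return [b ^ 0xFF for b in slice_bits]

def decode_frame(slices, workers=2):
    """Slices carry no cross-slice dependencies, so they can be
    decoded concurrently, one per core."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_slice, slices))

frame = [[0x00, 0x0F], [0xF0, 0xFF]]    # two hypothetical slices
decoded = decode_frame(frame)
```

Finer-grained parallelism (macroblock pipelines, SIMD over pixels) then nests inside each task, mirroring the levels of granularity discussed above.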


field-programmable custom computing machines | 2012

Shortening Design Time through Multiplatform Simulations with a Portable OpenCL Golden-model: The LDPC Decoder Case

Gabriel Falcao; Muhsen Owaida; David Novo; Madhura Purnaprajna; Nikolaos Bellas; Christos D. Antonopoulos; Georgios Karakonstantis; Andreas Burg; Paolo Ienne

Hardware designers and engineers typically need to explore a multi-parametric design space in order to find the best configuration for their designs, using simulations that can take weeks to months to complete. For example, designers of special-purpose chips need to explore parameters such as the optimal bit width and data representation. This is the case for the development of complex algorithms such as Low-Density Parity-Check (LDPC) decoders used in modern communication systems. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to graphics processing units (GPUs) and FPGAs. Depending on the simulation requirements, the ideal architecture to use can vary. In this paper we propose a new design flow based on OpenCL, a unified multiplatform programming model, which accelerates LDPC decoding simulations, thereby significantly reducing architectural exploration and design time. OpenCL-based parallel kernels are used without modifications or code tuning on multicore CPUs, GPUs and FPGAs. We use SOpenCL (Silicon-OpenCL), a tool that automatically converts OpenCL kernels to RTL, for mapping the simulations onto FPGAs. To the best of our knowledge, this is the first time that a single, unmodified OpenCL code is used to target these three different platforms. We show that, depending on the design parameters to be explored in the simulation and on the dimension and phase of the design, the GPU or the FPGA may suit different purposes more conveniently, providing different acceleration factors. For example, although simulations can typically execute more than 3× faster on FPGAs than on GPUs, the overhead of circuit synthesis often outweighs the benefits of FPGA-accelerated execution.
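One of the design parameters named above is the decoder's bit width; the exploration amounts to quantizing the floating-point golden model at several candidate fixed-point formats and picking the narrowest one whose error is acceptable. The sketch below is purely illustrative Python (saturating round-to-nearest fixed point), not the paper's LDPC code.

```python
def quantize(x, int_bits, frac_bits):
    """Saturating fixed-point rounding in a Q(int_bits.frac_bits) format."""
    scale = 1 << frac_bits
    lo = -(1 << (int_bits + frac_bits))
    hi = (1 << (int_bits + frac_bits)) - 1
    q = max(lo, min(hi, round(x * scale)))
    return q / scale

# Error of one LLR-like value across candidate fractional widths:
errors = {fb: abs(quantize(1.37, int_bits=3, frac_bits=fb) - 1.37)
          for fb in (1, 2, 4)}
```

In a real campaign this quantization would wrap the whole decoding simulation, which is why accelerating it on CPUs, GPUs, or FPGAs pays off.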


IEEE Transactions on Computers | 1996

Algorithm-based error-detection schemes for iterative solution of partial differential equations

Amber Roy-Chowdhury; Nikolaos Bellas; Prithviraj Banerjee

Algorithm-based fault tolerance is an inexpensive method of achieving fault tolerance without requiring any hardware modifications. For numerical applications involving the iterative solution of linear systems arising from discretization of various PDEs, there exist almost no fault-tolerant algorithms in the literature. We describe an error-detecting version of a parallel algorithm for iteratively solving the Laplace equation over a rectangular grid. This error-detecting algorithm is based on the popular successive overrelaxation scheme with red-black ordering. We use the Laplace equation merely as a vehicle for discussion; we show how to modify the algorithm to devise error-detecting iterative schemes for solving linear systems arising from discretizations of other PDEs, such as the Poisson equation and a variant of the Laplace equation with a mixed derivative term. We also discuss a modification of the basic scheme to handle situations where the underlying solution domain is not rectangular. We then discuss a somewhat different error-detecting algorithm for iterative solution of PDEs which can be expected to yield better error coverage. We also present a new way of dealing with the roundoff errors which complicate the check phase of algorithm-based schemes. Our approach is based on error analysis incorporating some simplifications and gives high fault coverage and no false alarms for a large variety of data sets. We report experimental results on the error coverage and performance overhead of our algorithm-based error-detection schemes on an Intel iPSC/2 hypercube multiprocessor.
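The general algorithm-based error-detection idea can be shown in its simplest classical form, a checksum-verified matrix-vector product (the paper adapts this style of check to red-black SOR sweeps with roundoff-aware tolerances, which is more involved than this illustrative Python sketch).

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def passes_check(A, x, y, tol=1e-9):
    """Column checksums of A applied to x must reproduce sum(y);
    a single corrupted entry of y breaks the identity."""
    check_row = [sum(col) for col in zip(*A)]
    return abs(sum(c * b for c, b in zip(check_row, x)) - sum(y)) <= tol

A = [[4.0, -1.0], [-1.0, 4.0]]
x = [1.0, 2.0]
y = matvec(A, x)            # fault-free result
ok = passes_check(A, x, y)  # check passes
y[0] += 0.5                 # simulate a transient error in the compute
detected = not passes_check(A, x, y)
```

The tolerance parameter stands in for the roundoff-error analysis discussed above: it must be loose enough to avoid false alarms yet tight enough to catch genuine faults.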

Collaboration


Dive into Nikolaos Bellas's collaborations.
