Konstantinos Masselos

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Konstantinos Masselos is active.

Explore More

Publication

Featured researches published by Konstantinos Masselos.

field-programmable logic and applications | 2008

Combining data reuse exploitationwith data-level parallelization for FPGA targeted hardware compilation: A geometric programming framework

Qiang Liu; George A. Constantinides; Konstantinos Masselos; Peter Y. K. Cheung

A nonlinear optimization framework is proposed in this paper to automate exploration of the design space consisting of data-reuse (buffering) decisions and loop-level parallelization, in the context of field-programmable-gate-array-targeted hardware compilation. Buffering frequently accessed data in on-chip memories can reduce off-chip memory accesses and open avenues for parallelization. However, the exploitation of both data reuse and parallelization is limited by the memory resources available on-chip. As a result, considering these two problems separately, e.g., first exploring data reuse and then exploring data-level parallelization, based on the data-reuse options determined in the first step, may not yield the performance-optimal designs for limited on-chip memory resources. We consider both problems at the same time, exposing the dependence between the two. We show that this combined problem can be formulated as a nonlinear program and further show that efficient solution techniques exist for this problem, based on recent advances in optimization of so-called geometric programming problems. The results from applying this framework to several real benchmarks implemented on a Xilinx device demonstrate that given different constraints on on-chip memory utilization, the corresponding performance-optimal designs are automatically determined by the framework. We have also implemented designs determined by a two-stage optimization method that first explores data reuse and then explores parallelization on the same platform, and by comparison, the performance-optimal designs proposed by our framework are faster than the designs determined by the two-stage method by up to 5.7 times.

signal processing systems | 2008

Implementation and Comparison of the 5/3 Lifting 2D Discrete Wavelet Transform Computation Schedules on FPGAs

Maria E. Angelopoulou; Konstantinos Masselos; Peter Y. K. Cheung; Yiannis Andreopoulos

The suitability of the 2D Discrete Wavelet Transform (DWT) as a tool in image and video compression is nowadays indisputable. For the execution of the multilevel 2D DWT, several computation schedules based on different input traversal patterns have been proposed. Among these, the most commonly used in practical designs are: the row–column, the line-based and the block-based. In this work, these schedules are implemented on FPGA-based platforms for the forward 2D DWT by using a lifting-based filter-bank implementation. Our designs were realized in VHDL and optimized in terms of throughput and memory requirements, in accordance with the principles of both the schedules and the lifting decomposition. The implementations are fully parameterized with respect to the size of the input image and the number of decomposition levels. We provide detailed experimental results concerning the throughput, the area, the memory requirements and the energy dissipation, associated with every point of the parameter space. These results demonstrate that the choice of the suitable schedule is a decision that should be dependent on the given algorithmic specifications.

IEEE Transactions on Very Large Scale Integration Systems | 2008

Outer Loop Pipelining for Application Specific Datapaths in FPGAs

Kieron Turkington; George A. Constantinides; Konstantinos Masselos; Peter Y. K. Cheung

Most hardware compilers apply loop pipelining to increase the parallelism achieved, but pipelining is restricted to the only innermost level in a nested loop. In this work we extend and adapt an existing outer loop pipelining approach known as single dimension software pipelining to generate schedules for field-programmable gate-array (FPGA) hardware coprocessors. Each loop level in nine test loops is pipelined and the resulting schedules are implemented in VHDL and targeted to an Altera Stratix II FPGA. The results show that the fastest solution for all but one of the loops occurs when pipelining is applied one to three levels above the innermost loop. Across the nine test loops we achieve an acceleration over the innermost loop solution of up to seven times, with a mean speedup of 3.2 times. The results suggest that inclusion of outer loop pipelining in future hardware compilers may be worthwhile as it can allow significantly improved results to be achieved at the cost of a small increase in compile time.

field-programmable custom computing machines | 2007

Automatic On-chip Memory Minimization for Data Reuse

Qiang Liu; George A. Constantinides; Konstantinos Masselos; Peter Y. K. Cheung

FPGA-based computing engines have become a promising option for the implementation of computationally intensive applications due to high flexibility and parallelism. However, one of the main obstacles to overcome when trying to accelerate an application on an FPGA is the bottleneck in off-chip communication, typically to large memories. Often it is known at compile-time that the same data item is accessed many times, and as a result can be loaded once from large off-chip RAM onto scarce on-chip RAM, alleviating this bottleneck. This paper addresses how to automatically derive an address mapping that reduces the size of the required on-chip memory for a given memory access pattern. Experimental results demonstrate that, in practice, our approach reduces on-chip storage requirements to the minimum, corresponding to a reduction in on-chip memory size of up to 40times (average 10times) for some benchmarks compared to a naive approach. At the same time, no clock period penalty or increase in control logic area compared to this approach is observed for these benchmarks.

field-programmable logic and applications | 2006

FPGA Based Acceleration of the Linpack Benchmark: A High Level Code Transformation Approach

Kieron Turkington; Konstantinos Masselos; George A. Constantinides; Philip Heng Wai Leong

Due to their increasing resource densities, field programmable gate arrays (FPGAs) have become capable of efficiently implementing large scale scientific applications involving floating point computations. In this paper FPGAs are compared to a high end microprocessor with respect to sustained performance for a popular floating point CPU performance benchmark, namely LINPACK 1000. A set of translation and optimization steps have been applied to transform a sequential C description of the LINPACK benchmark, based on a monolithic memory model, into a parallel Handel-C description that utilizes the plurality of memory resources available on a realistic reconfigurable computing platform. The experimental results show that the latest generation of FPGAs, programmed using Handel-C, can achieve a sustained floating point performance up to 6 times greater than the microprocessor while operating at a clock frequency that is 60 times lower. The transformations are applied in a way that could be generalized, allowing efficient compilation approaches for the mapping of high level descriptions onto FPGAs.

Eurasip Journal on Embedded Systems | 2007

Implementation of wireless communications systems on FPGA-based platforms

Konstantinos Masselos; Nikolaos S. Voros

Wireless communications are a very popular application domain. The efficient implementation of their components (access points and mobile terminals/network interface cards) in terms of hardware cost and design time is of great importance. This paper describes the design and implementation of the HIPERLAN/2 WLAN system on a platform including general purpose microprocessors and FPGAs. Detailed implementation results (performance, code size, and FPGA resources utilization) are presented. The main goal of the design case presented is to provide insight into the design aspects of a complex system based on FPGAs. The results prove that an implementation based on microprocessors and FPGAs is adequate for the access point part of the system where the expected volumes are rather small. At the same time, such an implementation serves as a prototyping of an integrated implementation (System-on-Chip), which is necessary for the mobile terminals of a HIPERLAN/2 system. Finally, firmware upgrades were developed allowing the implementation of an outdoor wireless communication system on the same platform.

field-programmable logic and applications | 2006

Data Reuse Exploration for FPGA Based Platforms Applied to the Full Search Motion Estimation Algorithm

Qiang Liu; Konstantinos Masselos; George A. Constantinides

Compilation of high level descriptions to field programmable gate array hardware forms a promising option for the efficient mapping of computationally intensive applications under tight development time constraints. In this paper data reuse exploration on top of an existing hardware compilation environment is discussed. The full search motion estimation algorithm for video processing is used as a test vehicle. The systematic approach adopted for the exploration of the data reuse space is described. Experimental results prove that the exploitation of data reuse may lead to more than 80% reduction of the execution time and up to 95% reduction of the off-chip memory accesses.

field-programmable technology | 2006

A comparison of 2-D discrete wavelet transform computation schedules on FPGAs

Maria E. Angelopoulou; Konstantinos Masselos; Peter Y. K. Cheung; Yiannis Andreopoulos

When it comes to the computation of the 2D discrete wavelet transform (DWT), three major computation schedules have been proposed, namely the row-column, the line-based and the block-based. In this work, the lifting-based designs of these schedules are implemented on FPGA-based platforms to execute the forward 2D DWT, and their comparison is presented. Our implementations are optimized in terms of throughput and memory requirements, in accordance with the specifications of each one of the three computation schedules and the lifting decomposition. All implementations are parameterized with respect to the image size and the number of decomposition levels. Experimental results prove that the suitability of each implementation for a particular application depends on the given specifications, concerning the throughput and the hardware cost

Iet Computers and Digital Techniques | 2009

Data-reuse exploration under an on-chip memory constraint for low-power FPGA-based systems

Qiang Liu; George A. Constantinides; Konstantinos Masselos; Peter Y. K. Cheung

Contemporary FPGA-based reconfigurable systems have been widely used to implement data-dominated applications. In these applications, data transfer and storage consume a large proportion of the system energy. Exploiting data-reuse can introduce significant power savings, but also introduces the extra requirement for on-chip memory. To aid data-reuse design exploration early during the design cycle, the authors present an optimisation approach to achieve a power-optimal design satisfying an on-chip memory constraint in a targeted FPGA-based platform. The data-reuse exploration problem is mathematically formulated and shown to be equivalent to the multiple-choice knapsack problem. The solution to this problem for an application code corresponds to the decision of which array references are to be buffered on-chip and where loading reused data of the array references into on-chip memory happen in the code, in order to minimise power consumption for a fixed on-chip memory size. The authors also present an experimentally verified power model, capable of providing the relative power information between different data-reuse design options of an application, resulting in a fast and efficient design-space exploration. The experimental results demonstrate that the approach enables us to find the most power-efficient design for all the benchmark circuits tested.

IEEE Transactions on Very Large Scale Integration Systems | 2003

Power efficient data path synthesis of sum-of-products computations

Konstantinos Masselos; Panagiotis Merakos; S. Theoharis; Thanos Stouraitis; Costas E. Goutis

Techniques for the power efficient data path synthesis of sum-of-products computations between data and coefficients are presented. The proposed techniques exploit specific features of this type of computations. Efficient heuristics for the scheduling and assignment tasks, based on the concept of the Traveling Salesmans Problem, are described. Different cost functions are proposed to drive the synthesis tasks. The proposed cost functions target the power consumption either in the interconnect buses or in the functional units. Experimental results from different relevant digital signal processing algorithmic kernels prove that the proposed synthesis techniques lead to significant power savings.

Explore More