Catalin Bogdan Ciobanu

Publication

Featured research published by Catalin Bogdan Ciobanu.

international symposium on microarchitecture | 2010

The SARC Architecture

Alex Ramirez; Felipe Cabarcas; Ben H. H. Juurlink; Mauricio Alvarez Mesa; Friman Sánchez; Arnaldo Azevedo; Cor Meenderinck; Catalin Bogdan Ciobanu; Sebastian Isaza; Georgi Gaydadjiev

The SARC architecture is composed of multiple processor types and a set of user-managed direct memory access (DMA) engines that let the runtime scheduler overlap data transfer and computation. The runtime system automatically allocates tasks on the heterogeneous cores and schedules the data transfers through the DMA engines. SARC's programming model supports various highly parallel applications, with matching support from specialized accelerator processors.

international conference on embedded computer systems: architectures, modeling, and simulation | 2010

A Polymorphic Register File for matrix operations

Catalin Bogdan Ciobanu; Georgi Kuzmanov; Georgi Gaydadjiev; Alex Ramirez

Previous vector architectures divided the available register file space into a fixed number of registers of equal sizes and shapes. We propose a register file organization, referred to as a Polymorphic Register File, which allows dynamic creation of a variable number of multidimensional registers of arbitrary sizes. Our objective is to evaluate the performance benefits of the proposed organization. Simulation results using real applications (Floyd and CG) suggest speedups of up to 3 times compared to the Cell SPU for Floyd and 2 times compared to a one-dimensional vectorized version of the sparse matrix vector multiplication. Moreover, in the same experimental context, a large reduction in the number of executed instructions is achieved: up to 3000 times for Floyd and 2000 times for sparse matrix vector multiplication.

automation, robotics and control systems | 2011

Scalability evaluation of a polymorphic register file: A CG case study

Catalin Bogdan Ciobanu; Xavier Martorell; Georgi Kuzmanov; Alex Ramirez; Georgi Gaydadjiev

We evaluate the scalability of a Polymorphic Register File using the Conjugate Gradient method as a case study. We focus on a heterogeneous multi-processor architecture, taking into consideration critical parameters such as cache bandwidth and memory latency. We compare the performance of 256 Polymorphic Register File-augmented workers against a single Cell PowerPC Processor Unit (PPU). In such a scenario, simulation results suggest that for the Sparse Matrix Vector Multiplication kernel, absolute speedups of up to 200 times can be obtained. Moreover, when an equal number of workers in the range 1-256 is employed, our design is between 1.7 and 4.2 times faster than a Cell PPU-based system. Furthermore, we study the impact of memory latency and cache bandwidth on the sustainable speedups of the system considered. Our tests suggest that a 128-worker configuration requires the caches to deliver 1638.4 GB/sec in order to preserve 80% of its peak speedup.

computing frontiers | 2009

Wave field synthesis for 3D audio: architectural prospectives

Dimitris Theodoropoulos; Catalin Bogdan Ciobanu; Georgi Kuzmanov

In this paper, we compare the architectural perspectives of the Wave Field Synthesis (WFS) 3D-audio algorithm mapped on three different platforms: a General Purpose Processor (GPP), a Graphics Processor Unit (GPU) and a Field Programmable Gate Array (FPGA). Previous related work reveals that, up to now, WFS sound systems are based on standard PCs. However, on one hand, contemporary GPUs consist of many multiprocessors that can process data concurrently. On the other hand, recent FPGAs provide a huge level of parallelism and reasonably high performance potential, which can be exploited very efficiently by smart designers. Furthermore, new parallel programming environments, such as the Compute Unified Device Architecture (CUDA) from NVidia and Stream from ATI, give researchers full access to the GPU resources. We use CUDA to map the WFS kernel on a GeForce 8600GT GPU. Additionally, we implement a reconfigurable and scalable hardware accelerator for the same kernel, and map it onto Virtex4 FPGAs. We compare both architectural approaches against a baseline GPP implementation on a Pentium D at 3.4 GHz. Our conclusion is that in highly demanding WFS-based audio systems, a low-cost GeForce 8600GT desktop GPU can achieve a speedup of up to 8x compared to a modern Pentium D implementation. An FPGA-based WFS hardware accelerator consisting of a single rendering unit (RU) can provide a speedup of up to 10x compared to the Pentium D approach. It can fit into small FPGAs and consumes approximately 3 Watts. Furthermore, cascading multiple RUs into a larger FPGA can boost processing throughput to more than two orders of magnitude higher than a GPP-based implementation and an order of magnitude better than a low-cost GPU one.

reconfigurable communication centric systems on chip | 2012

On implementability of Polymorphic Register Files

Catalin Bogdan Ciobanu; Georgi Kuzmanov; Georgi Gaydadjiev

This paper studies the implementability of performance-efficient multi-lane Polymorphic Register Files (PRFs). Our PRF implementation uses a 2D array of p × q linearly addressable memory banks, with customized addressing functions to avoid address routing circuits. We target one single-view and a set of four non-redundant multi-view parallel memory schemes that cover all widely used access patterns in scientific and multimedia applications: 1) p × q rectangle, p · q row, p · q main and secondary diagonals; 2) p × q rectangle, p · q column, p · q main and secondary diagonals; 3) p · q row, p · q column, aligned p × q rectangle; 4) p × q, q × p rectangles (transposition). Reconfigurable hardware was chosen for the implementation due to its potential for enhancing the PRF runtime adaptability. As a proof of concept, we prototyped a PRF with 2 read ports and 1 write port on a Virtex-7 XC7VX1140T-2 FPGA. We consider four sizes for the 16-lane PRFs - 16 × 16, 32 × 32, 64 × 64 and 128 × 128 - and three multi-lane configurations, 8, 16 and 32, for the 128 × 128 PRF. Synthesis results suggest clock frequencies between 111 MHz and 326 MHz while utilizing less than 10% of the available LUTs. By using customized addressing functions, the LUT usage is reduced by up to 29% and the clock frequency is up to 77% higher compared to a straightforward implementation.
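
As an illustration of the bank layout this abstract describes, the sketch below maps a 2D register file onto a p × q grid of banks using the simplest rectangle-only module assignment, (i mod p, j mod q). This mapping, the intra-bank address formula, and every function name are assumptions chosen for illustration; the abstract does not spell out the paper's actual single-view and multi-view schemes or its customized addressing functions.

```python
from itertools import product

def bank_of(i, j, p, q):
    """Bank coordinates holding element (i, j); assumed rectangle-only mapping."""
    return (i % p, j % q)

def intra_bank_address(i, j, p, q, cols):
    """Linear address inside the bank; assumes cols is a multiple of q."""
    return (i // p) * (cols // q) + (j // q)

def rectangle_is_conflict_free(i0, j0, p, q):
    """True if the p x q rectangle anchored at (i0, j0) touches p*q distinct banks."""
    banks = {bank_of(i0 + di, j0 + dj, p, q)
             for di, dj in product(range(p), range(q))}
    return len(banks) == p * q

def row_is_conflict_free(i0, j0, p, q):
    """True if p*q consecutive elements of row i0 touch p*q distinct banks."""
    banks = {bank_of(i0, j0 + d, p, q) for d in range(p * q)}
    return len(banks) == p * q
```

Under this assumed mapping, any p × q rectangle (aligned or not) can be read in a single parallel access, whereas a row of p · q elements only reaches q distinct banks when p > 1. That gap is one way to see why schemes covering rows, columns and diagonals need different module assignment functions.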

digital systems design | 2012

Scalability Study of Polymorphic Register Files

Catalin Bogdan Ciobanu; Georgi Kuzmanov; Georgi Gaydadjiev

We study the scalability of multi-lane 2D Polymorphic Register Files (PRFs) in terms of clock cycle time, chip area and power consumption. We assume an implementation which stores data in a 2D array of linearly addressable memory banks, and consider one single-view and four suitable multi-view parallel access schemes which cover all basic access patterns commonly used in scientific and multimedia applications. The PRF design features 2 read ports and 1 write port, targeting the TSMC 90nm ASIC technology. We consider three PRF sizes - 32KB, 128KB and 512KB - and four multi-lane configurations - 8, 16, 32 and 64 lanes. Synthesis results suggest that the clock frequency varies between 500 MHz for a 512KB PRF with 64 vector lanes and 970 MHz for the 32KB / 8-lane case. Estimated power consumption ranges from less than 300mW (dynamic) and 10mW (leakage) for our 8-lane, 32KB PRF up to 8.7W (dynamic) and 276mW (leakage) for a 512KB PRF with 64 lanes. We also show the correlation among the storage capacity, the number of lanes, and the overall chip area. Furthermore, we investigate customized addressing functions. Our experimental results suggest up to a 21% increase in clock frequency, and up to a 39% reduction in combinational hardware area (nearly 10% of the total area) compared to our straightforward implementations. Concerning power, we reduce dynamic power by up to 31% and leakage by nearly 24%.

international conference on supercomputing | 2014

Real-Time Olivary Neuron Simulations on Dataflow Computing Machines

Georgios Smaragdos; Craig Davies; Christos Strydis; Ioannis Sourdis; Catalin Bogdan Ciobanu; Oskar Mencer; Chris I. De Zeeuw

The Inferior-Olivary nucleus (ION) is a well-charted brain region, heavily associated with the sensorimotor control of the body. It comprises neural cells with unique properties which facilitate sensory processing and motor-learning skills. Simulations of such neurons rapidly become intractable when biophysically plausible models and meaningful network sizes (at least on the order of some hundreds of cells) are modeled. To overcome this problem, we accelerate a highly detailed ION network model using a Maxeler Dataflow Computing Machine. The design simulates a 330-cell network at real-time speed and achieves a maximum throughput of 24.7 GFLOPS. The Maxeler machine, integrating a Virtex-6 FPGA, yields speedups of ×92-102 and ×2-8 compared to a reference C implementation running on an Intel Xeon at 2.66 GHz and a pure Virtex-7 FPGA implementation, respectively.

automation, robotics and control systems | 2013

Separable 2D convolution with polymorphic register files

Catalin Bogdan Ciobanu; Georgi Gaydadjiev

This paper studies the performance of separable 2D convolution on multi-lane Polymorphic Register Files (PRFs). We present a matrix transposition algorithm optimized for PRFs, and a 2D vectorized convolution algorithm which avoids strided memory accesses. We compare the throughput of our PRF to the nVidia Tesla C2050 GPU. The results show that even in bandwidth-constrained systems, multi-lane PRFs can outperform the GPU for 9 × 9 or larger mask sizes.
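
To make the idea of separable 2D convolution concrete, the following minimal sketch applies a row pass, then performs the column pass as transpose, row pass, transpose, so that every pass walks contiguous rows. It is plain scalar Python under assumed names (`convolve_rows`, `separable_convolve`, `direct_convolve`) with "valid" borders, and is not the PRF-optimized algorithm from the paper.

```python
def convolve_rows(image, taps):
    """1D 'valid' convolution of every row with taps (taps are flipped)."""
    k = len(taps)
    width = len(image[0]) - k + 1
    return [[sum(taps[b] * row[x + k - 1 - b] for b in range(k))
             for x in range(width)]
            for row in image]

def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]

def separable_convolve(image, col_taps, row_taps):
    """Row pass, then column pass done as transpose -> row pass -> transpose."""
    tmp = convolve_rows(image, row_taps)
    return transpose(convolve_rows(transpose(tmp), col_taps))

def direct_convolve(image, col_taps, row_taps):
    """Reference 2D convolution with the outer-product mask col_taps x row_taps."""
    kh, kw = len(col_taps), len(row_taps)
    height = len(image) - kh + 1
    width = len(image[0]) - kw + 1
    return [[sum(col_taps[a] * row_taps[b] * image[y + kh - 1 - a][x + kw - 1 - b]
                 for a in range(kh) for b in range(kw))
             for x in range(width)]
            for y in range(height)]
```

For a kh × kw mask that factors into a column vector and a row vector, the separable form needs kh + kw multiplications per output element instead of kh · kw, which is why the approach becomes more attractive as mask sizes grow.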

computational science and engineering | 2015

EXTRA: Towards an Efficient Open Platform for Reconfigurable High Performance Computing

Catalin Bogdan Ciobanu; Ana Lucia Varbanescu; Dionisios N. Pnevmatikatos; George Charitopoulos; Xinyu Niu; Wayne Luk; Marco D. Santambrogio; Donatella Sciuto; Muhammed Al Kadi; Michael Huebner; Tobias Becker; Georgi Gaydadjiev; Andreas Brokalakis; Antonis Nikitakis; Alex J. W. Thom; Elias Vansteenkiste; Dirk Stroobandt

To handle the stringent performance requirements of future exascale-class applications, High Performance Computing (HPC) systems need ultra-efficient heterogeneous compute nodes. To reduce power and increase performance, such compute nodes will require hardware accelerators with a high degree of specialization. Ideally, dynamic reconfiguration will be an intrinsic feature, so that specific HPC application features can be optimally accelerated, even if they regularly change over time. In the EXTRA project, we create a new and flexible exploration platform for developing reconfigurable architectures, design tools and HPC applications with run-time reconfiguration built in as a core feature instead of an add-on. EXTRA covers the entire stack from architecture up to the application, focusing on the fundamental building blocks for run-time reconfigurable exascale HPC systems: new chip architectures with very low reconfiguration overhead, new tools that truly take reconfiguration as a central design concept, and applications that are tuned to maximally benefit from the proposed run-time reconfiguration techniques. Ultimately, this open platform will improve Europe's competitive advantage and leadership in the field.

international conference on supercomputing | 2013

FASTER run-time reconfiguration management

Catalin Bogdan Ciobanu; Dionisios N. Pnevmatikatos; Kyprianos Papadimitriou; Georgi Gaydadjiev

The FASTER project Run-Time System Manager relieves programmers of low-level operations by performing task placement, scheduling, and dynamic FPGA reconfiguration. It also manages device fragmentation, configuration caching, pre-fetching and reuse, and bitstream compression, and optimizes the system's thermal and power footprints. We propose a micro-reconfiguration-aware, configuration-content-agnostic ISA interface and a technology-independent Task Configuration Microcode format targeting Maxeler Data Flow computers and Xilinx XUPV5 platforms. We achieve improved resource utilization with negligible performance overhead. Up to 4 Gbps for DMA transfers and up to 3 Gbps for FPGA reconfiguration on Xilinx Virtex-5/6 devices are achieved.