Panagiotis Athanasopoulos

École Polytechnique Fédérale de Lausanne

Publication

Featured research published by Panagiotis Athanasopoulos.

IEEE Journal on Emerging and Selected Topics in Circuits and Systems | 2012

Design and Testing Strategies for Modular 3-D-Multiprocessor Systems Using Die-Level Through Silicon Via Technology

Giulia Beanato; Paolo Giovannini; Alessandro Cevrero; Panagiotis Athanasopoulos; Michael Zervas; Yuksel Temiz; Yusuf Leblebici

An innovative modular 3-D stacked multi-processor architecture is presented. The platform is composed of completely identical stacked dies connected together by through-silicon-vias (TSVs). Each die features four 32-bit embedded processors and associated memory modules, interconnected by a 3-D network-on-chip (NoC), which can route packets in the vertical direction. Superimposing identical planar dies minimizes design effort and manufacturing costs, ensuring at the same time high flexibility and reconfigurability. A single die can be used either as a fully testable standalone chip multi-processor (CMP), or integrated in a 3-D stack, increasing the overall core count and consequently the system performance. To demonstrate the feasibility of this architecture, fully functional samples have been fabricated using a conventional UMC 90 nm complementary metal-oxide-semiconductor process and stacked using an in-house, via-last Cu-TSV process. Initial results show that the proposed 3-D-CMP is capable of operating at a target frequency of 400 MHz, supporting a vertical data bandwidth of 3.2 Gb/s.
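The two headline figures in this abstract can be related with a quick calculation. The sketch below assumes one transfer per clock cycle; the abstract does not state this, nor whether 3.2 Gb/s is a per-link or an aggregate figure. The function name is illustrative.

```python
# Figures quoted in the abstract above.
CLOCK_HZ = 400e6       # target operating frequency (400 MHz)
VERTICAL_BPS = 3.2e9   # vertical data bandwidth (3.2 Gb/s)


def bits_per_cycle(bandwidth_bps: float, clock_hz: float) -> float:
    """Bits moved per clock cycle, ASSUMING one transfer per cycle."""
    return bandwidth_bps / clock_hz


if __name__ == "__main__":
    width = bits_per_cycle(VERTICAL_BPS, CLOCK_HZ)
    print(f"Implied vertical transfer width: {width:g} bits per cycle")
```

Under that assumption the quoted numbers correspond to 8 bits crossing the vertical interface per 400 MHz cycle.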

system on chip conference | 2010

Design and feasibility of multi-Gb/s quasi-serial vertical interconnects based on TSVs for 3D ICs

Fengda Sun; Alessandro Cevrero; Panagiotis Athanasopoulos; Yusuf Leblebici

This paper proposes a novel technique to exploit the high bandwidth offered by through silicon vias (TSVs). In the proposed approach, synchronous parallel 3D links are replaced by serialized links to save silicon area and increase yield. Detailed analysis conducted in 90 nm CMOS technology shows that the proposed 2-Gb/s/pin quasi-serial link requires approximately five times less area than its parallel bus equivalent at the same data rate for a TSV diameter of 20 µm.
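The general idea of trading a wide parallel bus for a narrower, faster link can be sketched in software. This is only a behavioural illustration, not the circuit from the paper: the lane count, the LSB-first ordering, and the function names are assumptions made for the example.

```python
def serialize(word: int, width: int, lanes: int) -> list[int]:
    """Split a `width`-bit word into beats of `lanes` bits each, LSB first.

    `lanes` models the number of vertical pins (hypothetical parameter).
    """
    if width <= 0 or lanes <= 0:
        raise ValueError("width and lanes must be positive")
    word &= (1 << width) - 1            # drop bits beyond the declared width
    mask = (1 << lanes) - 1
    beats = -(-width // lanes)          # ceiling division
    return [(word >> (i * lanes)) & mask for i in range(beats)]


def deserialize(beats: list[int], lanes: int) -> int:
    """Reassemble the word from beats produced by `serialize`."""
    word = 0
    for i, beat in enumerate(beats):
        word |= beat << (i * lanes)
    return word


if __name__ == "__main__":
    beats = serialize(0xDEADBEEF, width=32, lanes=4)
    print(len(beats), "beats over 4 pins instead of 32 parallel pins")
    print(hex(deserialize(beats, lanes=4)))
```

With 4 lanes, a 32-bit word needs 8 beats, so the link must run 8 times faster than the parallel bus to sustain the same data rate while using one eighth of the vertical connections.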

field programmable gate arrays | 2008

Architectural improvements for field programmable counter arrays: enabling efficient synthesis of fast compressor trees on FPGAs

Alessandro Cevrero; Panagiotis Athanasopoulos; Hadi Parandeh-Afshar; Ajay K. Verma; Philip Brisk; Frank K. Gürkaynak; Yusuf Leblebici; Paolo Ienne

The Field Programmable Counter Array (FPCA) was introduced to improve FPGA performance for arithmetic circuits. An FPCA is a reconfigurable IP core that can be integrated into an FPGA. To exploit the FPCA, a circuit is transformed by merging disparate addition and multiplication operations into large multi-input addition operations, which are synthesized as compressor trees on the FPCA; the remaining portion of the circuit is synthesized on the FPGA. This paper presents a series of architectural improvements to the FPCA that reduce routing delay, increase flexibility and component utilization, and simplify the integration process. Using an FPGA containing six FPCAs, we observed average and maximum speedups of 1.60x and 2.40x on a set of arithmetic benchmarks.
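For readers unfamiliar with compressor trees, the following is a minimal software model of what one computes. It uses textbook 3:2 carry-save compression and is not the FPCA architecture itself; the function names are illustrative, and the final built-in addition stands in for a carry-propagate adder.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compressor: returns (sum, carry) with a + b + c == sum + carry."""
    partial_sum = a ^ b ^ c                       # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1    # majority bits, shifted left
    return partial_sum, carry


def compressor_tree_sum(operands: list[int]) -> int:
    """Add non-negative integers by repeated 3:2 compression, then one final add."""
    values = list(operands)
    while len(values) > 2:
        next_level = []
        # Compress each full group of three operands into two.
        for i in range(0, len(values) - 2, 3):
            s, c = carry_save_add(values[i], values[i + 1], values[i + 2])
            next_level += [s, c]
        # Operands that did not fill a group pass through unchanged.
        leftover = len(values) % 3
        if leftover:
            next_level += values[-leftover:]
        values = next_level
    # A single carry-propagate addition finishes the job.
    return sum(values)


if __name__ == "__main__":
    print(compressor_tree_sum([1, 2, 3, 4, 5, 6, 7]))
```

No carry propagates inside `carry_save_add`, which is why hardware built this way can reduce many operands quickly and defer the slow carry chain to a single final adder.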

ACM Transactions on Reconfigurable Technology and Systems | 2009

Field Programmable Compressor Trees: Acceleration of Multi-Input Addition on FPGAs

Alessandro Cevrero; Panagiotis Athanasopoulos; Hadi Parandeh-Afshar; Ajay K. Verma; Hosein Seyed Attarzadeh Niaki; Chrysostomos Nicopoulos; Frank K. Gürkaynak; Philip Brisk; Yusuf Leblebici; Paolo Ienne

Multi-input addition occurs in a variety of arithmetically intensive signal processing applications. The DSP blocks embedded in high-performance FPGAs perform fixed bitwidth parallel multiplication and Multiply-ACcumulate (MAC) operations. In theory, the compressor trees contained within the multipliers could implement multi-input addition; however, they are not exposed to the programmer. To improve FPGA performance for these applications, this article introduces the Field Programmable Compressor Tree (FPCT) as an alternative to the DSP blocks. By providing just a compressor tree, the FPCT can perform multi-input addition along with parallel multiplication and MAC in conjunction with a small amount of FPGA general logic. Furthermore, the user can configure the FPCT to precisely match the bitwidths of the operands being summed. Although an FPCT cannot beat the performance of a well-designed ASIC compressor tree of fixed bitwidth, for example, 9×9 and 18×18-bit multipliers/MACs in DSP blocks, its configurable bitwidth and ability to perform multi-input addition is ideal for reconfigurable devices that are used across a variety of applications.

design, automation, and test in europe | 2013

3D-MMC: a modular 3D multi-core architecture with efficient resource pooling

Tiansheng Zhang; Alessandro Cevrero; Giulia Beanato; Panagiotis Athanasopoulos; Ayse Kivilcim Coskun; Yusuf Leblebici

This paper demonstrates a fully functional hardware and software design for a 3D stacked multi-core system for the first time. Our 3D system is a low-power 3D Modular Multi-Core (3D-MMC) architecture built by vertically stacking identical layers. Each layer consists of cores, private and shared memory units, and communication infrastructures. The system uses shared memory communication and Through-Silicon-Vias (TSVs) to transfer data across layers. A serialization scheme is employed for inter-layer communication to minimize the overall number of TSVs. The proposed architecture has been implemented in HDL and verified on a test chip targeting an operating frequency of 400MHz with a vertical bandwidth of 3.2Gbps. The paper first evaluates the performance, power and temperature characteristics of the architecture using a set of software applications we have designed. We demonstrate quantitatively that the proposed modular 3D design improves upon the cost and performance bottlenecks of traditional 2D multi-core design. In addition, a novel resource pooling approach is introduced to efficiently manage the shared memory of the 3D stacked system. Our approach reduces the application execution time significantly compared to 2D and 3D systems with conventional memory sharing.

field-programmable technology | 2009

A flexible DSP block to enhance FPGA arithmetic performance

Hadi Parandeh-Afshar; Alessandro Cevrero; Panagiotis Athanasopoulos; Philip Brisk; Yusuf Leblebici; Paolo Ienne

We propose a new DSP block for use in modern high-performance FPGAs. Current DSP blocks contain fixed-bitwidth multipliers that can be combined efficiently to form larger multipliers. Our approach is similar, but includes a bypass layer following the partial product generator that exposes the compressor tree used for partial product reduction directly to the user. As a consequence, the proposed DSP block can accelerate multi-input addition operations in addition to multiplication. To increase the flexibility of the device, the partial product reduction tree used within our DSP block uses a fixed-function compression logic along with a field programmable compressor tree (FPCT), the latter of which is user-configurable to meet the needs of the application at hand. Multi-input addition operations can be mapped directly onto the FPCT without compromising any of the other functionality of the DSP block.
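The relationship between multiplication and multi-input addition that this abstract relies on can be shown with a small behavioural model. It is a sketch under stated assumptions: `dsp_block`, `bypass_operands`, and the other names are hypothetical, operands are unsigned, and a plain sum stands in for the reduction tree.

```python
from typing import Optional


def partial_products(a: int, b: int, width: int) -> list[int]:
    """One shifted copy of `a` per bit of the `width`-bit multiplier `b` (0 if the bit is clear)."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]


def reduce_operands(operands: list[int]) -> int:
    """Stand-in for the reduction tree: models only its arithmetic result."""
    return sum(operands)


def dsp_block(a: int, b: int, width: int,
              bypass_operands: Optional[list[int]] = None) -> int:
    """Multiply by default; with the bypass set, sum arbitrary operands instead."""
    if bypass_operands is not None:
        # Bypass path: skip partial product generation and feed the tree directly.
        return reduce_operands(bypass_operands)
    return reduce_operands(partial_products(a, b, width))


if __name__ == "__main__":
    print(dsp_block(13, 11, width=4))                        # multiplication path
    print(dsp_block(0, 0, 4, bypass_operands=[1, 2, 3, 4]))  # multi-input addition path
```

Both paths end in the same reduction step, which is the point of the bypass: the hardware that sums partial products can also sum user-supplied operands.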

field-programmable logic and applications | 2009

Using 3D integration technology to realize multi-context FPGAs

Alessandro Cevrero; Panagiotis Athanasopoulos; Hadi Parandeh-Afshar; Maurizio Skerlj; Philip Brisk; Yusuf Leblebici; Paolo Ienne

This paper advocates the use of 3D integration technology to stack a DRAM on top of an FPGA. The DRAM will store future FPGA contexts. A configuration is read from the DRAM into a latch array on the DRAM layer while the FPGA executes; the new configuration is loaded from the latch array into the FPGA in 60ns (5 cycles). The latency between reconfigurations, 8.42µs, is dominated by the time to read data from the DRAM into the latch array. We estimate that the DRAM can cache 289 FPGA contexts.
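The timing figures in this abstract imply a few derived quantities. The sketch below assumes the 5 cycles span exactly the quoted 60 ns; the function names are illustrative.

```python
# Figures quoted in the abstract above.
LOAD_TIME_S = 60e-9            # latch array -> FPGA load time
LOAD_CYCLES = 5                # cycles taken by that load
RECONFIG_LATENCY_S = 8.42e-6   # latency between reconfigurations


def clock_period_s(load_time_s: float, cycles: int) -> float:
    """Clock period, ASSUMING the cycles exactly fill the load time."""
    return load_time_s / cycles


def load_fraction(load_time_s: float, latency_s: float) -> float:
    """Share of the reconfiguration latency spent on the final load."""
    return load_time_s / latency_s


if __name__ == "__main__":
    period = clock_period_s(LOAD_TIME_S, LOAD_CYCLES)
    print(f"Implied clock period: {period * 1e9:.1f} ns")
    print(f"Implied clock frequency: {1 / period / 1e6:.1f} MHz")
    share = load_fraction(LOAD_TIME_S, RECONFIG_LATENCY_S)
    print(f"Final load is {share:.2%} of the reconfiguration latency")
```

The implied clock period is 12 ns (roughly 83 MHz), and the final load accounts for under 1% of the 8.42 µs latency, consistent with the statement that the DRAM read dominates.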

field programmable gate arrays | 2009

3D configuration caching for 2D FPGAs

Alessandro Cevrero; Panagiotis Athanasopoulos; Hadi Parandeh-Afshar; Philip Brisk; Yusuf Leblebici; Paolo Ienne; Maurizio Skerlj

This poster proposes the use of 3D integration technology to enable low-overhead reconfigurable computing. In our scheme, a 64 Megabyte DRAM array is stacked on top of an FPGA using face-to-face bonding, and caches up to 289 future configurations which can be quickly loaded onto the FPGA. Past DRAMs have been designed for off-chip communication, a bottleneck that 3D stacking eliminates; hence, the DRAM array is redesigned. To reconfigure the FPGA, a configuration is read from the DRAM into a latch array while the FPGA executes; then, the configuration is loaded from the latch array into the FPGA in 5 cycles (60ns). The minimum latency between reconfigurations, 8.42 µs, is dominated by the time to load data from the DRAM into the latch array. The benefits, area cost, and performance of the proposed system are evaluated on three previously published FPGA implementations of multimedia applications: MP3 and MPEG-4 decoders, and JPEG compression, and are evaluated under three scenarios: No Dynamic ReConfiguration (NDRC), Off-chip Dynamic ReConfiguration (ODRC), and 3D Configuration Caching (3DCC). Our experiments demonstrate that 3D configuration caching works best when used in conjunction with FPGA-based accelerators, rather than pure FPGA-based systems; in these systems, the reconfiguration latency can easily be hidden behind software execution on the processor controlling the accelerator. This significantly reduces the amount of silicon area that must be dedicated to the accelerator, while imposing virtually no performance penalty compared to significantly larger accelerators that do not require reconfiguration.
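The capacity figures in this abstract give an upper bound on the size of one cached configuration. The sketch below assumes the whole DRAM array holds configurations with no overhead, and it shows both readings of "64 Megabyte" because the abstract does not say which is meant; the function name is illustrative.

```python
CONTEXTS = 289                     # cached configurations, from the abstract
DRAM_BYTES_BINARY = 64 * 2**20     # 64 MiB reading (assumption)
DRAM_BYTES_DECIMAL = 64 * 10**6    # 64 MB decimal reading (assumption)


def bytes_per_context(dram_bytes: int, contexts: int) -> int:
    """Upper bound on one configuration's size, ASSUMING no storage overhead."""
    return dram_bytes // contexts


if __name__ == "__main__":
    for label, size in (("binary", DRAM_BYTES_BINARY), ("decimal", DRAM_BYTES_DECIMAL)):
        budget = bytes_per_context(size, CONTEXTS)
        print(f"{label} reading: at most {budget} bytes per configuration")
```

Either reading puts the per-configuration budget in the low hundreds of kilobytes; the actual bitstream size is not stated here.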

international conference on computer aided design | 2009

Memory organization and data layout for instruction set extensions with architecturally visible storage

Panagiotis Athanasopoulos; Philip Brisk; Yusuf Leblebici; Paolo Ienne

Present application-specific embedded systems tend to choose instruction set extensions (ISEs) based on limitations imposed by the available data bandwidth to custom functional units (CFUs). Adoption of the optimal ISE for an application would, in many cases, impose a formidable cost increase in order to achieve the required data bandwidth. In this paper we propose a novel methodology for laying out data in memories, generating high-bandwidth memory systems by making use of existing low-bandwidth, low-cost ones, and designing custom functional units, all with the desirable data bandwidth for only a fraction of the additional cost required by traditional techniques.

design automation conference | 2012