Publication


Featured research published by José L. Abellán.


Design, Automation, and Test in Europe | 2014

Thermal management of manycore systems with silicon-photonic networks

Tiansheng Zhang; José L. Abellán; Ajay Joshi; Ayse Kivilcim Coskun

Silicon-photonic networks-on-chip (NoCs) provide high bandwidth density, making them promising candidates to replace electrical NoCs in manycore systems. Silicon-photonic NoCs, however, are sensitive to the temperature gradients that typically occur on chip, and hence require proactive thermal management. This paper first provides a design-space exploration of silicon-photonic networks in manycore systems and quantifies the performance impact of temperature gradients for various network bandwidths. The paper then introduces a novel job allocation technique that minimizes the temperature gradients among the ring modulators/filters to improve application performance. Experimental results for a single-chip 256-core system demonstrate that our policy is able to maintain the maximum network bandwidth. Compared to existing workload allocation policies, the proposed policy improves system performance by up to 26.1% when running a single application and by up to 18.3% in multi-program scenarios.
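
To make the idea concrete, here is a minimal sketch of a thermally-aware allocation policy in the spirit of the one described above; it is not the authors' algorithm, and all names and the temperature model are assumptions. It greedily places the most power-hungry jobs on the cores whose ring modulators are currently coolest, flattening the on-chip temperature gradient:

    #include <stdlib.h>

    #define NCORES 256

    /* Hypothetical per-core and per-job state: estimated temperature near
     * each core's ring modulators, and each job's expected power. */
    typedef struct { int id; double temp; } core_t;
    typedef struct { int id; double power; } job_t;

    static int by_temp_asc(const void *a, const void *b) {
        double d = ((const core_t *)a)->temp - ((const core_t *)b)->temp;
        return (d > 0) - (d < 0);
    }

    static int by_power_desc(const void *a, const void *b) {
        double d = ((const job_t *)b)->power - ((const job_t *)a)->power;
        return (d > 0) - (d < 0);
    }

    /* Greedy placement: hottest jobs onto coolest cores, shrinking the
     * temperature spread among the ring modulators/filters. */
    void allocate(core_t cores[], job_t jobs[], int njobs, int placement[]) {
        qsort(cores, NCORES, sizeof(core_t), by_temp_asc);
        qsort(jobs, njobs, sizeof(job_t), by_power_desc);
        for (int i = 0; i < njobs && i < NCORES; i++)
            placement[jobs[i].id] = cores[i].id;   /* job -> core */
    }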


International Conference on Parallel Processing | 2010

A G-Line-Based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs

José L. Abellán; Juan C. Fernandez; Manuel E. Acacio

Barrier synchronization in shared-memory parallel machines has been widely implemented through busy-waiting on shared variables. However, typical implementations of barrier synchronization tend to produce hot-spots in terms of memory and network contention, creating performance bottlenecks that become markedly more pronounced as the number of cores or processors increases. To overcome such limitations, we present a novel hardware-based barrier mechanism in the context of many-core CMPs. Our proposal is based on global interconnection lines (G-lines) and the S-CSMA technique, which have recently been used to enhance a flow-control mechanism (EVC) in the context of networks-on-chip. Based on this technology, we have designed a simple and scalable G-line-based network that operates independently of the main data network and is aimed at carrying out barrier synchronizations efficiently. In the ideal case, our design takes only 4 cycles to perform a barrier synchronization once all cores or threads have arrived at the barrier. As a proof of concept, we examine the benefits of our proposal by comparing it with one of the best software approaches (a binary combining-tree barrier). To do so, we run several kernels and scientific applications on top of the Sim-PowerCMP performance simulator, which models a 32-core CMP with a 2D-mesh network configuration. Our proposal yields average reductions in execution time of 68% for kernels and 21% for scientific applications, and lowers network traffic by 74% and 18%, respectively.
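
For reference, the software baseline named above, a binary combining-tree barrier, can be sketched with C11 atomics roughly as follows (an illustrative version, not the paper's code; two threads share each leaf, and release is via a single sense-reversing flag):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Two threads share each leaf; internal nodes combine pairs of
     * subtree arrivals, so no single counter sees all N threads. */
    typedef struct node {
        atomic_int arrived;      /* arrivals this episode: 0, 1, or 2 */
        struct node *parent;     /* NULL at the root */
    } node_t;

    atomic_bool global_sense = false;

    void tree_barrier(node_t *leaf, bool *local_sense) {
        bool my_sense = !*local_sense;    /* sense for this episode */
        node_t *n = leaf;
        /* Climb while we are the second (last) arrival at each node. */
        while (n != NULL && atomic_fetch_add(&n->arrived, 1) == 1) {
            atomic_store(&n->arrived, 0); /* reset node for reuse */
            n = n->parent;
        }
        if (n == NULL)                    /* last arrival overall: release */
            atomic_store(&global_sense, my_sense);
        else                              /* busy-wait on the sense flag --
                                             the hot spot G-lines avoid */
            while (atomic_load(&global_sense) != my_sense)
                ;
        *local_sense = my_sense;
    }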


IEEE Transactions on Parallel and Distributed Systems | 2012

Efficient Hardware Barrier Synchronization in Many-Core CMPs

José L. Abellán; Juan Fernández; Manuel E. Acacio

Traditional software-based barrier implementations for shared-memory parallel machines tend to produce hotspots in terms of memory and network contention as the number of processors increases. This could limit their applicability to future many-core CMPs, in which possibly several dozen cores would need to be synchronized efficiently. In this work, we develop GBarrier, a hardware-based barrier mechanism especially aimed at providing efficient barriers in future many-core CMPs. Our proposal deploys a dedicated G-line-based network to allow for fast and efficient signaling of barrier arrival and departure. Since GBarrier does not have any influence on the memory system, we avoid all the coherence activity and barrier-related network traffic that traditional approaches introduce and that restrict scalability. Through detailed simulations of a 32-core CMP, we compare GBarrier against one of the most efficient software-based barrier implementations for a set of kernels and scientific applications. Evaluation results show average reductions of 54 and 21 percent in execution time, 53 and 18 percent in network traffic, and 76 and 31 percent in the energy-delay² product (ED²P) metric for the full CMP, for the kernels and scientific applications, respectively.
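
The energy-delay² product metric cited in the results weights delay quadratically, so it rewards designs that spend some energy to finish faster; a minimal helper makes the computation explicit:

    /* ED2P = energy * time^2: weighting delay quadratically favors
     * designs that trade some energy for speed. */
    double ed2p(double energy_joules, double time_seconds) {
        return energy_joules * time_seconds * time_seconds;
    }

    /* Percent reduction vs. a baseline, as reported above. */
    double reduction_pct(double baseline, double ours) {
        return 100.0 * (baseline - ours) / baseline;
    }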


Networks-on-Chip | 2015

Asymmetric NoC Architectures for GPU Systems

Amir Kavyan Ziabari; José L. Abellán; Yenai Ma; Ajay Joshi; David R. Kaeli

While both Chip MultiProcessors (CMPs) and Graphics Processing Units (GPUs) are many-core systems, they exhibit different memory access patterns. CMPs execute threads in parallel, where threads communicate and synchronize through the memory hierarchy (without any coalescing). GPUs, on the other hand, execute a large number of independent thread blocks whose accesses to memory are frequent and coalesced, resulting in a completely different access pattern. NoC designs for GPUs have not been extensively explored. In this paper, we first evaluate several NoC designs for GPUs to determine the most power/performance-efficient NoCs. To improve NoC energy efficiency, we explore an asymmetric NoC design tailored to a GPU's memory access pattern, providing one network for L1-to-L2 communication and a second for L2-to-L1 traffic. Our analysis shows that an asymmetric multi-network Cmesh provides the most energy-efficient communication fabric for our target GPU system.
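
The core mechanism, steering traffic onto one of two physically separate subnetworks by direction so each can be provisioned for its own traffic pattern, can be sketched as follows (names and message sizes are illustrative assumptions):

    /* Request traffic (many small L1->L2 messages) and reply traffic
     * (wide L2->L1 cache-line fills) ride physically separate networks,
     * so each can be sized for its own bandwidth needs. */
    typedef enum { NET_REQUEST, NET_REPLY } subnet_t;

    typedef struct {
        int src_is_l1;        /* 1 if injected by an L1 (compute-unit side) */
        int payload_bytes;    /* e.g., 8 B request vs. 64 B reply */
    } packet_t;

    subnet_t select_subnet(const packet_t *p) {
        /* L1-side injections are requests; L2-side injections are replies. */
        return p->src_is_l1 ? NET_REQUEST : NET_REPLY;
    }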


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2015

Managing Laser Power in Silicon-Photonic NoC Through Cache and NoC Reconfiguration

Chao Chen; José L. Abellán; Ajay Joshi

In manycore systems, silicon-photonic link technology is projected to replace electrical link technology for global communication in the network-on-chip (NoC), as it can provide as much as an order of magnitude higher bandwidth density and lower data-dependent power. However, a large amount of fixed power is dissipated in the laser sources required to drive these silicon-photonic links, which can negate the bandwidth density advantages. This laser power dissipation depends on the number of on-chip silicon-photonic links, the bandwidth of each link, and the photonic losses along each link. In this paper, we propose to reduce laser power dissipation at runtime by dynamically activating/deactivating L2 cache banks and switching ON/OFF the corresponding silicon-photonic links in the NoC. This method effectively throttles the total on-chip NoC bandwidth at runtime according to the memory access behavior of the applications running on the manycore system. Full-system simulations using the PARSEC (Princeton Application Repository for Shared-Memory Computers) and SPLASH-2 (Stanford Parallel Applications for Shared Memory) benchmark suites reveal that our proposed technique achieves on average 23.8% (peak 74.3%) savings in laser power and a 9.2% (peak 26.9%) lower energy-delay product for the whole system, at the cost of a 0.65% average (peak 2.6%) loss in instructions per cycle, compared to the case where all L2 cache banks are always active.
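
A minimal sketch of the kind of runtime control loop the abstract describes follows; the monitoring and reconfiguration hooks are hypothetical stand-ins, not the authors' interfaces:

    #define NBANKS 16

    extern double sample_l2_demand(void);          /* hypothetical monitor,
                                                      normalized to [0,1] */
    extern void   set_bank_active(int b, int on);  /* hypothetical knobs */
    extern void   set_link_laser(int b, int on);

    /* Called periodically: size the active L2 bank set to the observed
     * memory demand, and power lasers only for the links still in use.
     * A real policy would flush a bank before deactivating it. */
    void reconfigure(void) {
        double demand = sample_l2_demand();
        int want = (int)(demand * NBANKS);
        if (want < 1) want = 1;                    /* keep at least one bank */
        if (want > NBANKS) want = NBANKS;
        for (int b = 0; b < NBANKS; b++) {
            set_bank_active(b, b < want);
            set_link_laser(b, b < want);
        }
    }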


International Conference on Supercomputing | 2015

Leveraging Silicon-Photonic NoC for Designing Scalable GPUs

Amir Kavyan Ziabari; José L. Abellán; Rafael Ubal; Chao Chen; Ajay Joshi; David R. Kaeli

Silicon-photonic link technology promises to satisfy the growing need for high-bandwidth, low-latency, and energy-efficient network-on-chip (NoC) architectures. While silicon-photonic NoC designs have been extensively studied for future many-core systems, their use in massively-threaded GPUs has received little attention to date. In this paper, we first analyze an electrical NoC that connects different cache levels (L1 to L2) in a contemporary GPU memory hierarchy. Evaluating workloads from the AMD SDK on the Multi2Sim GPU simulator, we find that, apart from limits in memory bandwidth, an electrical NoC can significantly hamper performance and impede scalability, especially as the number of compute units grows in future GPU systems. To address this issue, we advocate using silicon-photonic link technology for on-chip communication in GPUs, and we present the first GPU-specific analysis of a cost-effective hybrid photonic crossbar NoC. Our baseline is an AMD Southern Islands GPU with 32 compute units (CUs), which we compare against our proposed hybrid silicon-photonic NoC. The proposed photonic hybrid NoC increases performance by up to 6× (2.7× on average) and reduces the energy-delay² product (ED²P) by up to 99% (79% on average) compared to conventional electrical crossbars. For future GPU systems, we study an electrical 2D-mesh topology, since it scales better than an electrical crossbar. For a 128-CU GPU, the proposed hybrid silicon-photonic NoC improves performance by up to 1.9× (43% on average) and achieves up to a 62% reduction in ED²P (3% on average) in comparison to the best-performing mesh design.
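
A first-order latency model illustrates why a crossbar scales differently from a mesh as the CU count grows (assumed back-of-envelope formulas, not the paper's model):

    #include <math.h>

    /* Average hop count on a k x k mesh grows with sqrt(N) (the mean
     * Manhattan distance is about 2k/3), while a crossbar is a single
     * traversal regardless of N. */
    double mesh_latency_cycles(int n_nodes, double wire_cyc, double router_cyc) {
        double k = sqrt((double)n_nodes);          /* k x k mesh */
        double avg_hops = 2.0 * k / 3.0;
        return avg_hops * (wire_cyc + router_cyc); /* grows with CU count */
    }

    double crossbar_latency_cycles(double traversal_cyc) {
        return traversal_cyc;                      /* flat in N */
    }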


Design, Automation, and Test in Europe | 2012

Design of a collective communication infrastructure for barrier synchronization in cluster-based nanoscale MPSoCs

José L. Abellán; Juan C. Fernandez; Manuel E. Acacio; Davide Bertozzi; Daniele Bortolotti; Andrea Marongiu; Luca Benini

Barrier synchronization is a key programming primitive for shared-memory embedded MPSoCs. As the core count increases, software implementations cannot provide the needed performance and scalability, making hardware acceleration critical. In this paper, we describe an interconnect extension implemented with standard cells and a mainstream industrial toolflow. We show that the area overhead is marginal with respect to the performance improvements of the resulting hardware-accelerated barriers. We integrate our hardware barrier into the OpenMP programming model and discuss its synchronization efficiency compared with traditional software implementations.
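
From the programmer's point of view, the hardware barrier hides behind the same OpenMP construct as a software one; a minimal example of the construct being accelerated:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            printf("thread %d: phase 1\n", tid);
            #pragma omp barrier       /* all threads rendezvous here */
            printf("thread %d: phase 2\n", tid);
        }   /* implicit barrier at the end of the parallel region */
        return 0;
    }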


Computing Frontiers | 2010

Efficient and scalable barrier synchronization for many-core CMPs

José L. Abellán; Juan C. Fernandez; Manuel E. Acacio

In this work we present a novel hardware-based barrier mechanism for synchronization on many-core CMPs. In particular, we leverage global interconnection lines (G-lines) and the S-CSMA technique, which have been used to overcome some limitations of a flow-control mechanism (EVC) in the context of networks-on-chip, to develop a simple G-line-based network that operates independently of the main data network in order to carry out barrier synchronizations. We then evaluate our approach by running several applications on top of the Sim-PowerCMP performance simulator. Our method takes only 4 cycles to carry out the synchronization once all cores or threads have arrived at the barrier. Hence, we obtain much better performance results than software-based barrier implementations in terms of scalability and efficiency.


International Conference on Computational Science | 2008

Characterizing the Basic Synchronization and Communication Operations in Dual Cell-Based Blades

José L. Abellán; Juan C. Fernandez; Manuel E. Acacio

The Cell Broadband Engine (Cell BE) is a heterogeneous chip-multiprocessor (CMP) architecture designed to offer very high performance, especially on game and multimedia applications. The singularity of its architecture, nine cores of two different types, along with the variety of synchronization and communication primitives offered to programmers, makes the task of developing efficient applications very challenging. This situation gets even worse when we consider Dual Cell-Based Blade architectures, where two separate Cells can be linked together through a dedicated high-speed interface. In this work, we present a characterization of the main synchronization and communication primitives provided by Dual Cell-Based Blades under varying workloads. In particular, we focus on the DMA transfer mechanism, the mailboxes, the signals, the read-modify-write atomic operations, and the time taken by thread creation. Our performance results expose the bottlenecks and asymmetries of these platforms, which must be taken into account by programmers to improve the efficiency of their applications.
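
A generic latency-measurement harness of the kind used for such characterizations is sketched below; primitive_under_test is a hypothetical placeholder (the real study exercised the Cell SDK's DMA, mailbox, and signal interfaces, whose calls are not reproduced here):

    #include <stdio.h>
    #include <time.h>

    #define REPS 100000

    extern void primitive_under_test(void);  /* hypothetical: e.g., one
                                                mailbox write + read back */

    int main(void) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < REPS; i++)
            primitive_under_test();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("avg latency: %.1f ns\n", ns / REPS);
        return 0;
    }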


ACM Transactions on Architecture and Code Optimization | 2016

UMH: A Hardware-Based Unified Memory Hierarchy for Systems with Multiple Discrete GPUs

Amir Kavyan Ziabari; Yifan Sun; Yenai Ma; Dana Schaa; José L. Abellán; Rafael Ubal; John Kim; Ajay Joshi; David R. Kaeli

In this article, we describe how to ease memory management between a Central Processing Unit (CPU) and one or multiple discrete Graphics Processing Units (GPUs) by architecting a novel hardware-based Unified Memory Hierarchy (UMH). Adopting UMH, a GPU accesses CPU memory only if it does not find the required data in the directories associated with its high-bandwidth memory, or if the NMOESI coherency protocol limits access to that data. Using UMH with NMOESI improves the performance of a CPU-multiGPU system by at least 1.92× in comparison to alternative software-based approaches. It also allows the CPU to access a GPU's modified data at least 13× faster.
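
The access path the abstract describes can be sketched as follows; the directory and memory hooks are assumed names, and the state list reflects NMOESI's addition of a non-coherent (N) state to the usual MOESI states:

    typedef enum { INVALID, NONCOHERENT, MODIFIED, OWNED,
                   EXCLUSIVE, SHARED } nmoesi_state_t;

    extern nmoesi_state_t dir_lookup(int gpu, unsigned long addr); /* hypothetical */
    extern void access_local_hbm(int gpu, unsigned long addr);     /* hypothetical */
    extern void access_cpu_memory(unsigned long addr);             /* hypothetical */

    /* A GPU load/store first probes the directory of its own
     * high-bandwidth memory; only on a miss does it fall back to
     * CPU memory over the (slower) CPU-GPU link. */
    void umh_access(int gpu, unsigned long addr) {
        if (dir_lookup(gpu, addr) != INVALID)
            access_local_hbm(gpu, addr);
        else
            access_cpu_memory(addr);
    }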

Collaboration


Top co-authors of José L. Abellán:

Juan C. Fernandez

Los Alamos National Laboratory


Juan Fernández

Polytechnic University of Catalonia


José M. Cecilia

Universidad Católica San Antonio de Murcia
