Rafael Ubal | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Rafael Ubal is active.

Explore More

Publication

Featured researches published by Rafael Ubal.

international conference on parallel architectures and compilation techniques | 2012

Multi2Sim: a simulation framework for CPU-GPU computing

Rafael Ubal; Byunghyun Jang; Perhaad Mistry; Dana Schaa; David R. Kaeli

Accurate simulation is essential for the proper design and evaluation of any computing platform. Upon the current move toward the CPU-GPU heterogeneous computing era, researchers need a simulation framework that can model both kinds of computing devices and their interaction. In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an ×86 CPU and an AMD Evergreen GPU. Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMDs OpenCL benchmark suite. Simulation capabilities are demonstrated with a preliminary architectural exploration study, and workload characterization examples. The project source code, benchmark packages, and a detailed users guide are publicly available at www.multi2sim.org.

symposium on computer architecture and high performance computing | 2007

Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors

Rafael Ubal; Julio Sahuquillo; Salvador Petit; Pedro López

Current microprocessors are based in complex designs, integrating different components on a single chip, such as hardware threads, processor cores, memory hierarchy or interconnection networks. The permanent need of evaluating new designs on each of these components motivates the development of tools which simulate the system working as a whole. In this paper, we present the Multi2Sim simulation framework, which models the major components of incoming systems, and is intended to cover the limitations of existing simulators. A set of simulation examples is also included for illustrative purposes.

international conference on intelligent pervasive computing | 2007

Leakage Current Reduction in Data Caches on Embedded Systems

Rafael Ubal; Julio Sahuquillo; Salvador Petit; Houcine Hassan; Pedro López

Nowadays, embedded systems can be found in a wide range of pervasive devices (e.g., smart phones, PDAs, or video/digital cameras). These devices contain large cache memories, whose power consumption can reach about 50% of the total spent energy, from which leakage energy is the predominant fraction in current technologies. This paper proposes a technique to reduce leakage energy consumption in data caches on embedded systems, which is based on the fact that most stored bits take a logical value of zero. The proposed technique has been evaluated on a model of a contemporary high-end embedded microprocessor, namely the ARM Cortex A8 processor, executing a set of standard embedded benchmarks. Experimental results show that leakage energy savings reach about 40% with no IPC loss.

international conference on supercomputing | 2015

Leveraging Silicon-Photonic NoC for Designing Scalable GPUs

Amir Kavyan Ziabari; José L. Abellán; Rafael Ubal; Chao Chen; Ajay Joshi; David R. Kaeli

Silicon-photonic link technology promises to satisfy the growing need for high bandwidth, low-latency and energy-efficient network-on-chip (NoC) architectures. While silicon-photonic NoC designs have been extensively studied for future many-core systems, their use in massively-threaded GPUs has received little attention to date. In this paper, we first analyze an electrical NoC which connects different cache levels (L1 to L2) in a contemporary GPU memory hierarchy. Evaluating workloads from the AMD SDK run on the Multi2sim GPU simulator finds that, apart from limits in memory bandwidth, an electrical NoC can significantly hamper performance and impede scalability, especially as the number of compute units grows in future GPU systems. To address this issue, we advocate using silicon-photonic link technology for on-chip communication in GPUs, and we present the first GPU-specific analysis of a cost-effective hybrid photonic crossbar NoC. Our baseline is based on an AMD Southern Islands GPU with 32 compute units (CUs) and we compare this design to our proposed hybrid silicon-photonic NoC. Our proposed photonic hybrid NoC increases performance by up to 6 x (2.7 x on average) and reduces the energy-delay2 product (ED2P) by up to 99% (79% on average) as compared to conventional electrical crossbars. For future GPU systems, we study an electrical 2D-mesh topology since it scales better than an electrical crossbar. For a 128-CU GPU, the proposed hybrid silicon-photonic NoC can improve performance by up to 1.9 x (43% on average) and achieve up to 62% reduction in ED2P (3% on average) in comparison to mesh design with best performance.

design automation conference | 2014

Exploring the Heterogeneous Design Space for both Performance and Reliability

Rafael Ubal; Dana Schaa; Perhaad Mistry; Xiang Gong; Yash Ukidave; Zhongliang Chen; Gunar Schirner; David R. Kaeli

As we move into a new era of heterogeneous multi-core systems, our ability to tune the performance and understand the reliability of both hardware and software becomes more challenging. Given the multiplicity of different design trade-offs in hardware and software, and the rate of introduction of new architectures and hardware/-software features, it becomes difficult to properly model emerging heterogeneous platforms. In this paper we present a new methodology to address these challenges in a flexible and extensible framework. We describe the design of a framework that supports a range of heterogeneous devices to be evaluated based on different performance/reliability criteria. We address heterogeneity both in hardware and software, providing a flexible framework that can be easily adapted and extended as new elements in the SoC stack continue to evolve. Our framework enables modeling at different levels of abstraction and interfaces to existing tools to compose hybrid modeling environments. We also consider the role of software, providing a flexible and modifiable compiler stack based on LLVM. We provide examples that highlight both the flexibility of this framework and demonstrate the utility of the tools.

ACM Transactions on Architecture and Code Optimization | 2016

UMH: A Hardware-Based Unified Memory Hierarchy for Systems with Multiple Discrete GPUs

Amir Kavyan Ziabari; Yifan Sun; Yenai Ma; Dana Schaa; José L. Abellán; Rafael Ubal; John Kim; Ajay Joshi; David R. Kaeli

In this article, we describe how to ease memory management between a Central Processing Unit (CPU) and one or multiple discrete Graphic Processing Units (GPUs) by architecting a novel hardware-based Unified Memory Hierarchy (UMH). Adopting UMH, a GPU accesses the CPU memory only if it does not find its required data in the directories associated with its high-bandwidth memory, or the NMOESI coherency protocol limits the access to that data. Using UMH with NMOESI improves performance of a CPU-multiGPU system by at least 1.92 × in comparison to alternative software-based approaches. It also allows the CPU to access GPUs modified data by at least 13 × faster.

international parallel and distributed processing symposium | 2008

The impact of out-of-order commit in coarse-grain, fine-grain and simultaneous multithreaded architectures

Rafael Ubal; Julio Sahuquillo; Salvador Petit; Pedro López; José Duato

Multithreaded processors in their different organizations (simultaneous, coarse grain and fine grain) have been shown as effective architectures to reduce the issue waste. On the other hand, retiring instructions from the pipeline in an out-of-order fashion helps to unclog the ROB when a long latency instruction reaches its head. This further contributes to maintain a higher utilization of the available issue bandwidth. In this paper, we evaluate the impact of retiring instructions out of order on different multithreaded architectures and different instruction fetch policies, using the recently proposed Validation Buffer microarchitecture as baseline out-of-order commit technique. Experimental results show that, for the same performance, out-of-order commit permits to reduce multithread hardware complexity (e.g., fine grain multithreading with a lower number of supported threads).

international symposium on performance analysis of systems and software | 2017

Multi2Sim Kepler: A detailed architectural GPU simulator

Xun Gong; Rafael Ubal; David R. Kaeli

Presilicon simulation is one of the key toolsets for computer architects to evaluate and optimize their future designs. As Graphics Processing Units (GPUs) have become the platform of choice in many computing communities due to their impressive processing capabilities, computer architecture researchers need a simulation framework that allows them to quantitatively consider design tradeoffs. In this paper, we present the Multi2Sim Kepler simulator framework, a new detailed GPU microarchitecture performance simulator that supports NVIDIAs Kepler shader assembly (SASS) code execution. The toolset provides a disassembler, a functional simulator and a detailed cycle-based simulator. We provide insight into the architecture of the NVIDIA Kepler GPU, describing the details of the streaming multiprocessor, front-end and instruction pipelines. We compare the performance of this new simulator against an NVIDIA K20X, a high-end Kepler device. We also evaluate the performance of NVIDIAs CUDA benchmark suite on our GPU performance simulator.

digital systems design | 2009

An Efficient Low-Complexity Alternative to the ROB for Out-of-Order Retirement of Instructions

Salvador Petit; Rafael Ubal; Julio Sahuquillo; Pedro López; José Duato

Current superscalar processors use a Reorder Buffer (ROB) to support speculation, precise exceptions, and register reclamation. Instructions are retired from this structure in program order, which may lead to significant performance degradation if a long latency operation blocks the ROB head. In this paper, a checkpoint-free out-of-order commit architecture is proposed, which replaces the ROB with a small structure called Validation Buffer (VB) from which instructions are retired as soon as their speculative state is resolved. An aggressive register reclamation mechanism targeted to this microarchitecture is also devised. Experimental results show that the VB microarchitecture is much more efficient than a ROB-based microprocessor. For example, a 32-entry VB provides similar performance to a 256-entry ROB, while reducing the utilization of other major processor structures.

international conference on parallel architectures and compilation techniques | 2007

VB-MT: Design Issues and Performance of the Validation Buffer Microarchitecture for Multithreaded Processors

Rafael Ubal; Julio Sahuquillo; Salvador Petit; Pedro López; José Duato

The validation buffer (VB) Microarchitecture retires instructions out of order, by substituting the classical ROB by the VB structure. The VB removes the negative effect of long latency instructions located at the ROB head, which prevent other instructions from retiring and cause frequent pipeline stalls due to lack of space in the ROB. This work analyzes different multithreading models (coarse grain, fine grain and simultaneous multithreading) and a set of different instruction fetch policies.

Explore More