Is this you? Create Your Porfile

Oscar Palomar

Barcelona Supercomputing Center

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Oscar Palomar is active.

Explore More

Publication

Featured researches published by Oscar Palomar.

international symposium on quality electronic design | 2014

PETS: Power and energy estimation tool at system-level

Santhosh-Kumar Rethinagiri; Oscar Palomar; Osman S. Unsal; Adrian Cristal; Rabie Ben-Atitallah; Smail Niar

In this paper, we introduce PETS, a simulation based tool to estimate, analyse and optimize power/energy consumption of an application running on complex state-of-the-art heterogeneous embedded processor based platforms. This tool is integrated with power and energy models in order to support comprehensive design space exploration for low power multi-core and heterogeneous multiprocessor platforms such as OMAP, CARMA, Zynq 7000 and Virtex II Pro. Moreover, PETS is equipped with power optimization techniques such as dynamic slack reduction and work load balancing. The development of PETS involves two steps. First step: power model generation. For the power model development, functional-level parameters are used to set up generic power models for the different components of the system. So far, seven power models have been developed for different architectures, starting from the simple low power architecture ARM9 to the very complex DSP TI C64x. Second step: a simulation based virtual platform framework is developed using SystemC IPs and JIT/ISS compilers to accurately grab the activities to estimate power. The accuracy of our proposed tool is evaluated by using a variety of industrial benchmarks. Estimated power and energy values are compared to real board measurements. The power estimation results are less than 4% of error for single core processor, 4.6% for dual-core processor, 5% for quad-core, 6.8% multi-processor based system and effective optimisation of power/energy for the applications.

high performance computing systems and applications | 2014

Advanced Pattern based Memory Controller for FPGA based HPC applications

Tassadaq Hussain; Oscar Palomar; Osman S. Unsal; Adrián Cristal; Eduard Ayguadé; Mateo Valero

The ever-increasing complexity of high-performance computing applications limits performance due to memory constraints in FPGAs. To address this issue, we propose the Advanced Pattern based Memory Controller (APMC), which supports both regular and irregular memory patterns. The proposed memory controller systematically reduces the latency faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth by using a smart mechanism that collects and stores the different patterns and reuses them when it is needed. In order to prove the effectiveness of the proposed controller, we implemented and tested it on a Xilinx ML505 FPGA board. In order to prove that our controller is efficient in a variety of scenarios, we used several benchmarks with different memory access patterns. The benchmarking results show that our controller consumes 20% less hardware resources, 32% less on chip power and achieves a maximum speedup of 52× and 2.9× for regular and irregular applications respectively.

international symposium on microarchitecture | 2012

Vector Extensions for Decision Support DBMS Acceleration

Timothy Hayes; Oscar Palomar; Osman S. Unsal; Adrian Cristal; Mateo Valero

Database management systems (DBMS) have become an essential tool for industry and research and are often a significant component of data centres. As a result of this criticality, efficient execution of DBMS engines has become an important area of investigation. This work takes a top-down approach to accelerating decision support systems (DSS) on x86-64 microprocessors using vector ISA extensions. In the first step, a leading DSS DBMS is analysed for potential data-level parallelism. We discuss why the existing multimedia SIMD extensions (SSE/AVX) are not suitable for capturing this parallelism and propose a complementary instruction set reminiscent of classical vector architectures. The instruction set is implemented using unintrusive modifications to a modern x86-64 micro architecture tailored for DSS DBMS. The ISA and micro architecture are evaluated using a cycle-accurate x86-64 micro architectural simulator coupled with a highly-detailed memory simulator. We have found a single operator is responsible for 41% of total execution time for the TPC-H DSS benchmark. Our results show performance speedups between 1.94x and 4.56x for an implementation of this operator run with our proposed hardware modifications.

applied reconfigurable computing | 2014

Stand-alone memory controller for graphics system

Tassadaq Hussain; Oscar Palomar; Osman S. Unsal; Adrián Cristal; Eduard Ayguadé; Mateo Valero; Amna Haider

There has been a dramatic increase in the complexity of graphics applications in System-on-Chip (SoC) with a corresponding increase in performance requirements. Various powerful and expensive platforms to support graphical applications appeared recently. All these platforms require a high performance core that manages and schedules the high speed data of graphics peripherals (camera, display, etc.) and an efficient on chip scheduler. In this article we design and propose a SoC based Programmable Graphics Controller (PGC) that handles graphics peripherals efficiently. The data access patterns are described in the program memory; the PGC reads them, generates transactions and manages both bus and connected peripherals without the support of a master core. The proposed system is highly reliable in terms of cost, performance and power. The PGC based system is implemented and tested on a Xilinx ML505 FPGA board. The performance of the PGC is compared with the Microblaze processor based graphic system. When compared with the baseline system, the results show that the PGC captures video at 2x of higher frame rate and achieves 3.4x to 7.4x of speedup while processing images. PGC consumes 30% less hardware resources and 22% less on-chip power than the baseline system.

power and timing modeling optimization and simulation | 2014

VPPET: Virtual platform power and energy estimation tool for heterogeneous MPSoC based FPGA platforms

Santhosh Kumar Rethinagiri; Oscar Palomar; Javier Arias Moreno; Osman S. Unsal; Adrián Cristal

Using low-power symmetric multi-cores on FPGAs are becoming ubiquitous in embedded computing. This is due to the emergence of power and energy as key design metrics, as important as performance. This leads to the requirement of powerful and reliable tools, which will be used for the Design Space Exploration (DSE) based on power and energy at an early stage of the design flow. In this paper, we propose a simulation based virtual platform power and energy estimation tool for heterogeneous Multiprocessor System-on-Chip (MPSoC) based platforms. There are two steps involved in this tool development. The first step is power model generation. For the power model development, we used functional parameters to set up generic power models for different parts of the system. This is a one-time activity. In the second step, a simulation based virtual platform framework is developed to accurately grab the activities used in the related power models generated in the first step. The combination of the two steps leads to a hybrid power estimation, which gives a better trade-off between accuracy and speed. The proposed tool is automated and also scalable for exploring complex embedded multi-core architectures. The efficiency of the proposed tool is validated through multi-cores/processors designed around the FPGAs and extended to accommodate futuristic multi-processors/cores for a reliable energy based DSE. The obtained power/energy estimation results provide less than 4% of error for single core processor, 8% for dual-core processor and 9% for heterogeneous MPSoC based systems when compared to real board measurements.

high-performance computer architecture | 2015

VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors

Timothy Hayes; Oscar Palomar; Osman S. Unsal; Adrián Cristal; Mateo Valero

Sorting is a widely studied problem in computer science and an elementary building block in many of its subfields. There are several known techniques to vectorise and accelerate a handful of sorting algorithms by using single instruction-multiple data (SIMD) instructions. It is expected that the widths and capabilities of SIMD support will improve dramatically in future microprocessor generations and it is not yet clear whether or not these sorting algorithms will be suitable or optimal when executed on them. This work extrapolates the level of SIMD support in future microprocessors and evaluates these algorithms using a simulation framework. The scalability, strengths and weaknesses of each algorithm are experimentally derived. We then propose VSR sort, our own novel vectorised non-comparative sorting algorithm based on radix sort. To facilitate the execution of this algorithm we define two new SIMD instructions and propose a complementary hardware structure for their execution. Our results show that VSR sort has maximum speedups between 14.9x and 20.6x over a scalar baseline and an average speedup of 3.4x over the next-best vectorised sorting algorithm.

Microprocessors and Microsystems | 2015

ParaDIME: Parallel Distributed Infrastructure for Minimization of Energy for data centers

Santhosh Kumar Rethinagiri; Oscar Palomar; Anita Sobe; Gulay Yalcin; Thomas Knauth; J. Rubén Titos Gil; Pablo Prieto; Adrian Cristal; Osman Unsal; Pascal Felber; Christof Fetzer; Dragomir Milojevic

Abstract Dramatic environmental and economic impact of the ever increasing power and energy consumption of modern computing devices in data centers is now a critical challenge. On the one hand, designers use technology scaling as one of the methods to face the phenomenon called dark silicon (only segments of a chip function concurrently due to power restrictions). On the other hand, designers use extreme-scale systems such as teradevices to meet the performance needs of their applications which in turn increases the power consumption of the platform. In order to overcome these challenges, we need novel computing paradigms that address energy efficiency. One of the promising solutions is to incorporate parallel distributed methodologies at different abstraction levels. The FP7 project ParaDIME focuses on this objective to provide different distributed methodologies (software–hardware techniques) at different abstraction levels to attack the power-wall problem. In particular, the ParaDIME framework will utilize: circuit and architecture operation below safe voltage limits for drastic energy savings, specialized energy-aware computing accelerators, heterogeneous computing, energy-aware runtime, approximate computing and power-aware message passing. The major outcome of the project will be a noval processor architecture for a heterogeneous distributed system that utilizes future device characteristics, runtime and programming model for drastic energy savings of data centers. Wherever possible, ParaDIME will adopt multidisciplinary techniques, such as hardware support for message passing, runtime energy optimization utilizing new hardware energy performance counters, use of accelerators for error recovery from sub-safe voltage operation, and approximate computing through annotated code. Furthermore, we will establish and investigate the theoretical limits of energy savings at the device, circuit, architecture, runtime and programming model levels of the computing stack, as well as quantify the actual energy savings achieved by the ParaDIME approach for the complete computing stack with the real environment.

field programmable logic and applications | 2014

MAPC: Memory access pattern based controller

Tassadaq Hussain; Oscar Palomar; Osman S. Unsal; Adrián Cristal; Eduard Ayguadé; Mateo Valero

Traditionally, system designers have attempted to improve system performance by scheduling the processing cores and by exploring different memory system configurations and there is comparatively less work done scheduling the accesses at the memory system level and exploring data accesses on the memory systems. In this paper, we propose a memory access pattern based controller (MAPC). MAPC organizes data accesses in descriptors, prioritizes them with respect to the number and size of transfer requests. When compared to the baseline multicore system, the MAPC based system achieves between 2.41× to 5.34× of speedup for different applications, consumes 28% less hardware resources and 13% less dynamic power.

reconfigurable computing and fpgas | 2013

ReCompAc: Reconfigurable compute accelerator

Milovan Duric; Oscar Palomar; Aaron Smith

This paper presents a promising technique for accelerating frequently executed (hot) code regions. Unlike most accelerators which utilize dedicated hardware structures, our architecture exploits the available execution resources of a chip multiprocessor (CMP), while dynamically specializing into a reconfigurable compute accelerator. Execution units available in one or more general purpose cores are placed close to each other on the chip. An additional reconfigurable switched network is used to couple these units and to specialize the functionality of the accelerator for the hot code regions. The available general purpose hardware in cores executes memory instructions, feeds and controls the accelerator. Moreover, the CMP supports core fusion, and while fusing cores resources it allows for more aggressive memory processing for memory intensive applications. The dynamic specialization of available execution resources increases the performance and efficiency of CMP, while requiring minimal hardware modifications. Our initial results indicate a speedup of 7× compared to a general purpose execution.

international conference on embedded computer systems architectures modeling and simulation | 2014

Dynamic-vector execution on a general purpose EDGE chip multiprocessor

Milovan Duric; Oscar Palomar; Aaron Smith; Milan Stanic; Osman S. Unsal; Adrián Cristal; Mateo Valero; Doug Burger; Alexander V. Veidenbaum

This paper proposes a cost-effective technique that morphs the available cores of a low power chip multiprocessor (CMP) into an accelerator for data parallel (DLP) workloads. Instead of adding a special-purpose vector architecture as an accelerator, our technique leverages the resources of each CMP core to mimic the functionality of a vector processor. The morphing provides dynamic vector execution (DVX) on a general purpose CMP, by adding minimal hardware for vector control. DVX enhances the vector execution by dynamically configuring the allocation of compute and memory resources to match particular workload requirements. As an energy efficient substrate, we utilize modest dual issue cores based on an Explicit Data Graph Execution (EDGE) architecture. The results show that a DVX enabled 4-core EDGE CMP improves the energy-delay product over 14x, at the cost of only 1.1% of additional area. We compare DVX against a CMP that adds a dedicated DLP accelerator based on a conventional high performance vector design. The vector accelerator increases the area footprint over 74%, which greatly affects the cost of the modest processor. DVX avoids the additional costs and yet gains over 86% of the speedup obtained with the dedicated accelerator.

Explore More