
Publication


Featured research published by Miquel Pericàs.


Applied Reconfigurable Computing | 2012

PPMC: a programmable pattern based memory controller

Tassadaq Hussain; Muhammad Shafiq; Miquel Pericàs; Nacho Navarro; Eduard Ayguadé

One of the main challenges in the design of hardware accelerators is efficient access to data in external memory. Improving and optimizing the functionality of the memory controller between the external memory and the accelerators is therefore critical. In this paper, we advance toward this goal by proposing PPMC, the Programmable Pattern-based Memory Controller. This controller supports scatter-gather and strided 1D, 2D and 3D accesses with programmable tiling. Compared to existing solutions, the proposed system provides better performance, simplifies the programming of access patterns, and eases software integration by interfacing with high-level programming languages. In addition, the controller offers an interface for automating domain decomposition via tiling. We implemented and tested PPMC on a Xilinx ML505 evaluation board using a MicroBlaze soft core as the host processor. The evaluation uses six memory-intensive application kernels: Laplacian solver, FIR, FFT, thresholding, matrix multiplication, and 3D stencil. The results show that the PPMC-enhanced system achieves at least 10× speedups for 1D, 2D and 3D memory accesses compared to a non-PPMC-based setup.
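As an illustration of what drives such a controller, the sketch below generates the address stream for a strided, multi-dimensional access from a per-dimension (count, stride) descriptor. The descriptor format and names are our own invention for illustration, not taken from the paper.

```python
# Hypothetical pattern-descriptor-driven address generation, in the spirit
# of a pattern-based memory controller. Names are illustrative.
from itertools import product

def generate_addresses(base, dims):
    """dims: list of (count, stride) pairs, outermost dimension first.
    Returns the flat address stream such a controller would issue."""
    addrs = []
    for idx in product(*(range(count) for count, _ in dims)):
        addrs.append(base + sum(i * stride for i, (_, stride) in zip(idx, dims)))
    return addrs

# A 2D tile: 2 rows of 3 consecutive words, rows 16 words apart.
print(generate_addresses(0x1000, [(2, 16), (3, 1)]))
# [4096, 4097, 4098, 4112, 4113, 4114]
```

Adding a third (count, stride) pair yields the 3D case; tiling amounts to nesting one descriptor per tile dimension.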


Field-Programmable Technology | 2009

Exploiting memory customization in FPGA for 3D stencil computations

Muhammad Shafiq; Miquel Pericàs; Raúl de la Cruz; Mauricio Araya-Polo; Nacho Navarro; Eduard Ayguadé

3D stencil computations are compute-intensive kernels often appearing in high-performance scientific and engineering applications. The key to efficiency in these memory-bound kernels is full exploitation of data reuse. This paper explores the design aspects of 3D-stencil implementations that maximize the reuse of all input data on an FPGA architecture. The work focuses on the architectural design of 3D stencils of the form n × (n + 1) × n, where n = {2, 4, 6, 8, ...}. The performance of the architecture is evaluated using two design approaches, "Multi-Volume" and "Single-Volume". When n = 8, the designs achieve a sustained throughput of 55.5 GFLOPS in the "Single-Volume" approach and 103 GFLOPS in the "Multi-Volume" approach in a 100-200 MHz multi-rate implementation on a Virtex-4 LX200 FPGA. This corresponds to a stencil data delivery of 1500 bytes/cycle and 2800 bytes/cycle, respectively. The implementation is analyzed and compared to two CPU cache approaches and to the statically scheduled local stores of the IBM PowerXCell 8i. The FPGA approaches designed here achieve much higher bandwidth despite the FPGA device being the least recent of the chips considered. These numbers show how a custom memory organization can provide large data throughput when implementing 3D stencil kernels.
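To make the data-reuse structure of such kernels concrete, here is a minimal 7-point 3D stencil reference in plain Python. The paper's designs are FPGA pipelines, not software; this only shows the neighbour pattern whose inputs the hardware streams and reuses across adjacent output points.

```python
# Reference 7-point 3D stencil on an n×n×n volume (boundary left at zero).
# Each input point is read by up to seven output computations, which is the
# reuse a custom on-chip memory organization can capture.
def stencil3d(a, n):
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                out[i][j][k] = (a[i][j][k]
                                + a[i - 1][j][k] + a[i + 1][j][k]
                                + a[i][j - 1][k] + a[i][j + 1][k]
                                + a[i][j][k - 1] + a[i][j][k + 1]) / 7.0
    return out
```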


Field-Programmable Technology | 2011

Implementation of a Reverse Time Migration kernel using the HCE High Level Synthesis tool

Tassadaq Hussain; Miquel Pericàs; Nacho Navarro; Eduard Ayguadé

Reconfigurable computers have started to appear in the HPC landscape, albeit at a slow pace. Adoption is still hindered by the design methodologies and slow implementation cycles. Recently, methodologies based on High Level Synthesis (HLS) have begun to flourish, and the reconfigurable supercomputing community is slowly adopting these techniques. In this paper we took a geophysics application and implemented it on an FPGA using an HLS tool called HCE. The application, Reverse Time Migration, is an important code for subsalt imaging. It is also highly demanding, both computationally and in its memory requirements. The complexity of this code makes it challenging to implement using an HLS methodology instead of HDL. We study the achieved performance and compare it with hand-written HDL and with software-based execution. The resulting design, when implemented on the Altera Stratix IV EP4SGX230 and EP4SGX530 devices, achieves 11.2 and 22 GFLOPS, respectively. On these devices, the design achieves up to 4.2× and 7.9× improvement, respectively, over a general-purpose processor core (Intel i7).


International Conference on Supercomputing | 2005

An asymmetric clustered processor based on value content

Ruben Gonzalez; Adrián Cristal; Miquel Pericàs; Mateo Valero; Alexander V. Veidenbaum

This paper proposes a new organization for clustered processors. Such processors have many advantages, including improved implementability and scalability, reduced power, and, potentially, faster clock speed. The difficulty lies in assigning instructions to clusters (steering) so as to minimize the effect of inter-cluster communication latency. The asymmetric clustered architecture proposed in this paper aims to increase IPC and reduce power consumption by using two different types of integer clusters and a new steering algorithm. One type is a standard, 64-bit integer cluster, while the other is a very narrow, 20-bit cluster. The narrow cluster runs at twice the clock rate of the standard cluster. A new instruction steering mechanism is proposed to increase the use of the fast, narrow cluster as well as to minimize inter-cluster communication. Steering is performed by a history-based predictor, which is shown to be 98% accurate. The proposed architecture is shown to have a higher average IPC than its un-clustered equivalent for a four-wide issue processor, something that no previously proposed clustered organization had achieved. Overall, a 3% increase in average IPC over an un-clustered design and an 8% increase over a symmetric cluster with dependence-based steering are achieved for a 2-cycle inter-cluster communication latency. Part of the reason for the higher IPC is the ability of the new architecture to execute most address computations as narrow, fast operations. The new architecture exploits its early knowledge of partial address values to achieve 0-cycle address translation for 90% of all address computations, further improving performance.
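A last-outcome predictor of the kind such steering could build on can be sketched as follows. The table size, indexing scheme and 20-bit width test are illustrative assumptions, not the paper's exact design.

```python
# Toy history-based width predictor: a table indexed by instruction address
# records whether that instruction's last result was "narrow" (fits in the
# small cluster's 20-bit datapath) and predicts the same outcome next time.
class WidthPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [True] * entries      # optimistically predict "narrow"

    def predict(self, pc):
        return self.table[pc % self.entries]

    def update(self, pc, result):
        # A result is narrow if it fits in a signed 20-bit value.
        self.table[pc % self.entries] = -(1 << 19) <= result < (1 << 19)
```

Last-outcome schemes like this work well here because most address computations produce values whose upper bits rarely change, which is also why the paper can run them on the narrow, fast cluster.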


Symposium on Application Specific Processors | 2011

TARCAD: A template architecture for reconfigurable accelerator designs

Muhammad Shafiq; Miquel Pericàs; Nacho Navarro; Eduard Ayguadé

In the race towards computational efficiency, accelerators are achieving prominence. Among the different types, accelerators built using reconfigurable fabric, such as FPGAs, have tremendous potential due to the ability to customize the hardware to the application. However, the lack of a standard design methodology hinders the adoption of such devices and makes portability and reusability across designs difficult. In addition, generation of highly customized circuits does not integrate nicely with high-level synthesis tools. In this work, we introduce TARCAD, a template architecture for designing reconfigurable accelerators. TARCAD enables high customization of the data management and compute engines while retaining a programming model based on generic programming principles. The template provides generality and scalable performance over a range of FPGAs. We describe the template architecture in detail and show how to implement five important scientific kernels: MxM, Acoustic Wave Equation, FFT, SpMV and Smith-Waterman. TARCAD is compared with other High Level Synthesis models and is evaluated against GPUs, a well-known architecture that is far less customizable and, therefore, easier to target from a simple and portable programming model. We analyze the TARCAD template and compare its efficiency on a large Xilinx Virtex-6 device to that of several recent GPU studies.


High Performance Computing for Computational Science (Vector and Parallel Processing) | 2008

Vectorized AES Core for High-throughput Secure Environments

Miquel Pericàs; Ricardo Chaves; Georgi Gaydadjiev; Stamatis Vassiliadis; Mateo Valero

Parallelism has long been used to increase the throughput of applications that process independent data. With the advent of multicore technology, designers and programmers are increasingly forced to think in parallel. In this paper we present the evaluation of an encryption core capable of handling multiple data streams. The design is oriented towards future internet scenarios, where throughput capacity requirements together with privacy and integrity will be critical for both personal and corporate users. To power such scenarios, we present a technique that increases the efficiency of memory bandwidth utilization in cryptographic cores. We propose to feed cryptographic engines with multiple streams to better exploit the available bandwidth. To validate our claims, we have developed an AES core capable of encrypting two streams in parallel using either ECB or CBC mode. Our AES core implementation consumes a trivial amount of resources when targeting a Virtex-II Pro FPGA device.
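The scheduling idea, independent of the cipher itself, can be sketched as follows. The block cipher is replaced by a placeholder XOR stand-in (not AES), and the function names are invented for illustration; the point is the round-robin interleaving that keeps a pipelined core busy with independent blocks from both streams.

```python
# Two-stream interleaved encryption sketch. In hardware, alternating blocks
# from independent streams fills pipeline stages that a single dependent
# stream (e.g. CBC chaining) would leave idle.
def encrypt_block(block, key):
    return bytes(b ^ k for b, k in zip(block, key))  # XOR stand-in for AES

def encrypt_two_streams(s1, s2, key, block=16):
    out1, out2 = [], []
    for i in range(0, max(len(s1), len(s2)), block):
        if i < len(s1):
            out1.append(encrypt_block(s1[i:i + block], key))  # stream 1's turn
        if i < len(s2):
            out2.append(encrypt_block(s2[i:i + block], key))  # stream 2's turn
    return b"".join(out1), b"".join(out2)
```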


IEEE International Conference on High Performance Computing, Data and Analytics | 2004

High-performance and low-power VLIW cores for numerical computations

Miquel Pericàs; Eduard Ayguadé; Javier Zalamea; Josep Llosa; Mateo Valero

Issue logic is among the worst-scaling structures in a modern microprocessor: increasing the issue width increases the processor area exponentially. Bigger processors will have inherently larger wire delays. In this scenario, technology scaling will yield smaller performance improvements, as the wire delays do not decrease; instead, they start to dominate the clock cycle. In order to offer higher performance, the wire problem needs to be tackled. This paper discusses two methods that attempt to move the wire problem out of the critical path. The first is the clustering technique, which directly approaches the wire problem by combining several smaller execution cores in the processor backend to perform the computations. Each core has a smaller issue width and a much smaller area. The second technique we study is the widening technique, which consists of reducing the issue width of the processor while giving the instructions SIMD capabilities. The parallelism here is small (normally two to four) and does not resemble multimedia or vector extensions. Wide processors use wide functional units that compute the same operation on multiple words. The rationale behind this idea is that by reducing the issue width (but not the computational bandwidth), we also reduce the issue-logic circuitry and the complexity of structures such as the register file and the cache memory. When compared with a centralised core with 128 registers, 8 FPUs and 4 memory ports, our approach, using an equivalent amount of hardware units, achieves speedups of up to 1.7.
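The arithmetic behind the widening argument fits in a few lines: halving the issue width while doubling the width of each functional unit keeps the operations completed per cycle constant, while the issue logic only has to arbitrate half as many slots. The numbers below are illustrative, not from the paper.

```python
# Idealized throughput model for the widening technique: a wide processor
# with half the issue width but 2-word functional units matches the
# computational bandwidth of the scalar design.
def cycles_scalar(n_ops, issue_width=4):
    return -(-n_ops // issue_width)                    # ceil division

def cycles_wide(n_ops, issue_width=2, simd_width=2):
    return -(-n_ops // (issue_width * simd_width))

print(cycles_scalar(64), cycles_wide(64))              # 16 16
```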


IEEE International Conference on High Performance Computing, Data and Analytics | 2003

Power-Performance Trade-Offs in Wide and Clustered VLIW Cores for Numerical Codes

Miquel Pericàs; Eduard Ayguadé; Javier Zalamea; Josep Llosa; Mateo Valero

Instruction-Level Parallelism (ILP) is the main source of performance achievable in numerical applications. Architectural resources and program recurrences are the main limitations on the amount of ILP exploitable from loops, the most time-consuming parts of numerical computations. In order to increase the issue rate, current designs use growing degrees of resource replication for memory ports and functional units. But the high cost of this technique in terms of power, area and clock cycle is making it less attractive.


Irregular Applications: Architectures and Algorithms | 2011

Implementation of a hierarchical N-body simulator using the OmpSs programming model

Miquel Pericàs; Xavier Martorell; Yoav Etsion

Many HPC algorithms are highly irregular. They have input-dependent control flow and operate on pointer-based data structures such as trees, graphs, or linked lists. This irregularity makes it challenging to parallelize such algorithms in order to efficiently run them on modern HPC systems. In this paper we study the architectural and programming bottlenecks of the OmpSs task-based programming model when implementing irregular applications. We select a sequential N-body simulation code and describe its parallelization using OmpSs. We then analyze the code, focusing on scalability and load balancing. We conclude that, in general, task-based programming models are well suited to the exploitation of irregular parallelism. Nevertheless, in order to avoid the overheads associated with manually managing the load balancing, the hardware and runtime will need to collectively support much finer-grained tasks.
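The granularity trade-off can be illustrated with a cutoff-based decomposition: subtrees above a chosen depth become parallel tasks, while deeper recursion stays sequential so tasks do not become too fine-grained. This is a Python sketch of the general idea using threads, not OmpSs code; the cutoff scheme and names are our illustration.

```python
# Task-based tree reduction with a granularity cutoff. Nodes are
# (left, right, value) tuples, or None.
from concurrent.futures import ThreadPoolExecutor

def collect_subtrees(node, depth, cutoff, out):
    """Descend to `cutoff` depth, summing shallow values and collecting
    subtree roots that will become independent tasks."""
    if node is None:
        return 0
    if depth == cutoff:
        out.append(node)
        return 0
    left, right, value = node
    return (value
            + collect_subtrees(left, depth + 1, cutoff, out)
            + collect_subtrees(right, depth + 1, cutoff, out))

def seq_sum(node):
    if node is None:
        return 0
    left, right, value = node
    return value + seq_sum(left) + seq_sum(right)

def parallel_tree_sum(root, cutoff=2):
    roots = []
    top = collect_subtrees(root, 0, cutoff, roots)
    with ThreadPoolExecutor() as pool:
        return top + sum(pool.map(seq_sum, roots))   # one task per subtree
```

With an irregular tree, the subtree tasks have very different sizes, which is exactly the load-balancing problem the paper discusses; a finer cutoff yields more, smaller tasks at the cost of higher runtime overhead.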


IEEE International Conference on High Performance Computing, Data and Analytics | 2005

Decoupled state-execute architecture

Miquel Pericàs; Adrián Cristal; Ruben Gonzalez; Alexander V. Veidenbaum; Mateo Valero

The majority of register file designs follow one of two well-known approaches. Many modern high-performance processors (POWER4 [1], Pentium 4 [2]) use a merged register file that holds both architectural and rename registers. Other processors use a Future File (e.g., Opteron [3]) with rename registers kept separately in reservation stations. Both approaches have issues that may limit their application in future microprocessors. The merged register file scales poorly in terms of power-performance, while the Future File pays a large penalty on branch misprediction recovery. In addition, the Future File requires the less scalable mechanism of reservation stations. This paper proposes to combine the best aspects of the traditional Future File architecture with those of the merged physical register file. The key point is that the new architecture separates the processor state, in particular the registers, from the execution units in the pipeline back-end; it is therefore called the Decoupled State-Execute Architecture. The resulting register file can be accessed in the pipeline front-end and has several desirable properties that allow efficient application of several optimizations, most notably register file banking and a novel writeback filtering mechanism. As a result, only a 1.0% IPC degradation was observed with aggressive banking, and energy consumption was lowered by the new writeback filtering technique. Together, the two optimizations remove approximately 80% of the energy consumed in the register file data array.
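One plausible form of writeback filtering can be sketched as a toy model: a result is written to the register file only if its physical register is still architecturally mapped, and thus might be read later. The filtering condition here is our assumption for illustration; the paper does not spell out its mechanism in this abstract.

```python
# Toy writeback filter: skip register-file writes for results whose
# physical register has already been unmapped by a later rename,
# since no future instruction can read them from the register file.
class WritebackFilter:
    def __init__(self):
        self.live = set()              # physical registers still mapped

    def rename(self, arch_to_phys):
        # Snapshot of the current architectural-to-physical mapping.
        self.live = set(arch_to_phys.values())

    def should_write(self, phys_reg):
        return phys_reg in self.live
```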

Collaboration


Dive into Miquel Pericàs's collaborations.

Top Co-Authors

Mateo Valero
Polytechnic University of Catalonia

Eduard Ayguadé
Barcelona Supercomputing Center

Adrián Cristal
Barcelona Supercomputing Center

Javier Zalamea
Polytechnic University of Catalonia

Josep Llosa
Polytechnic University of Catalonia

Nacho Navarro
Polytechnic University of Catalonia

Muhammad Shafiq
Barcelona Supercomputing Center

Tassadaq Hussain
Riphah International University