Emilio Castillo | Researchain

Publication

Featured research published by Emilio Castillo.

european conference on parallel processing | 2015

Runtime-aware architectures

Marc Casas; Miquel Moreto; Lluc Alvarez; Emilio Castillo; Dimitrios Chasapis; Timothy Hayes; Luc Jaulmes; Oscar Palomar; Osman S. Unsal; Adrián Cristal; Eduard Ayguadé; Jesús Labarta; Mateo Valero

In the last few years, the traditional ways of keeping hardware performance growing at the rate predicted by Moore’s Law have vanished. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well-defined Instruction Set Architecture (ISA). This simple interface allowed developing applications without worrying too much about the underlying hardware, while hardware designers were able to aggressively exploit instruction-level parallelism (ILP) in superscalar processors. Current multi-cores are designed as simple symmetric multiprocessors (SMP) on a chip. However, we believe that this is not enough to overcome all the problems that multi-cores face. The runtime system of the parallel programming model has to drive the design of future multi-cores to overcome the restrictions in terms of power, memory, programmability and resilience that multi-cores have. In the paper, we introduce an approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime’s perspective.

field-programmable logic and applications | 2008

Cluster architecture based on low cost reconfigurable hardware

Cesar Pedraza; Emilio Castillo; Javier Castillo; Cristobal Camarero; José Luis Bosque; José Ignacio Martínez; Rafael Menendez

The SMILE project accelerates scientific and industrial applications by means of a cluster of low-cost FPGA boards. With this approach the intensive calculation tasks are accelerated using the FPGA logic, while the communication patterns of the applications remain unchanged by using a Message Passing Library over Linux. This paper explains the cluster architecture: the SMILE nodes and the high-speed communication network developed for the FPGA RocketIO interfaces. A SystemC model developed to simulate the cluster is also detailed. To show the potential of the SMILE proposal, a Content-Based Information Retrieval parallel application has been developed and compared with an HP cluster architecture in terms of response time and power consumption.

international parallel and distributed processing symposium | 2009

Hardware accelerated montecarlo financial simulation over low cost FPGA cluster

Javier Castillo; José Luis Bosque; Emilio Castillo; Pablo Huerta; José Ignacio Martínez

The use of computational systems to help make the right investment decisions in financial markets is an open research field where multiple efforts have been carried out during the last few years. The ability to improve the assessment process and to be faster than the rest of the players is one of the keys to success in this competitive scenario. This paper explores different options to accelerate the computation of the option pricing problem (supercomputer, FPGA cluster or GPU) using the Monte Carlo method to solve the Black-Scholes formula, and presents a quantitative study of their performance and scalability.
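As a rough illustration of the kind of computation this abstract refers to, the sketch below prices a European call option by Monte Carlo simulation under the Black-Scholes model and compares the estimate with the closed-form price. It is not the paper's implementation: the function names, the parameter values, and the choice of a European call are assumptions made for the example, and it runs sequentially on a CPU with the Python standard library only.

```python
import math
import random


def black_scholes_call(s0, k, r, sigma, t):
    """Closed-form Black-Scholes price of a European call (reference value)."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)

    def cdf(x):
        # Standard normal cumulative distribution function.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return s0 * cdf(d1) - k * math.exp(-r * t) * cdf(d2)


def monte_carlo_call(s0, k, r, sigma, t, n_paths, seed=0):
    """Monte Carlo estimate of the same price.

    Each path draws one terminal price from geometric Brownian motion;
    the discounted mean payoff is the estimate. Paths are independent,
    which is what makes the method easy to spread over many cores.
    """
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        terminal_price = s0 * math.exp(drift + vol * z)
        total += max(terminal_price - k, 0.0)
    return math.exp(-r * t) * total / n_paths


if __name__ == "__main__":
    # Hypothetical parameters chosen only for the demonstration.
    params = dict(s0=100.0, k=100.0, r=0.05, sigma=0.2, t=1.0)
    print("closed form:", black_scholes_call(**params))
    print("monte carlo:", monte_carlo_call(n_paths=100_000, **params))
```

Because every simulated path is independent, the loop body can be distributed across processors or hardware units with only a final reduction, which is why this workload is a common target for accelerators.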

The Journal of Supercomputing | 2015

Financial applications on multi-CPU and multi-GPU architectures

Emilio Castillo; Cristobal Camarero; Ana Borrego; José Luis Bosque

The use of high-performance computing systems to help make the right investment decisions in financial markets is an open research field where multiple efforts have been carried out during the past few years. Specifically, the Heath–Jarrow–Morton (HJM) model has a number of features that make it well suited for implementation on massively parallel architectures. This paper presents a multi-CPU and multi-GPU implementation of the HJM model that improves both performance and energy efficiency. The experimental results reveal that the proposed architectures achieve excellent performance improvements and also improve the energy efficiency and the cost/performance ratio.

international conference on parallel architectures and compilation techniques | 2015

Runtime-Guided Management of Scratchpad Memories in Multicore Architectures

Lluc Alvarez; Miquel Moreto; Marc Casas; Emilio Castillo; Xavier Martorell; Jesús Labarta; Eduard Ayguadé; Mateo Valero

The increasing number of cores and the anticipated level of heterogeneity in upcoming multicore architectures cause important problems in traditional cache hierarchies. A good way to alleviate these problems is to add scratchpad memories alongside the cache hierarchy, forming a hybrid memory hierarchy. This memory organization has the potential to improve performance and to reduce the power consumption and the on-chip network traffic, but exposing such a complex memory model to the programmer has a very negative impact on the programmability of the architecture. Emerging task-based programming models are a promising alternative to program heterogeneous multicore architectures. In these models the runtime system manages the execution of the tasks on the architecture, allowing it to apply many optimizations in a generic way at the runtime system level. This paper proposes giving the runtime system the responsibility to manage the scratchpad memories of a hybrid memory hierarchy in multicore processors, transparently to the programmer. In the envisioned system, the runtime system takes advantage of the information found in the task dependences to map the inputs and outputs of a task to the scratchpad memory of the core that is going to execute it. In addition, the paper exploits two mechanisms to overlap the data transfers with computation and a locality-aware scheduler to reduce the data motion. In a 32-core multicore architecture, the hybrid memory hierarchy outperforms cache-only hierarchies by up to 16%, reduces on-chip network traffic by up to 31% and saves up to 22% of the consumed power.

digital systems design | 2013

Advanced Switching Mechanisms for Forthcoming On-Chip Networks

Emilio Castillo; Cristobal Camarero; Esteban Stafford; Fernando Vallejo; José Luis Bosque; Ramón Beivide

Many current VLSI on-chip multiprocessors and systems-on-chip employ point-to-point switched interconnection networks. Rings and 2D-meshes are among the most popular interconnection topologies for these increasingly important on-chip networks. Nevertheless, rings cannot scale beyond dozens of nodes and meshes are asymmetric. Two of the key features of square 2D-tori are their scalability and symmetry. As higher scalability is demanded by the increasing number of cores (or specialized units) integrated on a chip and symmetry is critical for high performance and load balancing, we concentrate on 2D-tori. However, the most popular deadlock-free routing mechanisms are based on Dimension Order Routing (DOR), which breaks the torus symmetry when managing adversarial traffic patterns. This paper analyzes this problem and its consequences. After that, it proposes a new deadlock-free fully adaptive minimal routing, denoted as σDOR, that preserves tori symmetry under any load. It uses just two virtual channels to avoid DOR-induced asymmetry, the same as in previous competitive proposals. σDOR exhibits better behavior than any previous solution as it allows packets to dynamically adapt to local congestion. Experimental results demonstrate the superior performance of our mechanism, confirming the negative impact of DOR asymmetry.
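For readers unfamiliar with the baseline that this abstract criticizes, the following is a minimal sketch of plain Dimension Order Routing on a k × k torus: a packet first corrects its X coordinate, then its Y coordinate, taking the shorter way around each ring. It is not σDOR and does not model virtual channels, deadlock avoidance, or congestion; the function names and the tie-breaking rule (prefer the positive direction when both ways are equally long) are assumptions made for the example.

```python
def ring_offset(src, dst, k):
    """Signed minimal hop count from src to dst on a ring of k nodes.

    Positive means travel in the increasing direction. When both
    directions are equally long (k even, distance k/2), the positive
    direction is chosen; this tie-break is an assumption of the sketch.
    """
    d = (dst - src) % k
    return d if d <= k // 2 else d - k


def dor_route(src, dst, k):
    """Return the list of nodes visited by X-then-Y minimal routing."""
    x, y = src
    path = [(x, y)]

    # Phase 1: resolve the X dimension completely.
    dx = ring_offset(x, dst[0], k)
    step = 1 if dx > 0 else -1
    for _ in range(abs(dx)):
        x = (x + step) % k
        path.append((x, y))

    # Phase 2: only then resolve the Y dimension.
    dy = ring_offset(y, dst[1], k)
    step = 1 if dy > 0 else -1
    for _ in range(abs(dy)):
        y = (y + step) % k
        path.append((x, y))

    return path


if __name__ == "__main__":
    # On a 4 x 4 torus, going from (0, 0) to (3, 1) uses the X wraparound link.
    print(dor_route((0, 0), (3, 1), 4))
```

Because every packet exhausts its X hops before any Y hop, the two dimensions are not treated alike, which is the kind of ordering-induced asymmetry the abstract attributes to DOR under adversarial traffic.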

Archive | 2013

Low Cost High Performance Reconfigurable Computing

Javier Castillo; José Luis Bosque; Cesar Pedraza; Emilio Castillo; Pablo Huerta; José Ignacio Martínez

High Performance Reconfigurable Computing (HPRC) has emerged as an alternative way to accelerate applications using FPGAs. Although HPRC systems offer performance comparable to standard supercomputers at a much lower cost, they are still not affordable for many institutions. We present a low-cost HPRC system built on standard FPGA boards with an architecture that can execute many scientific applications faster than Graphics Processing Units and traditional supercomputers. The system is made up of 32 low-cost FPGA boards and a custom-made high-speed network interface using RocketIO interfaces. We have designed a SystemC methodology and CAD framework that allow the designer to simulate any MPI scientific application before generating the final implementation files. The software runs on the PowerPC processor embedded in the FPGA on a light ad-hoc implementation of MPI, and the hardware is automatically translated from SystemC to Verilog and connected to the PowerPC. This makes the SMILE HPRC system fully compatible with any existing MPI application. The SMILE HPRC proof of concept has been exhaustively tested with two complex and demanding applications: Monte Carlo financial simulation and Boolean synthesis using genetic algorithms. The results show remarkable performance, reasonable costs, low power consumption, no need for cooling systems, small physical space requirements, system scalability and software portability.

Journal of Systems Architecture | 2010

Content-based image retrieval algorithm acceleration in a low-cost reconfigurable FPGA cluster

Cesar Pedraza; Emilio Castillo; Javier Castillo; José Luis Bosque; José Ignacio Martínez; Oscar David Robles; Javier Cano; Pablo Huerta

The main aim of the SMILE project is to build an efficient low-cost cluster based on FPGA boards in order to take advantage of their reconfigurable capabilities. This paper shows the cluster architecture, describing the SMILE nodes, the high-speed communication network for the nodes and the software environment. Simulating complex applications can be very hard; therefore, a SystemC model of the whole system has been designed to simplify this task and provide error-free downloading and execution of the applications in the cluster. The hardware-software co-design process involved in the architecture and SystemC design is presented as well. The SMILE cluster functionality is tested by executing a real, complex Content-Based Information Retrieval (CBIR) parallel application, and the performance of the cluster is compared (time, power and cost) with a traditional cluster approach.

international parallel and distributed processing symposium | 2016

CATA: Criticality Aware Task Acceleration for Multicore Processors

Emilio Castillo; Miquel Moreto; Marc Casas; Lluc Alvarez; Enrique Vallejo; Kallia Chronaki; Rosa M. Badia; José Luis Bosque; Ramón Beivide; Eduard Ayguadé; Jesús Labarta; Mateo Valero

Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, these schedulers may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that task criticality information can be exploited to drive hardware reconfigurations, we propose a Criticality Aware Task Acceleration (CATA) mechanism that dynamically adapts the computational power of a task depending on its criticality. As a result, CATA achieves significant improvements over a baseline static scheduler, reaching average improvements up to 18.4% in execution time and 30.1% in Energy-Delay Product (EDP) on a simulated 32-core system. The cost of reconfiguring hardware by means of a software-only solution rises with the number of cores due to lock contention and reconfiguration overhead. Therefore, novel architectural support is proposed to eliminate these overheads on future manycore systems. This architectural support minimally extends hardware structures already present in current processors, which allows further improvements in performance with negligible overhead. As a consequence, average improvements of up to 20.4% in execution time and 34.0% in EDP are obtained, outperforming state-of-the-art acceleration proposals not aware of task criticality.

ieee international conference on high performance computing data and analytics | 2012

King Topologies for Fault Tolerance

Esteban Stafford; Emilio Castillo; Fernando Vallejo; José Luis Bosque; Carmen Martínez; Cristobal Camarero; Ramón Beivide

This paper analyzes the robustness of king networks for fault tolerance. To this end, two well-known fault-tolerant routing algorithms are evaluated in king as well as 2D networks: Immunet, which uses two virtual channels, and Immucube, which performs better but requires three virtual channels. Experimental results confirm the excellent behavior, both in performance and scalability, of the king topologies in the presence of failures. Finally, taking advantage of the topological features of king networks, a new fault-tolerant routing algorithm for these networks is presented. From a cost/performance point of view, this algorithm is a compromise between the two previous algorithms.
