
Publications


Featured research published by Felipe Cabarcas.


International Symposium on Performance Analysis of Systems and Software | 2011

Trace-driven simulation of multithreaded applications

Alejandro Rico; Alejandro Duran; Felipe Cabarcas; Yoav Etsion; Alex Ramirez; Mateo Valero

Over the past few years, computer architecture research has moved towards execution-driven simulation, due to the inability of traces to capture timing-dependent thread execution interleaving. However, trace-driven simulation has many advantages over execution-driven simulation that are being missed in multithreaded application simulations. We present a methodology to properly simulate multithreaded applications using trace-driven environments. We distinguish the intrinsic application behavior from the computation performed to manage parallelism. Application traces capture the intrinsic behavior in the sections of code that are independent of the dynamic multithreaded nature, together with the points where parallelism-management computation occurs. The simulation framework is composed of a trace-driven simulation engine and a dynamic-behavior component that implements the parallelism-management operations for the application. At simulation time, these operations are reproduced by invoking their implementation in the dynamic-behavior component. The decisions made by these operations are based on the simulated architecture, making it possible to dynamically reschedule sections of code taken from the trace onto the target simulated components. Because the captured sections of code are independent of the parallel state of the application, they can be simulated on the trace-driven engine, while the parallelism-management operations, which must be re-executed, are carried out by the execution-driven component, thus achieving the best of both the trace- and execution-driven worlds. This simulation methodology opens several new research opportunities, including research on scheduling and other parallelism-management techniques for future architectures, and on hardware support for programming models.
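
To make the split concrete, here is a minimal C sketch of the idea (the two-record trace format, the task lengths, and the least-loaded-core policy are all illustrative assumptions, not the paper's actual framework): compute sections are replayed from the trace, while a parallelism-management operation is re-executed at simulation time against the simulated core state.

#include <stdio.h>

enum rec_kind { COMPUTE, CREATE_TASK };

struct record {
    enum rec_kind kind;
    long cycles;      /* duration of a compute section (COMPUTE only) */
    int  task_id;     /* id of the task created (CREATE_TASK only) */
};

#define NCORES 2
static long core_time[NCORES];   /* simulated clock per core */

/* Dynamic-behavior component: re-executed at simulation time, so its
 * decision depends on the *simulated* architecture state. */
static int schedule_task(void)
{
    int best = 0;
    for (int c = 1; c < NCORES; c++)
        if (core_time[c] < core_time[best])
            best = c;
    return best;                 /* least-loaded simulated core */
}

int main(void)
{
    /* Hypothetical trace: the master computes, then spawns two tasks. */
    struct record trace[] = {
        { COMPUTE, 100, 0 }, { CREATE_TASK, 0, 1 },
        { COMPUTE,  50, 0 }, { CREATE_TASK, 0, 2 },
    };
    long task_len[] = { 0, 300, 200 };   /* traced task durations */

    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        if (trace[i].kind == COMPUTE) {
            core_time[0] += trace[i].cycles;     /* replayed from trace */
        } else {
            int c = schedule_task();             /* re-executed operation */
            long start = core_time[c] > core_time[0]
                       ? core_time[c] : core_time[0];
            core_time[c] = start + task_len[trace[i].task_id];
            printf("task %d -> core %d (finishes at cycle %ld)\n",
                   trace[i].task_id, c, core_time[c]);
        }
    }
    return 0;
}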


International Symposium on Microarchitecture | 2010

The SARC Architecture

Alex Ramirez; Felipe Cabarcas; Ben H. H. Juurlink; Mauricio Alvarez Mesa; Friman Sánchez; Arnaldo Azevedo; Cor Meenderinck; Catalin Bogdan Ciobanu; Sebastian Isaza; Georgi Gaydadjiev

The SARC architecture is composed of multiple processor types and a set of user-managed direct memory access (DMA) engines that let the runtime scheduler overlap data transfer and computation. The runtime system automatically allocates tasks on the heterogeneous cores and schedules the data transfers through the DMA engines. SARC's programming model supports various highly parallel applications, with matching support from specialized accelerator processors.
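
The overlap of transfers and computation that the DMA engines enable follows the classic double-buffering pattern. Below is a self-contained C sketch; dma_get and dma_wait are hypothetical stand-ins (modeled here as an instant copy), not the SARC runtime API. While block b is being processed, the transfer of block b+1 is already in flight.

#include <stdio.h>
#include <string.h>

#define BLK     4
#define NBLOCKS 6

static int input[NBLOCKS][BLK];        /* data living in "main memory" */
static int buf[2][BLK];                /* double buffer in "local store" */

/* Stand-ins for asynchronous DMA primitives: a real runtime would issue
 * the transfer here and wait for completion separately. */
static void dma_get(int *dst, const int *src)
{
    memcpy(dst, src, sizeof(int) * BLK);
}
static void dma_wait(int tag) { (void)tag; }   /* modeled as instant */

static long compute(const int *b)
{
    long s = 0;
    for (int i = 0; i < BLK; i++) s += b[i];
    return s;
}

int main(void)
{
    long total = 0;
    for (int b = 0; b < NBLOCKS; b++)
        for (int i = 0; i < BLK; i++)
            input[b][i] = b * BLK + i;

    dma_get(buf[0], input[0]);                    /* prefetch first block */
    for (int b = 0; b < NBLOCKS; b++) {
        int cur = b & 1;
        dma_wait(cur);                            /* block b is resident */
        if (b + 1 < NBLOCKS)
            dma_get(buf[cur ^ 1], input[b + 1]);  /* overlap next transfer */
        total += compute(buf[cur]);               /* ...with this compute */
    }
    printf("sum = %ld\n", total);
    return 0;
}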


High Performance Embedded Architectures and Compilers | 2012

On the simulation of large-scale architectures using multiple application abstraction levels

Alejandro Rico; Felipe Cabarcas; Carlos Villavieja; Milan Pavlovic; Augusto Vega; Yoav Etsion; Alex Ramirez; Mateo Valero

Simulation is a key tool for computer architecture research. In particular, cycle-accurate simulators are extremely important for microarchitecture exploration and detailed design decisions, but they are slow and thus not suitable for simulating large-scale architectures, nor are they meant for that purpose. Moreover, microarchitecture design decisions are irrelevant, or even misleading, in early processor design stages and high-level explorations. This allows one to raise the abstraction level of the simulated architecture, and also the abstraction level of the application, which does not necessarily have to be represented as an instruction stream. In this paper we introduce a definition of different application abstraction levels, and show how these are employed in TaskSim, a multi-core architecture simulator, to provide several architecture modeling abstractions and to simulate large-scale architectures with hundreds of cores. We compare the simulation speed of these abstraction levels to those of existing simulation tools, and also evaluate their utility and accuracy. Our simulations show that a very high-level abstraction, which may be even faster than native execution, is useful for scalability studies on parallel applications, and that, by simulating just the explicit memory transfers, we achieve accurate simulations for architectures using non-coherent scratchpad memories, with only a 25x slowdown compared to native execution. Furthermore, we revisit trace memory simulation techniques, which are more abstract than instruction-by-instruction simulation and provide an 18x simulation speedup.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

CellSs: Scheduling techniques to better exploit memory hierarchy

Pieter Bellens; Josep M. Perez; Felipe Cabarcas; Alex Ramirez; Rosa M. Badia; Jesús Labarta

Cell Superscalar's (CellSs) main goal is to provide a simple, flexible and easy programming approach for the Cell Broadband Engine (Cell/B.E.) that automatically exploits the inherent concurrency of applications at the task level. The CellSs environment is based on a source-to-source compiler that translates annotated C or Fortran code, and a runtime library tailored for the Cell/B.E. that takes care of the concurrent execution of the application. The first efforts at task scheduling in CellSs derived from very simple heuristics. This paper presents new scheduling techniques developed for CellSs to improve application performance. Additionally, the design of a new scheduling algorithm is detailed and the algorithm is evaluated. The CellSs scheduler takes into account an extension of the Cell/B.E. memory hierarchy: a cache memory shared between the SPEs. All the new scheduling techniques have been evaluated, showing improved behavior of our system.
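
A shared-cache-aware scheduler of this kind can be illustrated with a toy affinity heuristic, sketched in C below. This is an illustration only, not the actual CellSs algorithm: the scheduler prefers the SPE whose previous task touched the most data blocks that the candidate task also needs, since those blocks are more likely to still be resident in the shared cache.

#include <stdio.h>

#define NSPE 4
#define NBLK 3   /* data blocks touched per task (illustrative) */

/* Blocks touched by the last task run on each SPE. */
static int last_blocks[NSPE][NBLK] = {
    {0, 1, 2}, {3, 4, 5}, {6, 7, 8}, {9, 10, 11}
};

/* Count how many blocks two tasks have in common. */
static int overlap(const int *a, const int *b)
{
    int n = 0;
    for (int i = 0; i < NBLK; i++)
        for (int j = 0; j < NBLK; j++)
            if (a[i] == b[j]) n++;
    return n;
}

int main(void)
{
    int task[NBLK] = {4, 5, 12};     /* blocks the next task reads */
    int best = 0, best_ov = -1;

    /* Pick the SPE with the highest block overlap (cache affinity). */
    for (int s = 0; s < NSPE; s++) {
        int ov = overlap(task, last_blocks[s]);
        if (ov > best_ov) { best_ov = ov; best = s; }
    }
    printf("schedule task on SPE %d (%d needed blocks likely resident)\n",
           best, best_ov);
    return 0;
}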


European Conference on Parallel Processing | 2010

Long DNA sequence comparison on multicore architectures

Friman Sánchez; Felipe Cabarcas; Alex Ramirez; Mateo Valero

Biological sequence comparison is one of the most important tasks in Bioinformatics. Due to the growth of biological databases, sequence comparison is becoming an important challenge for high performance computing, especially when very long sequences are compared. The Smith-Waterman (SW) algorithm is an exact method based on dynamic programming to quantify local similarity between sequences. The inherent large parallelism of the algorithm makes it ideal for architectures supporting multiple dimensions of parallelism (TLP, DLP and ILP). In this work, we show how long-sequence comparison takes advantage of current and future multicore architectures. We analyze two different SW implementations on the Cell/B.E. and use simulation tools to study the performance scalability in a multicore architecture. We study the memory organization that delivers the maximum bandwidth with the minimum cost. Our results show that a heterogeneous architecture is a valid alternative to execute challenging bioinformatic workloads.
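
For reference, the Smith-Waterman kernel that both implementations parallelize is a short dynamic program. The C sketch below computes the local-alignment score with a linear gap penalty (the match/mismatch/gap parameters are illustrative). The parallelism the paper exploits comes from the fact that all cells on an anti-diagonal of the scoring matrix are mutually independent and can be computed concurrently.

#include <stdio.h>
#include <string.h>

#define MATCH     2
#define MISMATCH -1
#define GAP      -1

static int max4(int a, int b, int c, int d)
{
    int m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

/* Smith-Waterman local alignment score with a linear gap penalty. */
static int sw_score(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b), best = 0;
    int H[64][64] = {0};             /* fine for short demo strings */

    for (int i = 1; i <= la; i++)
        for (int j = 1; j <= lb; j++) {
            int s = (a[i-1] == b[j-1]) ? MATCH : MISMATCH;
            /* local alignment: scores never go below zero */
            H[i][j] = max4(0, H[i-1][j-1] + s,
                              H[i-1][j] + GAP, H[i][j-1] + GAP);
            if (H[i][j] > best) best = H[i][j];
        }
    return best;
}

int main(void)
{
    printf("score = %d\n", sw_score("ACACACTA", "AGCACACA"));
    return 0;
}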


International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2011

Breaking the bandwidth wall in chip multiprocessors

Augusto Vega; Felipe Cabarcas; Alex Ramirez; Mateo Valero

In throughput-aware CMPs like GPUs and DSPs, software-managed streaming memory systems are an effective way to tolerate high latencies. For example, the Cell/B.E. incorporates local memories, and data transfers to/from those memories are overlapped with computation using DMAs. In such designs, the latency of the memory system has little impact on performance; instead, memory bandwidth becomes critical. With the increase in the number of cores, conventional DRAMs no longer suffice to satisfy the bandwidth demand. Hence, recent throughput-aware CMPs have adopted caches to filter off-chip traffic. However, such caches are optimized for latency, not bandwidth. This work presents a re-design of the memory system in throughput-aware CMPs. Instead of a traditional latency-aware cache, we propose to spread the address space, using fine-grained interleaving, over a shared non-coherent last-level cache (LLC). In this way, on-chip storage is used optimally, with no need to keep coherence. On the memory side, we also propose interleaving across DRAMs, but at a much finer granularity than the usual page-sized approaches. Our proposal is highly optimized for bandwidth, not latency, by avoiding data replication in the LLC and by using fine-grained address space interleaving in both the LLC and the memory. For a CMP with 128 cores and a 64-MB LLC, performance is improved by 21% due to the LLC optimizations and an extra 42% due to the off-chip memory optimizations, for a total performance improvement of 1.7x.
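
The proposed address-space spreading can be captured in a few lines of C (line size, bank count, and channel count below are illustrative, not the evaluated configuration): consecutive cache lines map round-robin onto LLC banks, and the address bits just above the bank bits select the DRAM channel, so both mappings change at a fine granularity.

#include <stdio.h>
#include <stdint.h>

#define LINE_BITS 7    /* 128-byte lines (illustrative) */
#define NBANKS    4    /* LLC banks (illustrative) */
#define NCHANNELS 4    /* DRAM channels (illustrative) */

/* Consecutive lines go to consecutive LLC banks... */
static unsigned llc_bank(uint64_t addr)
{
    return (addr >> LINE_BITS) % NBANKS;
}

/* ...and the bits just above the bank bits pick the DRAM channel, so
 * the channel also changes every few lines rather than every page. */
static unsigned dram_channel(uint64_t addr)
{
    return ((addr >> LINE_BITS) / NBANKS) % NCHANNELS;
}

int main(void)
{
    for (uint64_t line = 0; line < 16; line++) {
        uint64_t addr = line << LINE_BITS;
        printf("line %2llu (addr 0x%05llx) -> LLC bank %u, DRAM channel %u\n",
               (unsigned long long)line, (unsigned long long)addr,
               llc_bank(addr), dram_channel(addr));
    }
    return 0;
}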


Concurrency and Computation: Practice and Experience | 2011

Scalable multicore architectures for long DNA sequence comparison

Friman Sánchez; Felipe Cabarcas; Alex Ramirez; Mateo Valero

Biological sequence comparison is one of the most important tasks in Bioinformatics. Owing to the fast growth of databases that contain biological information, sequence comparison represents an important challenge for high-performance computing, especially when very long sequences are compared, i.e. the complete genomes of several organisms. The Smith-Waterman (SW) algorithm is an exact method based on dynamic programming to quantify local similarity between sequences. The inherent large parallelism of the algorithm makes it ideal for architectures supporting multiple dimensions of parallelism (TLP, DLP and ILP). Concurrently, there is a paradigm shift towards chip multiprocessors in computer architecture, which offer a huge amount of potential performance that can only be exploited efficiently if applications are effectively mapped and parallelized. In this work, we analyze how large-scale biological sequence comparison takes advantage of current and future multicore architectures. Our starting point is the performance analysis of the current multicore IBM Cell/B.E. processor; we analyze two different SW implementations on the Cell/B.E. Then, using simulation tools, we study the performance scalability when a many-core architecture is used for long DNA sequence comparison. We investigate the memory organization that delivers the maximum bandwidth with the minimum cost. Our results show that a heterogeneous architecture can be an efficient alternative to execute challenging bioinformatic workloads.


International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2010

Interleaving granularity on high bandwidth memory architecture for CMPs

Felipe Cabarcas; Alejandro Rico; Yoav Etsion; Alex Ramirez

Memory bandwidth has always been a critical factor in the performance of many data-intensive applications. Increasing processor performance and the advent of single-chip multiprocessors have pushed memory bandwidth demands beyond what a single commodity memory device can provide. The immediate solution is to use more than one memory device, and to interleave data across them so they can be used in parallel as if they were a single device of higher bandwidth. In this paper we show that fine-grained memory interleaving on the evaluated many-core architectures with many DRAM channels is critical to achieving high memory bandwidth efficiency. Our results show that performance can degrade by up to 50% when the achievable bandwidth falls far short of the installed maximum.
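
The granularity effect can be reproduced with a toy mapping, sketched in C below with made-up sizes: a sequential burst of cache-line accesses is spread over every channel under fine (128-byte) interleaving, but is served entirely by a single channel under coarse (4 KiB, page-sized) interleaving until the burst crosses a page boundary.

#include <stdio.h>
#include <stdint.h>

#define NCH 8   /* DRAM channels (illustrative) */

/* Channel selected by the address bits just above the interleaving grain. */
static unsigned channel(uint64_t addr, unsigned grain_bits)
{
    return (addr >> grain_bits) % NCH;
}

int main(void)
{
    /* A sequential burst of 16 cache-line (128 B) accesses. */
    for (uint64_t line = 0; line < 16; line++) {
        uint64_t addr = line * 128;
        printf("addr 0x%05llx: 128 B grain -> channel %u, "
               "4 KiB grain -> channel %u\n",
               (unsigned long long)addr,
               channel(addr, 7),    /* fine:   128-byte interleaving */
               channel(addr, 12));  /* coarse: 4 KiB page interleaving */
    }
    return 0;
}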


Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies | 2010

Comparing last-level cache designs for CMP architectures

Augusto Vega; Alejandro Rico; Felipe Cabarcas; Alex Ramirez; Mateo Valero

The emergence of hardware accelerators, such as graphics processing units (GPUs), has challenged the interaction between processing elements (PEs) and main memory. In architectures like the Cell/B.E. or GPUs, the PEs incorporate local memories which are fed with data transferred from memory using direct memory accesses (DMAs). We expect that chip multiprocessors (CMPs) with DMA-managed local memories will become more popular in the near future due to the increasing interest in accelerators. In this work we show that, in that case, the way cache hierarchies are conceived should be revised. Particularly for last-level caches, the norm today is to use latency-aware organizations; for instance, in dynamic non-uniform cache architectures (D-NUCA), data is migrated closer to the requesting processor to optimize latency. However, in DMA-based scenarios the memory system latency becomes irrelevant compared with the time spent moving the DMA data, so latency-aware designs are, a priori, inefficient. In this work, we revisit last-level cache design in DMA-based CMP architectures with master-worker execution. Two scenarios are evaluated. First, we consider a set of private caches with data replication across them, where coherence of the copies is ensured through a hardware protocol; in this scenario, a PE has a nearby copy of the datum, improving cache access latency. Second, we consider a partitioned cache, where the allocation of a datum to a cache block is determined by its physical address; in this scenario there are no copies of data, and access to a datum has a variable latency. In contrast with traditional load/store-based architectures, we find that the partitioned last-level cache scheme outperforms the cache with data replication in DMA-based scenarios.
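
The capacity argument behind the partitioned organization can be stated in a few lines of C (bank count and sizes are made up for illustration): with replication, in the worst case N private caches of size S hold only S unique bytes, whereas partitioning by physical address stores exactly one copy of each line and exposes the full N*S, at the price of a variable access latency.

#include <stdio.h>
#include <stdint.h>

#define NBANKS    8      /* cache banks / private caches (illustrative) */
#define BANK_KB   512    /* size of each bank in KB (illustrative) */
#define LINE_BITS 7      /* 128-byte lines */

/* Partitioned LLC: the physical address alone decides the home bank,
 * so no copies exist and no coherence among banks is needed. */
static unsigned home_bank(uint64_t paddr)
{
    return (paddr >> LINE_BITS) % NBANKS;
}

int main(void)
{
    printf("replicated private caches: as little as %d KB unique data "
           "(worst case: every cache holds the same hot lines)\n", BANK_KB);
    printf("partitioned cache: %d KB unique data (one copy per line)\n",
           NBANKS * BANK_KB);
    for (uint64_t a = 0; a < 4 * 128; a += 128)
        printf("line at 0x%04llx is homed in bank %u\n",
               (unsigned long long)a, home_bank(a));
    return 0;
}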


IEEE Transactions on Computers | 2012

DMA++: On the Fly Data Realignment for On-Chip Memories

Nikola Vujic; Felipe Cabarcas; Marc Gonzalez Tallada; Alex Ramirez; Xavier Martorell; Eduard Ayguadé

Multimedia extensions based on Single-Instruction Multiple-Data (SIMD) units are widespread. They have been used for some time in processors and accelerators (e.g., the Cell SPEs). SIMD units usually have significant memory alignment constraints in order to meet power requirements and design simplicity. This increases the complexity of the code generated by the compiler since, in the general case, the compiler cannot be sure of the proper alignment of data. To cope with this, the ISA provides either unaligned memory load and store instructions, or a special set of instructions to perform realignment in software. In this paper, we propose a hardware realignment unit that takes advantage of the DMA transfers needed in accelerators with local memories. While the data are being transferred, they are realigned on the fly by our realignment unit and stored at the desired alignment in the accelerator memory. This mechanism helps programmers organize data in the accelerator memory so that the accelerator can access the data without special instructions. Finally, the data are also realigned properly when written back to main memory. Our experiments with nine applications show that with our approach the bandwidth of the DMA transfers is not penalized.
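
A software model of the on-the-fly realignment is sketched below in C; the interface and the 16-byte alignment target are assumptions for illustration, not the DMA++ hardware design. The model fetches aligned chunks at the source, merges adjacent chunks with a byte shift, and deposits the data aligned at the destination, which is the effect the proposed unit achieves during the transfer itself.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define ALIGN 16   /* e.g., 128-bit SIMD alignment (assumption) */

static uint8_t main_mem[64];                         /* "main memory" */
static _Alignas(ALIGN) uint8_t local_store[32];      /* "local store" */

/* Software model of a realigning DMA: fetch ALIGN-sized chunks at
 * aligned source addresses, merge adjacent chunks with a byte shift,
 * and store the result aligned at offset 0 of the destination.
 * n must be a multiple of ALIGN in this simplified model. */
static void dma_realign(uint8_t *dst, const uint8_t *base,
                        size_t off, size_t n)
{
    size_t shift = off % ALIGN;
    const uint8_t *chunk = base + (off - shift);     /* aligned fetch base */
    for (size_t i = 0; i < n; i += ALIGN) {
        uint8_t merged[2 * ALIGN];
        memcpy(merged, chunk + i, 2 * ALIGN);        /* two aligned fetches */
        memcpy(dst + i, merged + shift, ALIGN);      /* byte-shifted store */
    }
}

int main(void)
{
    for (int i = 0; i < 64; i++) main_mem[i] = (uint8_t)i;

    dma_realign(local_store, main_mem, 3, 32);       /* source off by 3 */

    printf("local store alignment: %u (0 = aligned)\n",
           (unsigned)((uintptr_t)local_store % ALIGN));
    printf("first bytes: %u %u %u %u (data started at main_mem[3])\n",
           local_store[0], local_store[1], local_store[2], local_store[3]);
    return 0;
}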

Collaboration


Dive into Felipe Cabarcas's collaborations.

Top Co-Authors

Alex Ramirez
Polytechnic University of Catalonia

Mateo Valero
Polytechnic University of Catalonia

Alejandro Rico
Barcelona Supercomputing Center

Yoav Etsion
Technion – Israel Institute of Technology

Augusto Vega
Polytechnic University of Catalonia

Friman Sánchez
Polytechnic University of Catalonia

Carlos Villavieja
Polytechnic University of Catalonia

Milan Pavlovic
Polytechnic University of Catalonia

Rosa M. Badia
Barcelona Supercomputing Center

Sebastian Isaza
Delft University of Technology