Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Juan M. Cebrián is active.

Publication


Featured research published by Juan M. Cebrián.


Concurrency and Computation: Practice and Experience | 2014

Toward energy efficiency in heterogeneous processors: findings on virtual screening methods

Ginés D. Guerrero; Juan M. Cebrián; Horacio Pérez-Sánchez; José M. García; Manuel Ujaldon; José M. Cecilia

The integration of the latest breakthroughs in computational modeling and high performance computing (HPC) has leveraged advances in fields such as healthcare and drug discovery. By bringing these developments together, scientists are creating exciting new personalized therapeutic strategies for living longer that were unimaginable not long ago. At the same time, we are witnessing the biggest revolution in HPC of the last decade. Several graphics processing unit (GPU) architectures have established their niche in the HPC arena, but at the expense of excessive power and heat. A solution to this important problem is based on heterogeneity. In this paper, we analyze power consumption on heterogeneous systems, benchmarking a bioinformatics kernel within the framework of virtual screening methods. Core counts and frequencies are tuned to further improve the performance or energy efficiency of those architectures. Our experimental results show that the targeted low-cost systems are the lowest-power platforms, although the most energy-efficient platform, and the best suited for performance improvement, is NVIDIA's Kepler GK110 GPU programmed with CUDA (Compute Unified Device Architecture). Finally, the OpenCL (Open Computing Language) version of the virtual screening code shows a remarkable performance penalty compared with its CUDA counterpart.
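
The energy-efficiency comparison rests on two simple metrics, energy to solution and performance per watt. The sketch below, with made-up platform numbers rather than the paper's measurements, shows how both are derived from a measured runtime and average power draw.

```cpp
// Illustrative sketch (not the paper's data): comparing platforms by
// energy-to-solution and performance-per-watt from runtime and average power.
#include <cstdio>

struct Platform {
    const char* name;
    double runtime_s;   // measured kernel runtime (seconds)  -- placeholder values
    double avg_power_w; // measured average power draw (watts) -- placeholder values
    double work_gflop;  // useful work performed (GFLOP)       -- placeholder values
};

int main() {
    // Hypothetical numbers, only to show the metrics involved.
    Platform platforms[] = {
        {"low-cost ARM SoC", 120.0,  10.0, 500.0},
        {"Kepler GK110 GPU",   6.0, 180.0, 500.0},
    };
    for (const Platform& p : platforms) {
        double energy_j     = p.runtime_s * p.avg_power_w;  // energy to solution
        double gflops       = p.work_gflop / p.runtime_s;   // throughput
        double gflops_per_w = gflops / p.avg_power_w;       // energy efficiency
        std::printf("%-18s energy = %8.1f J  perf = %7.2f GFLOP/s  eff = %.3f GFLOP/s/W\n",
                    p.name, energy_j, gflops, gflops_per_w);
    }
}
```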


ACM Transactions on Architecture and Code Optimization | 2013

Modeling the impact of permanent faults in caches

Daniel Sánchez; Yiannakis Sazeides; Juan M. Cebrián; José M. García; Juan L. Aragón

The traditional performance and cost benefits we have enjoyed for decades from technology scaling are challenged by several critical constraints, including reliability. Increases in static and dynamic variations are leading to a higher probability of parametric and wear-out failures and are elevating reliability to a prime design constraint. In particular, the SRAM cells used to build caches, which dominate processor area, are usually minimum sized and more prone to failure. It is therefore of paramount importance to develop effective methodologies that facilitate the exploration of reliability techniques for caches. To this end, we present an analytical model that determines, for a given cache configuration, address trace, and random probability of permanent cell failure, the exact expected miss rate and its standard deviation when blocks with faulty bits are disabled. What distinguishes our model is that it is fully analytical and avoids the use of fault maps, yet it is both exact and simpler than previous approaches. The model is used to produce expected miss-rate trends for future technology nodes for both uncorrelated and clustered faults. Some of the key findings based on the proposed model are that (i) block disabling has a negligible impact on the expected miss rate unless the probability of failure is equal to or greater than 2.6e-4, (ii) the fault-map methodology can accurately calculate the expected miss rate as long as 1,000 to 10,000 fault maps are used, and (iii) the expected miss rate for parallel applications increases with the number of threads and, for a given probability of failure, is more pronounced than for sequential execution.
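
A hedged illustration of one ingredient that such analytical models build on: if each bit fails permanently with probability p, a block of B bits is faulty, and therefore disabled, with probability 1 - (1 - p)^B. The block size, associativity, and probabilities below are assumptions for illustration, not the paper's full model, which also accounts for the address trace and derives the exact expected miss rate.

```cpp
// Sketch of a building block behind such models (not the paper's full derivation):
// probability that a block is disabled and the expected number of usable ways.
#include <cmath>
#include <cstdio>

int main() {
    const int bits_per_block = 64 * 8;   // 64-byte blocks (assumed)
    const int blocks_per_set = 8;        // 8-way set-associative cache (assumed)
    const double p_fail[] = {1e-5, 1e-4, 2.6e-4, 1e-3};

    for (double p : p_fail) {
        // P(block has at least one faulty cell) = 1 - (1 - p)^bits_per_block
        double p_block_faulty = 1.0 - std::pow(1.0 - p, bits_per_block);
        // Expected usable ways per set when faulty blocks are disabled.
        double expected_ways = blocks_per_set * (1.0 - p_block_faulty);
        std::printf("p(bit) = %.1e -> p(block disabled) = %.4f, "
                    "expected usable ways/set = %.2f\n",
                    p, p_block_faulty, expected_ways);
    }
}
```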


Computers & Electrical Engineering | 2015

Evaluation of the 3-D finite difference implementation of the acoustic diffusion equation model on massively parallel architectures

Mario Hernández; Baldomero Imbernón; Juan M. Navarro; José M. García; Juan M. Cebrián; José M. Cecilia

The diffusion equation model is a popular tool in room acoustics modeling. The 3-D finite difference (3D-FD) implementation predicts the energy decay function and the sound pressure level in closed environments. This simulation is computationally expensive, as its cost depends on the resolution used to model the room. With such high computational requirements, a high-level programming language (e.g., Matlab) cannot cope with real-life scenario simulations, so it becomes mandatory to use computational resources more efficiently. Manycore architectures, such as NVIDIA GPUs or the Intel Xeon Phi, offer new opportunities to enhance scientific computations, increasing performance per watt but requiring a shift to a different programming model. This paper shows the roadmap for using massively parallel architectures in a 3D-FD simulation. We evaluate the latest generation of NVIDIA and Intel architectures. Our experimental results reveal that the NVIDIA architectures outperform the Intel Xeon Phi coprocessor by a wide margin while dissipating approximately 50 W less (25%) for large-scale input problems.
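
As an illustration of the kind of kernel involved, the following is a minimal serial sketch of a 7-point 3D finite-difference diffusion update; the grid size, coefficients, source placement, and boundary handling are assumptions, and the paper's GPU and Xeon Phi implementations are of course far more elaborate.

```cpp
// Minimal serial 7-point 3D finite-difference diffusion step (illustrative only).
#include <cstdio>
#include <vector>

int main() {
    const int N = 64;                        // cubic grid; borders held at zero
    const double D = 0.1, dt = 0.01, dx = 1.0;
    const double c = D * dt / (dx * dx);     // stable for c <= 1/6 in 3D
    auto idx = [N](int x, int y, int z) { return (z * N + y) * N + x; };

    std::vector<double> u(N * N * N, 0.0), u_new(N * N * N, 0.0);
    u[idx(N / 2, N / 2, N / 2)] = 1.0;       // impulse source in the middle of the room

    for (int step = 0; step < 100; ++step) {
        for (int z = 1; z < N - 1; ++z)
            for (int y = 1; y < N - 1; ++y)
                for (int x = 1; x < N - 1; ++x) {
                    double lap = u[idx(x+1,y,z)] + u[idx(x-1,y,z)]
                               + u[idx(x,y+1,z)] + u[idx(x,y-1,z)]
                               + u[idx(x,y,z+1)] + u[idx(x,y,z-1)]
                               - 6.0 * u[idx(x,y,z)];
                    u_new[idx(x,y,z)] = u[idx(x,y,z)] + c * lap;  // explicit update
                }
        u.swap(u_new);
    }
    std::printf("energy density at centre after 100 steps: %e\n", u[idx(N/2, N/2, N/2)]);
}
```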


International Conference on e-Science | 2015

Early Experiences with Separate Caches for Private and Shared Data

Juan M. Cebrián; Alberto Ros; Ricardo Fernández-Pascual; Manuel E. Acacio

Shared-memory architectures have become predominant in modern multi-core microprocessors in all market segments, from embedded to high performance computing. Correctness of these architectures is ensured by means of coherence protocols and consistency models. The performance and scalability of shared-memory systems is usually limited by the number and size of the messages used to keep the memory subsystem coherent. Moreover, we believe that blindly keeping coherence for all memory accesses can be counterproductive, since it incurs unnecessary overhead for data that will remain coherent after the access. With this in mind, in this paper we propose the use of dedicated caches for private (plus shared read-only) and shared data. The private cache (L1P) is independent for each core, while the shared cache (L1S) is logically shared but physically distributed across all cores. This separation should allow us to simplify the coherence protocol, reduce the on-chip area requirements, and reduce invalidation time with minimal impact on performance. The dedicated cache design requires a classification mechanism to detect private and shared data. In our evaluation we use a classification mechanism that operates at the operating system (OS) level, at page granularity. Results show two drawbacks to this approach: first, the selected classification mechanism has too many false positives, which becomes an important limiting factor; second, a traditional interconnection network is not optimal for accessing the L1S, and a custom network design is needed. These drawbacks lead to significant performance degradation due to the additional latency when accessing shared data.
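
A simplified sketch of a page-granularity private/shared classifier of the kind the evaluation relies on: the first core to touch a page owns it as private, and an access from any other core permanently reclassifies it as shared. This is a stand-alone illustration, not the OS implementation used in the paper.

```cpp
// Toy page-granularity private/shared classification (illustrative only).
#include <cstdint>
#include <cstdio>
#include <unordered_map>

enum class PageClass { Private, Shared };

struct PageInfo {
    int owner;          // first core that accessed the page
    PageClass cls;
};

class PageClassifier {
    std::unordered_map<uint64_t, PageInfo> table_;   // keyed by virtual page number
public:
    PageClass access(uint64_t vaddr, int core) {
        uint64_t vpn = vaddr >> 12;                  // 4 KiB pages assumed
        auto it = table_.find(vpn);
        if (it == table_.end()) {
            table_[vpn] = {core, PageClass::Private};
            return PageClass::Private;
        }
        if (it->second.cls == PageClass::Private && it->second.owner != core)
            it->second.cls = PageClass::Shared;      // promotion is one-way
        return it->second.cls;
    }
};

int main() {
    PageClassifier cls;
    std::printf("core 0 touches page A: %s\n",
                cls.access(0x1000, 0) == PageClass::Private ? "private" : "shared");
    std::printf("core 0 touches page A again: %s\n",
                cls.access(0x1008, 0) == PageClass::Private ? "private" : "shared");
    std::printf("core 1 touches page A: %s\n",
                cls.access(0x1010, 1) == PageClass::Private ? "private" : "shared");
}
```

Because classification happens at page granularity, a single shared variable drags the whole page, and everything else on it, into the shared class, which is one source of the false positives the paper reports.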


Concurrency and Computation: Practice and Experience | 2017

A dedicated private-shared cache design for scalable multiprocessors

Juan M. Cebrián; Ricardo Fernández-Pascual; Alexandra Jimborean; Manuel E. Acacio; Alberto Ros

Most modern architectures are based on a shared-memory design. Correctness of these architectures is ensured by means of coherence protocols and consistency models. However, the performance and scalability of shared-memory systems is usually constrained by the number and size of the messages used to keep the memory subsystem coherent. This is important not only in high performance computing but also in low-power embedded systems, especially if coherence is required between different components of the system-on-chip. We argue that using the same mechanism to keep coherence for all memory accesses can be counterproductive, because it incurs unnecessary overhead for data that would remain coherent after the access (i.e., private data and read-only shared data). This paper proposes the use of dedicated caches for two different kinds of data: (i) data that can be accessed without contacting other nodes and (ii) modifiable shared data. The private cache (L1P) is independent for each core and stores private data and read-only shared data. The shared cache (L1S), on the other hand, is logically shared but physically distributed across all cores. With this design, we can significantly simplify the coherence protocol, reduce the on-chip area requirements, and reduce invalidation time. However, this dedicated cache design requires a classification mechanism to detect the nature of the data being accessed. Results show two drawbacks to this approach: first, the accuracy of the classification mechanism has a huge impact on performance; second, a traditional interconnection network is not optimal for accessing the L1S, increasing register-to-cache latency when accessing shared data.
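
To sketch how accesses might be steered between the two caches, the snippet below keeps private and read-only data in the local L1P and maps shared blocks to a home L1S bank by simple address interleaving. The bank count, block size, and routing rule are illustrative assumptions, not the paper's exact design.

```cpp
// Toy routing of accesses to a per-core L1P or a distributed, banked L1S.
#include <cstdint>
#include <cstdio>

constexpr int kBanks     = 16;   // one L1S bank per core (assumed)
constexpr int kBlockBits = 6;    // 64-byte cache blocks (assumed)

// Shared data: the block address is interleaved across the distributed L1S banks.
int l1s_home_bank(uint64_t paddr) {
    return static_cast<int>((paddr >> kBlockBits) % kBanks);
}

// Prints where an access from `core` would be served under this toy routing rule.
void route(int core, uint64_t paddr, bool is_shared) {
    if (is_shared)
        std::printf("core %d, block 0x%llx -> L1S bank %d\n",
                    core, static_cast<unsigned long long>(paddr >> kBlockBits),
                    l1s_home_bank(paddr));
    else
        std::printf("core %d, block 0x%llx -> local L1P\n",
                    core, static_cast<unsigned long long>(paddr >> kBlockBits));
}

int main() {
    route(3, 0x7F000040, /*is_shared=*/false);  // private or read-only data
    route(3, 0x7F000040, /*is_shared=*/true);   // same block, classified shared
    route(3, 0x7F000080, /*is_shared=*/true);   // next block maps to the next bank
}
```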


International Parallel and Distributed Processing Symposium | 2007

Leakage Energy Reduction in Value Predictors through Static Decay

Juan M. Cebrián; Juan L. Aragón; José M. García

As process technology advances toward deep submicron (below 90 nm), static power becomes a new challenge for energy-efficient high performance processors, especially for large on-chip array structures such as caches and prediction tables. Value prediction emerged as an effective way of increasing processor performance by overcoming data dependences, but at the risk of becoming a thermal hot spot due to the additional power dissipation. This paper proposes the design of low-leakage value predictors by applying static decay techniques to disable unused entries in the prediction tables. We explore decay strategies suited to the three most common value predictors (STP, FCM, and DFCM), studying the particular tradeoffs of these prediction structures. Our mechanism reduces value predictor leakage energy efficiently without compromising either prediction accuracy or processor performance. Results show average leakage energy reductions of 52%, 65%, and 75% for the STP, DFCM, and FCM value predictors, respectively.
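
A hedged sketch of the decay idea applied to a prediction table: every entry keeps a small counter that is aged periodically and reset on use, and an entry whose counter saturates is treated as decayed and power-gated. The interval and counter width below are illustrative, not the tuned values from the paper.

```cpp
// Toy static-decay bookkeeping for a prediction table (illustrative only).
#include <array>
#include <cstdio>

constexpr int kEntries      = 1024;
constexpr int kDecayLimit   = 4;      // ticks of inactivity before power-gating (assumed)
constexpr int kTickInterval = 10000;  // cycles between global aging ticks (assumed)

struct Entry {
    bool active = false;   // false => entry is power-gated and its contents are lost
    int  idle_ticks = 0;
};

std::array<Entry, kEntries> table;

void on_access(int index) {            // called on every predictor lookup/update
    table[index].active = true;        // waking a gated entry costs a cold miss
    table[index].idle_ticks = 0;
}

void on_decay_tick() {                 // called every kTickInterval cycles
    for (Entry& e : table)
        if (e.active && ++e.idle_ticks >= kDecayLimit)
            e.active = false;          // decay: switch the entry off to cut leakage
}

int main() {
    on_access(42);
    for (int tick = 0; tick < kDecayLimit; ++tick) on_decay_tick();
    std::printf("entry 42 active after %d idle ticks (%d cycles): %s\n",
                kDecayLimit, kDecayLimit * kTickInterval,
                table[42].active ? "yes" : "no");
}
```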


Computing Frontiers | 2007

Adaptive VP decay: making value predictors leakage-efficient designs for high performance processors

Juan M. Cebrián; Juan L. Aragón; José M. García; Stefanos Kaxiras

Energy-efficient microprocessor designs are one of the major concerns in both the high performance and embedded processor domains. Furthermore, as process technology advances toward deep submicron, static power dissipation becomes a new challenge to address, especially for large on-chip array structures such as caches or prediction tables. Value prediction emerged in the recent past as a very effective way of increasing processor performance by overcoming data dependences. The more accurate the value predictor is, the more performance is obtained, at the expense of it becoming a source of power consumption and a thermal hot spot, and therefore of increased leakage. Recent techniques aimed at reducing the leakage power of array structures such as caches either switch off (non-state-preserving) or reduce the voltage level of (state-preserving) unused array portions. In this paper we propose the design of leakage-efficient value predictors by applying adaptive decay techniques to disable unused entries in the prediction tables. Because value predictors are implemented as non-tagged structures, an adaptive decay scheme has no way to precisely determine the miss ratio induced by prematurely decaying an entry. This paper explores adaptive decay strategies suited to the particularities of value predictors (Stride, DFCM, and FCM), which exhibit different access-pattern behaviour than caches, studying the tradeoffs of these prediction structures in order to reduce their leakage energy efficiently while compromising neither prediction accuracy nor the speedup provided. Results show average leakage energy reductions of 52%, 70%, and 80% for 20 KB Stride, DFCM, and FCM value predictors, respectively.
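
One plausible adaptive policy, sketched only to illustrate the idea and not claimed to be the paper's algorithm: since value predictions are checked anyway once the real result is available, a global controller could lengthen the decay interval when observed accuracy drops below a target and shorten it when accuracy is healthy. All constants and the feedback signal are assumptions.

```cpp
// Toy adaptive controller for a global decay interval (illustrative only).
#include <algorithm>
#include <cstdio>

constexpr int kMinInterval = 1000;      // shortest decay interval allowed (cycles, assumed)
constexpr int kMaxInterval = 128000;    // longest decay interval allowed (cycles, assumed)

struct AdaptiveDecayController {
    int interval_cycles = 8000;         // current global decay interval
    double target_accuracy;             // accuracy the scheme is willing to sustain

    explicit AdaptiveDecayController(double target) : target_accuracy(target) {}

    // Called at the end of each measurement window with the prediction
    // accuracy observed over that window.
    void update(double observed_accuracy) {
        if (observed_accuracy < target_accuracy)
            interval_cycles = std::min(interval_cycles * 2, kMaxInterval);  // decay less aggressively
        else
            interval_cycles = std::max(interval_cycles / 2, kMinInterval);  // save more leakage
    }
};

int main() {
    AdaptiveDecayController ctl(0.80);
    const double windows[] = {0.85, 0.83, 0.74, 0.72, 0.81};  // made-up accuracy samples
    for (double acc : windows) {
        ctl.update(acc);
        std::printf("accuracy %.2f -> decay interval %d cycles\n", acc, ctl.interval_cycles);
    }
}
```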


International Journal of High Performance Computing Applications | 2017

Offloading strategies for Stencil kernels on the KNC Xeon Phi architecture: Accuracy versus performance

Mario Hernández; Juan M. Cebrián; José M. Cecilia; José M. García

The ever-increasing computational requirements of HPC and service provider applications are becoming a great challenge for hardware and software designers. These requirements are reaching levels where isolated development in either field is not enough to meet the challenge; a holistic view of computational thinking is therefore the only way to succeed in real scenarios. However, this is not a trivial task, as it requires, among other things, hardware-software codesign. On the hardware side, most high-throughput computers are designed for heterogeneity, where accelerators (e.g., graphics processing units (GPUs), field-programmable gate arrays (FPGAs), etc.) are connected to the host CPUs through a high-bandwidth bus such as PCI Express. Applications, whether via programmers, compilers, or the runtime, must orchestrate data movement, synchronization, and so on among devices with different compute and memory capabilities. This increases programming complexity and may reduce overall application performance. This article evaluates different offloading strategies to leverage heterogeneous systems based on several cards with first-generation Xeon Phi coprocessors (Knights Corner). We use an 11-point 3-D stencil kernel that models heat dissipation as a case study. Our results reveal substantial performance improvements when using several accelerator cards. Additionally, we show that computing an approximate result to reduce the communication overhead can yield 23% performance gains for double-precision data sets.
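
The arithmetic behind a multi-card offload can be sketched as follows, under assumed sizes: the Z dimension is split into one slab per coprocessor, each pair of neighbouring slabs exchanges one halo plane per direction per exchange, and exchanging halos only every k time steps, accepting a bounded error at slab borders, is one way to trade accuracy for less PCIe traffic. This illustrates the trade-off, not the paper's exact offloading scheme.

```cpp
// Back-of-the-envelope halo-exchange traffic for a slab-partitioned 3D stencil.
#include <cstdio>
#include <initializer_list>

int main() {
    const long nx = 512, ny = 512, nz = 512;    // problem size (assumed)
    const int  cards = 4;                       // coprocessor cards (assumed)
    const int  steps = 1000;
    const long halo_plane_bytes = nx * ny * sizeof(double);

    const long slab_z = nz / cards;             // planes per card
    // Each of the (cards - 1) internal boundaries moves one plane per direction.
    const long halo_exchanges_per_step = 2 * (cards - 1);

    for (int k : {1, 4, 8}) {                   // exchange halos every k steps
        long exchanges = (steps / k) * halo_exchanges_per_step;
        double traffic_gb = exchanges * halo_plane_bytes / 1e9;
        std::printf("slab of %ld planes/card, halo every %d step(s): "
                    "%ld exchanges, %.2f GB over PCIe\n",
                    slab_z, k, exchanges, traffic_gb);
    }
}
```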


The Journal of Supercomputing | 2014

Managing power constraints in a single-core scenario through power tokens

Juan M. Cebrián; Daniel Sánchez; Juan L. Aragón; Stefanos Kaxiras

Current microprocessors face constant thermal and power-related problems during everyday use, usually addressed by applying a power budget to the processor or core. Dynamic voltage and frequency scaling (DVFS) has been an effective technique for allowing microprocessors to match a predefined power budget. However, the continuous increase of leakage power due to technology scaling, along with the low resolution of DVFS, makes it less attractive for matching a predefined power budget as technology moves into the deep-submicron regime. In this paper, we propose the use of microarchitectural techniques to accurately match a power constraint while maximizing the energy efficiency of the processor. We predict the processor power dissipation at cycle level (power token throttling) or at basic block level (basic block level mechanism), translating the dissipated power into tokens that are used to select between different power-saving microarchitectural techniques. We also introduce a two-level approach in which DVFS acts as a coarse-grain technique to lower the average power dissipation toward the power budget, while the microarchitectural techniques focus on removing the numerous power spikes. Experimental results show that the use of power-saving microarchitectural techniques in conjunction with DVFS is up to six times more precise, in terms of total energy consumed over the power budget, than using DVFS alone to match a predefined power budget.
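
A hedged sketch of the token accounting idea: the power budget is expressed as tokens granted per interval, the estimated dissipation of each cycle or basic block is charged in tokens, and when the balance goes negative a power-saving technique is engaged until it recovers. The numbers and the single on/off "technique" are simplifications for illustration, not the paper's mechanism.

```cpp
// Toy power-token budget controller (illustrative only).
#include <cstdio>

class PowerTokenBudget {
    double balance_ = 0.0;
    double budget_tokens_per_interval_;
public:
    explicit PowerTokenBudget(double budget) : budget_tokens_per_interval_(budget) {}

    // Called once per interval (cycle or basic block) with the tokens that
    // the power model estimates were just dissipated. Returns true when a
    // power-saving mode should be active.
    bool update(double spent_tokens) {
        balance_ += budget_tokens_per_interval_ - spent_tokens;
        return balance_ < 0.0;
    }
    double balance() const { return balance_; }
};

int main() {
    PowerTokenBudget budget(10.0);                       // 10 tokens/interval allowed
    const double spent[] = {8, 9, 14, 15, 13, 7, 6, 5};  // made-up per-interval estimates
    for (double s : spent) {
        bool throttle = budget.update(s);
        std::printf("spent %5.1f  balance %6.1f  %s\n",
                    s, budget.balance(), throttle ? "-> throttle" : "");
    }
}
```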


International Conference on Parallel Processing | 2011

Token3D: Reducing temperature in 3D die-stacked CMPs through cycle-level power control mechanisms

Juan M. Cebrián; Juan L. Aragón; Stefanos Kaxiras

Nowadays, chip multiprocessors (CMPs) are the standard design for a wide range of microprocessors: mobile devices (in the near future almost every smartphone will be governed by a CMP), desktop computers, laptops, servers, GPUs, APUs, etc. This way of increasing performance by exploiting parallelism has two major drawbacks: off-chip bandwidth and communication latency between cores. 3D die-stacked processors are a recent design trend aimed at overcoming these drawbacks by stacking multiple device layers. However, the increase in packing density also leads to an increase in power density, which translates into thermal problems. Different proposals can be found in the literature to address these thermal problems, such as dynamic thermal management (DTM), dynamic voltage and frequency scaling (DVFS), thread migration, etc. In this paper we propose the use of microarchitectural power budget techniques to reduce peak temperature. In particular, we first introduce Token3D, a new power balancing policy that takes temperature and layout information into account to balance the available per-core power, alongside other power optimizations for 3D designs. Second, we analyze a wide range of floorplans looking for the configuration with the lowest temperature. Experimental results show a reduction of the peak temperature of 2-26°C depending on the selected floorplan.
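
In the spirit of a temperature-aware balancing policy, though not Token3D's exact algorithm, the sketch below redistributes a chip-wide token budget in proportion to each core's remaining thermal headroom, so hotter cores (for example those on inner layers of the stack) receive a smaller share. The temperatures, the ceiling, and the weighting rule are illustrative assumptions.

```cpp
// Toy temperature-weighted distribution of a chip-wide power-token budget.
#include <cstdio>
#include <vector>

int main() {
    const double chip_budget_tokens = 400.0;
    // Per-core temperature estimates (degrees C); imagine cores 2 and 3 sit on
    // a hotter, inner layer of the 3D stack. All values are made up.
    const std::vector<double> temp = {62.0, 64.0, 81.0, 85.0};
    const double t_ref = 95.0;                   // temperature ceiling (assumed)

    // Weight each core by its remaining thermal headroom.
    double total_headroom = 0.0;
    for (double t : temp) total_headroom += (t_ref - t);

    for (size_t i = 0; i < temp.size(); ++i) {
        double share = (t_ref - temp[i]) / total_headroom;
        std::printf("core %zu: %.1f C -> %.1f tokens (%.0f%% of budget)\n",
                    i, temp[i], share * chip_budget_tokens, share * 100.0);
    }
}
```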

Collaboration


Dive into Juan M. Cebrián's collaborations.

Top Co-Authors

José M. Cecilia
Universidad Católica San Antonio de Murcia

Baldomero Imbernón
Universidad Católica San Antonio de Murcia