Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Juan L. Aragón is active.

Publication


Featured research published by Juan L. Aragón.


Journal of the Optical Society of America A: Optics, Image Science, and Vision | 2001

Dynamics of the eye's wave aberration

Heidi Hofer; Pablo Artal; Ben Singer; Juan L. Aragón; David R. Williams

It is well known that the eye's optics exhibit temporal instability in the form of microfluctuations in focus; however, almost nothing is known of the temporal properties of the eye's other aberrations. We constructed a real-time Hartmann-Shack (HS) wave-front sensor to measure these dynamics at frequencies as high as 60 Hz. To reduce spatial inhomogeneities in the short-exposure HS images, we used a low-coherence source and a scanning system. HS images were collected on three normal subjects with natural and paralyzed accommodation. Average temporal power spectra were computed for the wave-front rms, the Seidel aberrations, and each of 32 Zernike coefficients. The results indicate the presence of fluctuations in all of the eye's aberrations, not just defocus. Fluctuations in higher-order aberrations share similar spectra and bandwidths both within and between subjects, dropping at a rate of approximately 4 dB per octave in temporal frequency. The spectrum shape for higher-order aberrations is generally different from that for microfluctuations of accommodation. The origin of these measured fluctuations is not known, and both corneal/lenticular and retinal causes are considered. Under the assumption that they are purely corneal or lenticular, calculations suggest that a perfect adaptive optics system with a closed-loop bandwidth of 1-2 Hz could correct these aberrations well enough to achieve diffraction-limited imaging over a dilated pupil.
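The averaged temporal power spectra mentioned above can be illustrated with a minimal sketch, assuming a 60 Hz sampling rate and synthetic data; this is not the authors' processing pipeline, and Welch's method is used here only as one common spectral estimator.

```python
# Minimal sketch (synthetic data, not the authors' code): averaged temporal
# power spectrum of one Zernike coefficient sampled at 60 Hz.
import numpy as np
from scipy.signal import welch

fs = 60.0                                    # assumed Hartmann-Shack sampling rate (Hz)
t = np.arange(0, 30.0, 1.0 / fs)             # 30 s of hypothetical recording
zernike_c = 0.05 * np.random.randn(t.size)   # placeholder coefficient trace (microns)

# Welch's averaged periodogram; the paper's exact estimator is not specified here.
freqs, psd = welch(zernike_c, fs=fs, nperseg=256)

# A roll-off of ~4 dB per octave corresponds to a log-log slope of roughly
# -4 / (10 * log10(2)) ~ -1.3 in power versus temporal frequency.
```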


high-performance computer architecture | 2003

Power-aware control speculation through selective throttling

Juan L. Aragón; José González; Antonio González

With constant technology advances that increase transistor counts and processor frequencies, power dissipation is becoming one of the major issues in high-performance processors. These processors increase their clock frequency by lengthening the pipeline, which puts more pressure on the branch prediction engine since branches take longer to resolve. Branch mispredictions are responsible for around 28% of the power dissipated by a typical processor due to the useless work performed by instructions that are squashed. This work focuses on reducing the power dissipated by mis-speculated instructions. We propose selective throttling as an effective way of triggering different power-aware techniques (fetch throttling, decode throttling, or disabling the selection logic). The particular set of techniques applied to each branch is chosen dynamically depending on the confidence level of the branch prediction. For low-confidence predictions, the most aggressive throttling mechanism is used, whereas high-confidence predictions trigger the least aggressive techniques. Results show that combining fetch bandwidth reduction with select-logic disabling provides the best results in terms of both energy reduction and energy-delay improvement (14% and 9%, respectively, for 14 stages, and 17% and 12%, respectively, for 28 stages).
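As a rough illustration of the confidence-driven selection described above, the following sketch maps a branch confidence estimate to a set of throttling actions; the thresholds and the three-level split are illustrative assumptions, not the paper's tuned policy.

```python
# Sketch of a confidence-driven throttling policy (illustrative thresholds only).
def choose_throttling(confidence: float) -> set[str]:
    """Pick power-saving actions for a predicted branch based on its confidence."""
    if confidence < 0.3:
        # Low-confidence prediction: apply the most aggressive combination.
        return {"fetch_throttle", "decode_throttle", "disable_select_logic"}
    if confidence < 0.7:
        # Medium confidence: throttle the front end only.
        return {"fetch_throttle", "decode_throttle"}
    # High confidence: no throttling, preserve performance.
    return set()
```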


IEEE Transactions on Computers | 2010

Heterogeneous Interconnects for Energy-Efficient Message Management in CMPs

Antonio Flores; Juan L. Aragón; Manuel E. Acacio

Continuous improvements in integration scale have led major microprocessor vendors to move to designs that integrate several processing cores on the same chip. Chip multiprocessors (CMPs) constitute a good alternative to traditional monolithic designs for several reasons, among others better levels of performance, scalability, and performance/energy ratio. On the other hand, higher clock frequencies and the increasing transistor density have revealed power dissipation and temperature as critical design issues in current and future architectures. Previous studies have shown that the interconnection network of a CMP has a significant impact on both overall performance and energy consumption. Moreover, the wires used in such an interconnect can be designed with varying latency, bandwidth, and power characteristics. In this work, we show how messages can be managed efficiently, from the point of view of both performance and energy, in tiled CMPs using a heterogeneous interconnect. Our proposal consists of two approaches. The first is Reply Partitioning, a technique that splits replies carrying data into a short Partial Reply message, which carries the sub-block of the cache line that includes the word requested by the processor, plus an Ordinary Reply with the full cache line. This technique allows all messages used to ensure coherence between the L1 caches of a CMP to be classified into two groups: critical and short, and noncritical and long. The second approach is the use of a heterogeneous interconnection network composed of low-latency wires for critical messages and low-energy wires for noncritical ones. Detailed simulations of 8- and 16-core CMPs show that our proposal obtains average savings of 7 percent in execution time and 70 percent in the Energy-Delay squared Product (ED2P) metric of the interconnect over previous works (from 24 to 30 percent average ED2P improvement for the full CMP). Additionally, a sensitivity analysis shows that although execution time is minimized for 16-byte sub-blocks, the best choice from the point of view of the ED2P metric is the 4-byte sub-block configuration, with an additional improvement of 2 percent over the 16-byte one for the ED2P metric of the full CMP.
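The message classification behind Reply Partitioning can be sketched as follows; the message kinds, sizes, and wire-class names are simplified placeholders for illustration, not the paper's protocol definition.

```python
# Sketch: classify coherence messages and map them to wire classes in a
# heterogeneous interconnect (field and class names are illustrative).
from dataclasses import dataclass

@dataclass
class Message:
    kind: str          # e.g. "request", "partial_reply", "ordinary_reply", "ack"
    size_bytes: int

def wire_class(msg: Message) -> str:
    """Critical, short messages use low-latency wires; the rest use low-energy wires."""
    critical_and_short = msg.kind in {"request", "partial_reply", "ack"}
    return "low_latency_wires" if critical_and_short else "low_energy_wires"

# A data reply is split into a short Partial Reply carrying the requested word's
# sub-block (critical) and an Ordinary Reply carrying the full line (non-critical).
print(wire_class(Message("partial_reply", 8)))     # -> low_latency_wires
print(wire_class(Message("ordinary_reply", 64)))   # -> low_energy_wires
```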


international conference on supercomputing | 2002

Dual path instruction processing

Juan L. Aragón; José González; Antonio González; James E. Smith

The reasons for performance losses due to conditional branch mispredictions are first studied. Branch misprediction penalties are broken into three categories: pipeline-fill penalty, window-fill penalty, and serialization penalty. The first and third of these produce most of the performance loss, but the second is also significant. Previously proposed dual (or multi) path execution methods attempt to reduce all three penalties, but these methods are also quite complex. Most of the complexity is caused by simultaneously executing instructions from multiple paths. A good engineering compromise is to avoid the complexity of multiple-path execution by focusing on methods that reduce only the pipeline and window re-fill penalties. Dual Path Instruction Processing (DPIP) is proposed as a simple mechanism that fetches, decodes, and renames, but does not execute, instructions from the alternative path of low-confidence predicted branches at the same time as the predicted path is being executed. All the stages of the pipeline front end are thus hidden once the misprediction is detected. This method targets the pipeline-fill penalty and is shown to achieve a good trade-off between performance and complexity. To reduce the window-fill penalty, we further propose the addition of a pre-scheduling engine that schedules instructions from the alternative path in an estimated execution order. Thus, after a misprediction, a large number of instructions from the alternate path can be immediately issued for execution, achieving an effect similar to very fast re-filling of the window. Performance evaluation of DPIP in a 14-stage superscalar processor (like the IBM POWER4) shows an IPC improvement of up to 10% for the bzip2 benchmark and an average of 8% for ten benchmarks from the SPECint95 and SPECint2000 suites.
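A very rough sketch of the DPIP front-end idea is given below; the confidence threshold, the pipeline abstraction, and all names are illustrative assumptions rather than the mechanism's actual implementation.

```python
# Illustrative sketch: for a low-confidence branch, the alternate path is
# fetched/decoded/renamed in parallel but not executed, so a misprediction
# skips the front-end refill.
from dataclasses import dataclass

@dataclass
class Branch:
    predicted_path: str
    alternate_path: str
    actual_path: str
    confidence: float              # confidence estimate for the prediction (0..1)

def front_end(path: str) -> str:
    return f"fetched/decoded/renamed:{path}"   # stand-in for the pipeline front end

def resolve(branch: Branch, low_conf_threshold: float = 0.5) -> str:
    main = front_end(branch.predicted_path)
    spare = front_end(branch.alternate_path) if branch.confidence < low_conf_threshold else None
    if branch.actual_path == branch.predicted_path:
        return f"execute {main}"
    if spare is not None:
        return f"execute {spare} (front-end refill hidden)"
    return f"execute {front_end(branch.alternate_path)} (full misprediction penalty)"

print(resolve(Branch("taken", "not-taken", "not-taken", confidence=0.3)))
```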


advanced information networking and applications | 2007

Sim-PowerCMP: A Detailed Simulator for Energy Consumption Analysis in Future Embedded CMP Architectures

Antonio Flores; Juan L. Aragón; Manuel E. Acacio

Continuous improvements in integration scale have led major microprocessor vendors to move to designs that integrate several processor cores on the same chip. Chip multiprocessors (CMPs) constitute the architecture of choice in the high-performance embedded domain for several reasons, such as better levels of scalability and performance/energy ratio. On the other hand, higher clock frequencies and increasing transistor density have revealed power dissipation as a critical design issue, especially in embedded systems where reduced energy consumption directly translates into extended battery life. In this work we present Sim-PowerCMP, a detailed architecture-level power-performance simulation tool for CMP architectures that integrates several well-known contemporary simulators (RSIM, HotLeakage, and Orion) into a single framework. As a use case of Sim-PowerCMP, we present a characterization of the energy efficiency of a CMP for parallel scientific applications, paying special attention to the energy consumed in the interconnect. Results for 8- and 16-core CMPs show that the contribution of the interconnection network to total power is close to 20% on average, and that the most energy-consuming messages are replies that carry data (almost 70% of the total energy consumed in the interconnect).
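The kind of breakdown reported above can be illustrated with a tiny accounting sketch; the figures and component names below are hypothetical placeholders, not Sim-PowerCMP output.

```python
# Hypothetical per-component energy totals (joules) for one simulated run,
# used only to show how interconnect and data-reply shares would be derived.
energy = {"cores": 5.1, "l2_cache": 1.4, "interconnect": 1.6, "other": 0.2}
by_message = {"data_replies": 1.1, "requests": 0.3, "acks_and_others": 0.2}

total = sum(energy.values())
print(f"interconnect share of total energy: {energy['interconnect'] / total:.1%}")
print(f"data-reply share of interconnect energy: "
      f"{by_message['data_replies'] / sum(by_message.values()):.1%}")
```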


international symposium on microarchitecture | 2015

DeSC: decoupled supply-compute communication management for heterogeneous architectures

Tae Jun Ham; Juan L. Aragón; Margaret Martonosi

Today's computers employ significant heterogeneity to meet performance targets at manageable power. In adopting increased compute specialization, however, the relative amount of time spent on memory or communication latency has increased. System and software optimizations for memory and communication often come at the cost of increased complexity and reduced portability. We propose Decoupled Supply-Compute (DeSC) as a way to attack memory bottlenecks automatically, while maintaining good portability and low complexity. Drawing from Decoupled Access-Execute (DAE) approaches, our work updates and expands on these techniques with increased specialization and automatic compiler support. Across the evaluated workloads, DeSC offers an average speedup of 2.04× over the baseline (on homogeneous CMPs) and a 1.56× speedup when a DeSC data supplier feeds data to a hardware accelerator. Achieving performance very close to what a perfect cache hierarchy would offer, DeSC delivers the performance gains of specialized communication acceleration while maintaining useful generality across platforms.
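The decoupled supply/compute organization can be sketched at a very high level: a supplier slice runs ahead performing the memory accesses and enqueues operands, and the compute slice consumes them. The queue-based toy below is a simplification under assumed names, not the DeSC hardware or compiler.

```python
# Toy sketch of decoupled access-execute style supply/compute slices.
from collections import deque

def supplier(addresses, memory):
    q = deque()
    for addr in addresses:          # runs ahead of compute
        q.append(memory[addr])      # all memory accesses happen on the supply side
    return q

def compute(q):
    total = 0
    while q:
        total += q.popleft() * 2    # consumes already-fetched operands, no memory stalls
    return total

memory = {i: i for i in range(16)}  # toy address space
print(compute(supplier(range(0, 16, 2), memory)))
```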


international parallel and distributed processing symposium | 2011

Power Token Balancing: Adapting CMPs to Power Constraints for Parallel Multithreaded Workloads

Juan M. Cebrián; Juan L. Aragón; Stefanos Kaxiras

In recent years, virtually all processor architectures have moved to multiple cores per chip (CMPs). It is possible to use legacy (i.e., single-core) power-saving techniques in CMPs that run either sequential applications or independent multithreaded workloads. However, new challenges arise when running parallel shared-memory applications. In the latter case, sacrificing some performance in a single core (thread) in order to be more energy-efficient might unintentionally delay the rest of the cores (threads) due to synchronization points (locks/barriers), thereby harming the performance of the whole application. CMPs increasingly face thermal and power-related problems during their typical use. Such problems can be addressed by setting a power budget for the processor/core. This paper initially studies the behavior of different techniques for matching a predefined power budget in a CMP processor. While legacy techniques work properly for thread-independent/multiprogrammed workloads, parallel workloads exhibit the problem of independently adapting the power of each core in a thread-dependent scenario. To solve this problem we propose a novel mechanism, Power Token Balancing (PTB), aimed at accurately matching an external power constraint by balancing the power consumed among the different cores using a power-token-based approach while optimizing energy efficiency. Power (seen as tokens or coupons) from non-critical threads can be used for the benefit of critical threads. PTB is transparent for thread-independent/multiprogrammed workloads and can also be used as a spin-lock detector based on power patterns. Results show that PTB matches a predefined power budget more accurately than DVFS (total energy consumed over the budget is reduced to 8% for a 16-core CMP) with only a 3% energy increase. Finally, accuracy in matching the power budget can be traded for energy efficiency, reducing energy by a further 4% at the cost of 20% in accuracy.
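The token-balancing idea can be illustrated with a toy redistribution step for one interval; the per-core budget, the donate-half rule, and the critical/non-critical labels are illustrative assumptions, not the PTB policy.

```python
# Toy sketch of power-token balancing: non-critical cores (e.g. spinning at a
# barrier) donate unused tokens to critical cores; the chip budget is unchanged.
def rebalance_tokens(budget_per_core, critical):
    tokens = [budget_per_core] * len(critical)
    donated = 0
    for i, is_critical in enumerate(critical):
        if not is_critical:                  # non-critical core gives up half its tokens
            give = tokens[i] // 2
            tokens[i] -= give
            donated += give
    receivers = [i for i, c in enumerate(critical) if c]
    for i in receivers:                      # spread donated tokens over critical cores
        tokens[i] += donated // len(receivers)
    return tokens

print(rebalance_tokens(100, [True, False, False, True]))   # -> [150, 50, 50, 150]
```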


automation, robotics and control systems | 2010

MLP-Aware instruction queue resizing: the key to power-efficient performance

Pavlos Petoumenos; Georgia Psychou; Stefanos Kaxiras; Juan González; Juan L. Aragón

Several techniques aiming to improve power efficiency (measured as EDP) in out-of-order cores trade energy with performance. Prime examples are the techniques to resize the instruction queue (IQ). While most of them produce good results, they fail to take into account that changing the timing of memory accesses can have significant consequences on the memory-level parallelism (MLP) of the application and thus incur disproportionate performance degradation. We propose a novel mechanism that addresses this by collecting fine-grain information about the maximum IQ resizing that does not affect the MLP of the program. This information is used to override the resizing enforced by feedback mechanisms when that resizing might reduce MLP. We compare our technique to a previously proposed non-MLP-aware management technique, and our results show a significant increase in EDP savings for most benchmarks of the SPEC2000 suite.
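The override can be sketched as a simple clamp: the feedback-driven IQ size is never allowed to drop below a per-phase floor that preserves the observed MLP. The names, granularity, and numbers below are illustrative, not the paper's mechanism.

```python
# Sketch: clamp the feedback-chosen instruction-queue size so it never falls
# below the smallest size known (for the current program phase) to preserve MLP.
def iq_size(feedback_size: int, mlp_floor_for_phase: int, max_iq: int = 128) -> int:
    return min(max_iq, max(feedback_size, mlp_floor_for_phase))

# Example: feedback wants to shrink to 24 entries, but overlapping misses in
# this phase need at least 48 in-flight entries, so the resizing is overridden.
print(iq_size(feedback_size=24, mlp_floor_for_phase=48))   # -> 48
```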


ieee international conference on high performance computing data and analytics | 2001

Confidence Estimation for Branch Prediction Reversal

Juan L. Aragón; José González; José M. García; Antonio González

Branch prediction reversal has proved to be an effective alternative approach to reducing misprediction rates by adding a confidence estimator to a correlating branch predictor. This paper presents a Branch Prediction Reversal Unit (BPRU) especially oriented to enhancing correlating branch predictors, such as gshare and the Alpha 21264 metapredictor. The novelty of this proposal lies in the inclusion of data values in the confidence estimation process. Confidence metrics show that the BPRU can correctly tag 43% of branch mispredictions as low-confidence predictions, whereas the SBI (a previously proposed estimator) detects just 26%. Using the BPRU to reverse the gshare branch predictions leads to misprediction reductions of 15% for SPECint2000 (up to 27% for some applications). Furthermore, the BPRU+gshare predictor reduces the misprediction rate of SBI+gshare by an average of 10%. Performance evaluation of the BPRU in a superscalar processor obtains speedups of up to 9%. Similar results are obtained when the BPRU is combined with the Alpha 21264 branch predictor.
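Prediction reversal itself is straightforward to illustrate: when the confidence estimator tags a prediction as low-confidence, the final direction is the inverted prediction. The threshold below is a placeholder, and the sketch omits the BPRU's use of data values in the estimate.

```python
# Sketch of confidence-driven branch prediction reversal (illustrative threshold).
def final_prediction(predicted_taken: bool, confidence: float,
                     reverse_below: float = 0.4) -> bool:
    if confidence < reverse_below:      # low confidence: reverse the predictor's output
        return not predicted_taken
    return predicted_taken              # otherwise trust the correlating predictor

print(final_prediction(True, 0.2))   # -> False (reversed)
print(final_prediction(True, 0.9))   # -> True
```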


international on line testing symposium | 2011

An analytical model for the calculation of the Expected Miss Ratio in faulty caches

Daniel Sánchez; Yiannakis Sazeides; Juan L. Aragón; José M. García

Improvements in technology scaling are affecting the reliability of ICs due to increases in static and dynamic variations as well as wear-out failures. This is particularly true for caches, which dominate the area of modern processors and are built with minimum-sized, but failure-prone, SRAM cells. Our attempt to address this cache reliability challenge is an analytical model for determining the implications on cache miss rate of block disabling due to random cell failures. The proposed model is distinct from previous work in that it is an exact model rather than an approximation, and yet it is simpler than previous work. Its simplicity stems from the lack of fault maps in the analysis. The model's capabilities are illustrated through a study of cache miss-rate trends in future technology nodes. The model is also used to determine the accuracy of a random fault-map methodology. The analysis reveals, for the assumptions, programs, and cache configuration used in this study, a surprising result: a relatively small number of random fault maps, 100–1000, is sufficient to obtain accurate mean and standard-deviation values for the miss rate. Additional investigation revealed that the cause of this behavior is a high correlation between the number of accesses and the access distribution across cache sets.
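As context for the kind of inputs such a model takes, the sketch below computes, under independent random cell failures, the probability that a block is disabled and the expected number of usable ways per set. This is elementary probability used for illustration only, not the paper's exact miss-ratio model.

```python
# Illustrative only (not the paper's model): with independent per-cell failure
# probability p_cell, a block of block_bits cells is disabled if any cell fails.
def p_block_disabled(p_cell: float, block_bits: int) -> float:
    return 1.0 - (1.0 - p_cell) ** block_bits

def expected_usable_ways(p_cell: float, block_bits: int, assoc: int) -> float:
    return assoc * (1.0 - p_block_disabled(p_cell, block_bits))

# Example: 64-byte blocks (512 bits), 8-way sets, per-cell failure prob 1e-4.
print(p_block_disabled(1e-4, 512))          # ~0.05 probability a block is disabled
print(expected_usable_ways(1e-4, 512, 8))   # ~7.6 usable ways per set on average
```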

Collaboration


Dive into Juan L. Aragón's collaboration.

Top Co-Authors

Antonio González

Polytechnic University of Catalonia

José González

Polytechnic University of Catalonia
