Ramon Canal
Polytechnic University of Catalonia
Publications
Featured research published by Ramon Canal.
international symposium on microarchitecture | 2007
Xiaoyao Liang; Ramon Canal; Gu-Yeon Wei; David M. Brooks
Process variations will greatly impact the stability, leakage power consumption, and performance of future microprocessors. These variations are especially detrimental to 6T SRAM (6-transistor static memory) structures and will become critical with continued technology scaling. In this paper, we propose new on-chip memory architectures based on novel 3T1D DRAM (3-transistor, 1-diode dynamic memory) cells. We provide a detailed comparison between 6T and 3T1D designs in the context of an L1 data cache. The effects of physical device variation on a 3T1D cache can be lumped into variation of data retention times. This paper proposes a range of cache refresh and placement schemes that are sensitive to retention time, and we show that most of the retention time variations can be masked by the microarchitecture when using these schemes. We have performed detailed circuit and architectural simulations assuming different degrees of variability in advanced technology nodes, and we show that the resulting memory architecture can tolerate large process variations with little or even no impact on performance when compared to ideal 6T SRAM designs. Furthermore, these designs are robust to memory cell stability issues and can achieve large power savings. These advantages make the new memory architectures a promising choice for on-chip variation-tolerant cache structures required for next generation microprocessors.
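The core idea of a retention-time-sensitive refresh scheme can be sketched as follows. This is an illustration only: the per-line retention times, the safety margin, and the scheduling model are assumptions for the example, not values or structures from the paper.

```python
# Hypothetical sketch of a retention-time-aware refresh policy for a
# 3T1D-based cache. Each line's retention time varies with process
# variation; the policy refreshes each line safely before it expires,
# so variation is masked rather than forcing a worst-case interval.

def refresh_schedule(retention_times, margin=0.9):
    """Assign each cache line a refresh period that expires before
    its (variation-dependent) retention time. `margin` is an assumed
    safety factor."""
    return {line: rt * margin for line, rt in retention_times.items()}

def lines_to_refresh(schedule, last_refresh, now):
    """Return the lines whose refresh deadline has passed."""
    return [line for line, period in schedule.items()
            if now - last_refresh[line] >= period]

# Example: three lines with widely varying retention (in cycles).
retention = {"way0": 1000, "way1": 4000, "way2": 2500}
sched = refresh_schedule(retention)
last = {line: 0 for line in retention}
print(lines_to_refresh(sched, last, now=950))   # only the weak line
```

Only the variation-weakened line needs an early refresh; strong lines keep running untouched, which is where the power savings come from.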
high performance computer architecture | 2000
Ramon Canal; Joan-Manuel Parcerisa; Antonio González
Clustered microarchitectures are an effective approach to reducing the penalties caused by wire delays inside a chip. Current superscalar processors have in fact a two-cluster microarchitecture with a naive code partitioning approach: integer instructions are allocated to one cluster and floating-point instructions to the other. This partitioning scheme is simple and results in no communications between the two clusters (just through memory) but it is in general far from optimal because the workload is not evenly distributed most of the time. In fact, when the processor is running integer programs, the workload is extremely unbalanced since the FP cluster is not used at all. In this work we investigate run-time mechanisms that dynamically distribute the instructions of a program among these two clusters. By optimizing the trade-off between inter-cluster communication penalty and workload balance, the proposed schemes can achieve an average speed-up of 36% for the SpecInt95 benchmark suite.
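A minimal sketch of such a steering heuristic, in the spirit of the trade-off described above: pick the cluster that minimizes a cost combining inter-cluster communications and current load. The cost weights and the two-cluster model are assumptions for illustration, not the paper's actual mechanism.

```python
# Illustrative run-time steering heuristic: send each instruction to the
# cluster that best balances inter-cluster communication against
# workload balance. `comm_penalty` is an assumed cost per communication.

def steer(instr_srcs, cluster_of, load, comm_penalty=2):
    """Pick cluster 0 or 1 for an instruction, given which cluster
    produced each source operand and each cluster's pending workload."""
    best, best_cost = 0, float("inf")
    for c in (0, 1):
        # Communications needed: sources produced in the other cluster.
        comms = sum(1 for s in instr_srcs
                    if cluster_of.get(s) not in (None, c))
        cost = comm_penalty * comms + load[c]
        if cost < best_cost:
            best, best_cost = c, cost
    return best

cluster_of = {"r1": 0, "r2": 1}   # where each source value lives
load = [3, 0]                      # pending work per cluster
# Sources are split across clusters; the lightly loaded cluster wins.
print(steer(["r1", "r2"], cluster_of, load))   # → 1
```

When one cluster is idle (as the FP cluster is under integer workloads), the load term pulls instructions toward it as long as the communication penalty does not dominate.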
international symposium on microarchitecture | 2000
Ramon Canal; Antonio González; James E. Smith
Data, addresses, and instructions are compressed by maintaining only significant bytes with two or three extension bits appended to indicate the significant byte positions. This significance compression method is integrated into a 5-stage pipeline, with the extension bits flowing down the pipeline to enable pipeline operations only for the significant bytes. Consequently, register logic and cache activity (and dynamic power) are substantially reduced. An initial trace-driven study shows reduction in activity of approximately 30-40% for each pipeline stage. Several pipeline organizations are studied. A byte serial pipeline is the simplest implementation, but suffers a CPI (cycles per instruction) increase of 79% compared with a conventional 32-bit pipeline. Widening certain pipeline stages in order to balance processing bandwidth leads to an implementation with a CPI 24% higher than the baseline 32-bit design. Finally, full-width pipeline stages with operand gating achieve a CPI within 2-6% of the baseline 32-bit pipeline.
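The byte-significance encoding can be sketched directly. This is a simplified, zero-extension-only illustration of the idea (the actual scheme also covers sign-extended negative values); the two-bit encoding of the significant-byte count follows the abstract.

```python
# Simplified sketch of significance compression: store only the
# significant low-order bytes of a 32-bit value, plus extension bits
# encoding how many bytes are significant. Zero-extension only; the
# real scheme also handles sign-extended (negative) values.

def compress(word):
    """Return (extension_bits, significant_bytes) for a 32-bit value."""
    data = word.to_bytes(4, "little")
    n = 4
    while n > 1 and data[n - 1] == 0:   # drop high-order zero bytes
        n -= 1
    return n - 1, data[:n]              # 2 extension bits encode n-1

def decompress(ext, payload):
    """Rebuild the full 32-bit value from its significant bytes."""
    return int.from_bytes(payload + b"\x00" * (3 - ext), "little")

ext, payload = compress(0x000004D2)     # 1234: two significant bytes
print(ext, payload.hex())               # → 1 d204
assert decompress(ext, payload) == 0x000004D2
```

Small values such as loop counters and near addresses compress to one or two bytes, which is why gating pipeline operations on the extension bits removes so much register and cache activity.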
international conference on supercomputing | 2006
Matteo Monchiero; Ramon Canal; Antonio González
Multicore architectures are ruling the recent microprocessor design trend. This is due to different reasons: better performance, thread-level parallelism bounds in modern applications, ILP diminishing returns, better thermal/power scaling (many small cores dissipate less than a large and complex one), and the ease and reuse of design. This paper presents a thorough evaluation of multicore architectures. The architecture we target is composed of a configurable number of cores, a memory hierarchy consisting of private L1 and L2 caches, and a shared bus interconnect. We consider parallel shared memory applications. We explore the design space related to the number of cores, L2 cache size and processor complexity, showing the behavior of the different configurations/applications with respect to performance, energy consumption and temperature. Design tradeoffs are analyzed, stressing the interdependency of the metrics and design factors. In particular, we evaluate several chip floorplans. Their power/thermal characteristics are analyzed and they show the importance of considering thermal effects at the architectural level to achieve the best design choice.
international conference on supercomputing | 2000
Ramon Canal; Antonio González
One of the main concerns in today's processor design is the issue logic. Instruction-level parallelism is usually favored by an out-of-order issue mechanism where instructions can issue independently of the program order. The out-of-order scheme yields the best performance but at the same time introduces important hardware costs such as an associative look-up, which might be prohibitive for wide issue processors with large instruction windows. This associative search may slow down the clock rate and has a significant impact on power consumption. In this work, two new issue schemes that reduce the hardware complexity of the issue logic with minimal impact on the average number of instructions executed per cycle are presented.
international conference on supercomputing | 2001
Ramon Canal; Antonio González
The issue logic of dynamically scheduled superscalar processors is one of their most complex and power-consuming parts. In this paper we present alternative issue-logic designs that are much simpler than the traditional scheme while they retain most of its ability to exploit ILP. These alternative schemes are based on the observation that most values produced by a program are used by very few instructions, and the latencies of most operations are deterministic.
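The consequence of deterministic latencies can be illustrated with a small scheduling sketch: if an operation's latency is known when it issues, the issue cycle of every dependent can be precomputed, so no associative wakeup search is needed. The latency table and the in-order dependence model below are assumptions for the example, not the paper's design.

```python
# Sketch of latency-based scheduling: because most operation latencies
# are deterministic, a dependent instruction's issue cycle can be
# computed up front instead of being discovered via associative wakeup.

LAT = {"add": 1, "mul": 3, "load": 2}   # assumed fixed latencies

def schedule(instrs):
    """instrs: list of (name, op, src_names), in dependence order.
    Returns the cycle at which each instruction can issue."""
    ready_at = {}   # cycle at which each produced value is available
    issue = {}
    for name, op, srcs in instrs:
        start = max((ready_at.get(s, 0) for s in srcs), default=0)
        issue[name] = start
        ready_at[name] = start + LAT[op]   # known at issue time
    return issue

prog = [("a", "load", []), ("b", "mul", ["a"]), ("c", "add", ["a", "b"])]
print(schedule(prog))   # → {'a': 0, 'b': 2, 'c': 5}
```

Since each value is consumed by very few instructions, the precomputed wakeups can be stored in a small direct table rather than broadcast to the whole window.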
IEEE Transactions on Parallel and Distributed Systems | 2008
Matteo Monchiero; Ramon Canal; Antonio González
Multicore architectures have been ruling the recent microprocessor design trend. This is due to different reasons: better performance, thread-level parallelism bounds in modern applications, ILP diminishing returns, better thermal/power scaling (many small cores dissipate less than a large and complex one), and the ease and reuse of design. This paper presents a thorough evaluation of multicore architectures. The architecture that we target is composed of a configurable number of cores, a memory hierarchy consisting of private L1, shared/private L2, and a shared bus interconnect. We consider a benchmark set composed of several parallel shared memory applications. We explore the design space related to the number of cores, L2 cache size, and processor complexity, showing the behavior of the different configurations/applications with respect to performance, energy consumption, and temperature. Design trade-offs are analyzed, stressing the interdependency of the metrics and design factors. In particular, we evaluate several chip floorplans. Their power/thermal characteristics are analyzed, showing the importance of considering thermal effects at the architectural level to achieve the best design choice.
international symposium on computer architecture | 2010
Enric Herrero; Jose Gonzalez; Ramon Canal
Next generation tiled microarchitectures are going to be limited by off-chip misses and by on-chip network usage. Furthermore, these platforms will run a heterogeneous mix of applications with very different memory needs, leading to significant optimization opportunities. Existing adaptive memory hierarchies use either centralized structures that limit the scalability or software based resource allocation that increases programming complexity. We propose Elastic Cooperative Caching, a dynamic and scalable memory hierarchy that adapts automatically and autonomously to application behavior for each node. Our configuration uses elastic shared/private caches with fully autonomous and distributed repartitioning units for better scalability. Furthermore, we have extended our elastic configuration with an Adaptive Spilling mechanism to use the shared cache space only when it can produce a performance improvement. Elastic caches allow both the creation of big local private caches for threads with high reuse of private data and the creation of big shared spaces from unused caches. Allocating data locally in private regions reduces network usage, and efficient cache partitioning reduces off-chip misses. The proposed scheme outperforms previous proposals by a minimum of 12% (on average across the benchmarks) and reduces the number of off-chip misses by 16%. Moreover, the dynamic and autonomous management of cache resources avoids reallocating cache blocks that exhibit no reuse, which results in an increase in energy efficiency of 24%.
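A per-node repartitioning unit of this kind can be sketched as a simple feedback rule: grow the private region when local reuse dominates, lend ways to the shared space when it does not. The way counts, counters, and thresholds below are illustrative assumptions, not the paper's hardware.

```python
# Illustrative sketch of a distributed repartitioning unit in the
# spirit of Elastic Cooperative Caching: each node independently
# adjusts the private/shared split of its local cache from observed
# hit counters. Thresholds and way counts are assumptions.

def repartition(private_ways, local_hits, spilled_hits,
                total_ways=8, threshold=0.6):
    """Shift one way per epoch toward whichever region earns more hits:
    `local_hits` in the node's private region, `spilled_hits` in the
    shared space built from lent ways."""
    total = local_hits + spilled_hits
    if total == 0:
        return private_ways
    if local_hits / total > threshold and private_ways < total_ways:
        return private_ways + 1     # high private reuse: grow private
    if local_hits / total < 1 - threshold and private_ways > 0:
        return private_ways - 1     # cache underused locally: lend it
    return private_ways

print(repartition(4, local_hits=90, spilled_hits=10))   # → 5
print(repartition(4, local_hits=10, spilled_hits=90))   # → 3
```

Because each node runs this rule on its own counters, no centralized structure is needed, which is what makes the scheme scale with the number of tiles.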
international conference on parallel architectures and compilation techniques | 1999
Ramon Canal; Joan-Manuel Parcerisa; Antonio González
In current superscalar processors, all floating-point resources are idle during the execution of integer programs. As previous work shows, this problem can be alleviated if the floating-point cluster is extended to execute simple integer instructions. With minor hardware modifications to a conventional superscalar processor, the issue width can potentially be doubled without increasing the hardware complexity. In fact, the result is a clustered architecture with two heterogeneous clusters. We propose to extend this architecture with a dynamic steering logic that sends the instructions to either cluster. The performance of clustered architectures depends on the inter-cluster communication overhead and the workload balance. We present a scheme that uses run-time information to optimise the trade-off between these two factors. The evaluation shows that this scheme can achieve an average speed-up of 35% over a conventional 8-way issue (4 int+4 fp) machine and that it outperforms the previously proposed one.
international symposium on computer architecture | 2013
Naifeng Jing; Yao Shen; Yao Lu; Shrikanth Ganapathy; Zhigang Mao; Minyi Guo; Ramon Canal; Xiaoyao Liang
The heavily-threaded data processing demands of streaming multiprocessors (SM) in a GPGPU require a large register file (RF). The fast increasing size of the RF makes the area cost and power consumption unaffordable for traditional SRAM designs in the future technologies. In this paper, we propose to use embedded-DRAM (eDRAM) as an alternative in future GPGPUs. Compared with SRAM, eDRAM provides higher density and lower leakage power. However, the limited data retention time in eDRAM poses new challenges. Periodic refresh operations are needed to maintain data integrity. This is exacerbated by the scaling of eDRAM density, process variations and temperature. Unlike conventional CPUs which make use of multi-ported RFs, most of the RFs in modern GPGPUs are heavily banked but not multi-ported to reduce the hardware cost. This provides a unique opportunity to hide the refresh overhead. We propose two different eDRAM implementations based on 3T1D and 1T1C memory cells. To mitigate the impact of periodic refresh, we propose two novel refresh solutions using bank bubble and bank walk-through. In addition, for the 1T1C RF, we design an interleaved bank organization together with an intelligent warp scheduling strategy to reduce the impact of the destructive reads. The analysis shows that our schemes present better energy efficiency, scalability and variation tolerance than traditional SRAM-based designs.
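The bank-bubble opportunity described above can be sketched with a toy model: in a banked register file, some banks sit idle each cycle, and those idle slots can absorb refresh operations at no performance cost. The bank count, round-robin order, and access trace are assumptions for illustration, not the paper's scheduler.

```python
# Minimal sketch of the "bank bubble" idea: a banked (not multi-ported)
# GPGPU register file leaves some banks unaccessed each cycle, and an
# idle bank can be refreshed in that bubble, hiding refresh overhead.

def refresh_in_bubbles(accesses, num_banks=4):
    """accesses: per-cycle sets of banks the scheduler reads/writes.
    Returns, per cycle, the idle bank refreshed round-robin (or None
    when every bank is busy and the refresh must wait)."""
    refreshed = []
    nxt = 0   # round-robin refresh pointer
    for busy in accesses:
        chosen = None
        for _ in range(num_banks):
            bank = nxt % num_banks
            nxt += 1
            if bank not in busy:
                chosen = bank
                break
        refreshed.append(chosen)
    return refreshed

trace = [{0, 1}, {2}, {0, 1, 2, 3}, set()]
print(refresh_in_bubbles(trace))   # → [2, 3, None, 0]
```

Only a fully occupied cycle (all banks busy) forces a refresh to wait, which is rare when register accesses are spread across many banks.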