Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Vicent Selfa is active.

Publication


Featured research published by Vicent Selfa.


Journal of Parallel and Distributed Computing | 2017

A research-oriented course on Advanced Multicore Architecture

Salvador Petit; Julio Sahuquillo; María E. Gómez; Vicent Selfa

The fast evolution of multicore processors makes it difficult for professors to offer computer architecture courses with updated contents. To deal with this shortcoming, which could discourage students, the most appropriate solution is a research-oriented course based on current microprocessor industry trends. Additionally, we also seek to improve the students' skills by applying active learning methodologies, where teachers act as guides and resource providers while students take responsibility for their own learning. In this paper, we present the Advanced Multicore Architecture (AMA) course, which follows a research-oriented approach to introduce students to architectural breakthroughs and uses active learning methodologies to enable students to develop practical research skills such as critical analysis of research papers and communication abilities. To this end, five main activities are used: (i) lectures dealing with key theoretical concepts, (ii) paper review and discussion, (iii) research-oriented practical exercises, (iv) lab sessions with a state-of-the-art multicore simulator, and (v) paper presentation. An important part of all these activities is driven by active learning methodologies. Special emphasis is put on the practical side by allocating 40% of the time to labs and exercises. This work also includes an assessment study that analyzes both the course contents and the methodology used, comparing both with other courses. This work was supported in part by the Spanish Ministerio de Economía y Competitividad (MINECO) and by Plan E funds under Grants TIN2014-62246-EXP and TIN2015-66972-C5-1-R, and by Generalitat Valenciana under Grant AICO/2016/059. The authors would also like to thank Onur Mutlu for making his valuable teaching material available online.


Parallel, Distributed and Network-Based Processing | 2016

A Simple Activation/Deactivation Prefetching Scheme for Chip Multiprocessors

Vicent Selfa; Crispín Gómez; María Engracia Gómez; Julio Sahuquillo

Prefetching significantly reduces the memory latencies of a wide range of applications and thus increases system performance. However, as a speculative technique, prefetching may also noticeably increase the number of memory accesses, which in turn may negatively impact main memory bandwidth consumption, performance, and power. Main memory bandwidth is an especially critical resource in current multicore processors, since memory requests from all the cores, both prefetch and demand requests, compete with each other for access to the DRAM banks. Consequently, demand requests may be delayed, hurting system performance. This work proposes the Activation/Deactivation Policies (ADP) scheme for hardware prefetchers in multicore processors. This scheme relies on activation policies that turn on the prefetcher of a given core when prefetches are expected to improve performance, and turn off the prefetcher of that core when little or no performance improvement is foreseen. The proposed mechanism effectively reduces the memory bandwidth requirements of some cores with respect to a typical always-prefetching mechanism, thus making extra bandwidth available to the co-runners. Results in a four-core processor show that ADP prefetching achieves performance within ±2.5% of always prefetching, while significantly reducing the memory bandwidth consumed by useless prefetches; in some applications this reduction reaches 50%. ADP prefetching is applicable to stream-based prefetchers, global-history-buffer delta correlation prefetchers, and PC-based stride prefetchers.
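
As a rough illustration of the idea, the sketch below expresses a per-core activation/deactivation policy in Python; the interval-based structure, counter names, and thresholds are assumptions for illustration and are not taken from the paper.

    # Hypothetical sketch of a per-core prefetcher activation/deactivation policy
    # in the spirit of ADP. Counter names and thresholds are illustrative only.

    class CorePrefetchController:
        def __init__(self, accuracy_threshold=0.4, bandwidth_threshold=0.8):
            self.accuracy_threshold = accuracy_threshold    # minimum useful-prefetch ratio
            self.bandwidth_threshold = bandwidth_threshold  # fraction of peak DRAM bandwidth
            self.enabled = True                             # prefetcher starts active

        def end_of_interval(self, useful_prefetches, issued_prefetches, bus_utilization):
            """Re-evaluate the prefetcher state at the end of each sampling interval."""
            accuracy = useful_prefetches / issued_prefetches if issued_prefetches else 0.0
            if self.enabled:
                # Deactivate when prefetches are rarely useful and bandwidth is scarce,
                # freeing bandwidth for demand requests from the co-runners.
                if accuracy < self.accuracy_threshold and bus_utilization > self.bandwidth_threshold:
                    self.enabled = False
            else:
                # Reactivate when bandwidth pressure eases and prefetching may pay off again.
                if bus_utilization < self.bandwidth_threshold:
                    self.enabled = True
            return self.enabled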


Parallel, Distributed and Network-Based Processing | 2015

Methodologies and Performance Metrics to Evaluate Multiprogram Workloads

Vicent Selfa; Julio Sahuquillo; Crispín Gómez; María Engracia Gómez

Multicore processors are dominating the microprocessor market and most research work has moved to this kind of processor. Multicore research methods are still immature and are evolving from their single-threaded processor counterparts. Three main research issues must be faced when evaluating performance and energy in multicores. First, multiple simulation methodologies are being applied to evaluate these systems, with no agreement on which to use. Second, due to the nature of multiprogram workloads, new performance metrics are required, different from those used for single-threaded processors. Many metrics have been defined, and distinct metrics are used across published works. Finally, multicore processors are complex systems that require sophisticated and complementary (e.g., energy and performance) simulators. This paper aims to help researchers address these three research issues. For this purpose, we compare how they are handled across 28 papers published in 2013 in top computer architecture conferences. Both analytical examples and experimental results are presented with the aim of providing some insights into multicore research.
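
For context, the most commonly used multiprogram metrics can be computed directly from per-application IPCs measured alone and in the mix. The sketch below shows raw IPC throughput, weighted speedup, and the harmonic mean of speedups, which are standard metrics in the multicore literature rather than code from the paper.

    # Standard multiprogram performance metrics computed from per-application IPCs.
    # ipc_alone[i]: IPC of application i running in isolation.
    # ipc_mix[i]:   IPC of application i running inside the multiprogram workload.

    def ipc_throughput(ipc_mix):
        return sum(ipc_mix)                     # raw system throughput

    def weighted_speedup(ipc_alone, ipc_mix):
        return sum(m / a for m, a in zip(ipc_mix, ipc_alone))

    def harmonic_speedup(ipc_alone, ipc_mix):
        return len(ipc_mix) / sum(a / m for m, a in zip(ipc_mix, ipc_alone))

    # Worked example: two applications, both slowed down when sharing resources.
    ipc_alone = [2.0, 1.0]
    ipc_mix   = [1.5, 0.6]
    print(weighted_speedup(ipc_alone, ipc_mix))   # 1.5/2.0 + 0.6/1.0 = 1.35
    print(harmonic_speedup(ipc_alone, ipc_mix))   # 2 / (2.0/1.5 + 1.0/0.6) = 0.67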


Parallel, Distributed and Network-Based Processing | 2015

Row Tables: Design Choices to Exploit Bank Locality in Multiprogram Workloads

Paula Navarro; Vicent Selfa; Julio Sahuquillo; María Engracia Gómez; Crispín Gómez

Main memory is a major performance bottleneck in current chip multiprocessors. Current DRAM banks latch the last accessed row in an internal buffer, namely the row buffer (RB), which allows fast subsequent accesses to that row. This throughput-oriented approach was originally designed for single-threaded processors and aims to take advantage of the spatial locality that individual applications exhibit. This paper proposes row tables, a pool of row buffers shared among threads. Depending on the needs of each thread, row buffers are dynamically allocated to threads. Two design approaches are devised, differing in the table location and referred to as BRT (Bank Row Table) and CRT (Controller Row Table), which place the table at the bank, as traditionally done in existing modules, and at the memory controller side, respectively. CRT performs better than BRT for applications (or mixes) with high RB locality, but performs worse for applications with poor RB locality, since the increase in transfer times is not later amortized. A variant of CRT, referred to as CRT 1/x, has been devised to reduce this performance penalty. Results for a 4-core system show that, on average, the BRT and CRT 1/x mechanisms save energy by 23% and 7%-16% (depending on the x value) and improve IPC by 10% and 9%-14%, respectively.
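
A toy sketch of the controller-side variant (CRT) is given below, assuming a small fully-associative table indexed by (bank, row); the LRU replacement shown is an assumption for illustration and not necessarily the allocation policy evaluated in the paper.

    # Toy sketch of a controller-side row table (CRT-like): a shared pool of
    # row-buffer entries indexed by (bank, row). LRU eviction is an assumption.
    from collections import OrderedDict

    class RowTable:
        def __init__(self, num_entries):
            self.num_entries = num_entries
            self.entries = OrderedDict()        # (bank, row) -> buffered row (abstracted)

        def access(self, bank, row):
            key = (bank, row)
            if key in self.entries:
                self.entries.move_to_end(key)           # row hit: refresh recency
                return "row-hit"
            if len(self.entries) >= self.num_entries:   # table full: evict LRU entry
                self.entries.popitem(last=False)
            self.entries[key] = "row-data"              # bring the row from the DRAM bank
            return "row-miss"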


European Conference on Parallel Processing | 2018

Improving System Turnaround Time with Intel CAT by Identifying LLC Critical Applications

Lucia Pons; Vicent Selfa; Julio Sahuquillo; Salvador Petit; Julio Pons

Resource sharing is a major concern in current multicore processors. Among the shared system resources, the Last Level Cache (LLC) is one of the most critical, since destructive interference between applications accessing it implies more off-chip accesses to main memory, which incur long latencies that can severely impact overall system performance. To help alleviate this issue, current processors implement huge LLCs, but even so, inter-application interference can harm the performance of a subset of the running applications when executing multiprogram workloads. For this reason, recent Intel processors feature Cache Allocation Technology (CAT) to partition the cache and assign subsets of cache ways to groups of applications. This paper proposes the Critical-Aware (CA) LLC partitioning approach, which leverages CAT to improve the performance of multiprogram workloads by identifying and protecting the applications whose performance is most damaged by LLC sharing. Experimental results show that CA improves turnaround time by 15% on average, and by up to 40%, compared to a baseline system without partitioning.
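
On Linux, Intel CAT is exposed through the resctrl filesystem, so a partitioning policy like CA can be driven from software by writing way masks and task lists. In the sketch below, the group name, way mask, and PIDs are hypothetical, and only the assignment mechanism is shown, not the CA identification algorithm itself.

    # Illustrative use of Intel CAT through the Linux resctrl interface: assign an
    # exclusive subset of LLC ways to a group of LLC-critical applications.
    # Group name, way mask, and PIDs are hypothetical.
    import os

    RESCTRL = "/sys/fs/resctrl"   # requires: mount -t resctrl resctrl /sys/fs/resctrl

    def create_partition(name, l3_way_mask_hex, pids):
        group = os.path.join(RESCTRL, name)
        os.makedirs(group, exist_ok=True)
        # Restrict the group to the given LLC ways on cache domain 0.
        with open(os.path.join(group, "schemata"), "w") as f:
            f.write("L3:0=%s\n" % l3_way_mask_hex)
        # Move the critical applications into the group, one PID per write.
        for pid in pids:
            with open(os.path.join(group, "tasks"), "w") as f:
                f.write(str(pid))

    # Hypothetical example: protect two critical applications with 4 of 20 ways.
    create_partition("critical", "f0000", [1234, 1235])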


Journal of Parallel and Distributed Computing | 2018

Efficient selective multicore prefetching under limited memory bandwidth

Vicent Selfa; Julio Sahuquillo; María Engracia Gómez; Crispín Gómez

Current multicore systems implement multiple hardware prefetchers to tolerate long main memory latencies. However, memory bandwidth is a scarce shared resource which becomes critical with the increasing core count. To deal with this fact, recent works have focused on adaptive prefetchers, which control prefetcher aggressiveness to regulate main memory bandwidth consumption. Nevertheless, in limited-bandwidth machines or under memory-hungry workloads, keeping the prefetcher active can damage system performance and increase energy consumption. This paper introduces selective prefetching, where individual prefetchers are activated or deactivated to improve both main memory energy and performance, and proposes ADP, a prefetcher that deactivates local prefetchers in some cores when they show low performance and co-runners need additional bandwidth. Based on heuristics, an individual prefetcher is reactivated when performance enhancements are foreseen. Compared to a state-of-the-art adaptive prefetcher, ADP provides both performance and energy enhancements under limited memory bandwidth.


IEEE Transactions on Parallel and Distributed Systems | 2017

A Hardware Approach to Fairly Balance the Inter-Thread Interference in Shared Caches

Vicent Selfa; Julio Sahuquillo; Salvador Petit; María Engracia Gómez

Shared caches have become the common design choice in the vast majority of modern multi-core and many-core processors, since cache sharing improves throughput for a given silicon area. Sharing the cache, however, has a downside: requests from multiple applications compete with each other for cache resources, so the execution time of each application increases over isolated execution. The degree to which the performance of each application is affected by this interference is unpredictable, leading the system to unfairness. This paper proposes Fair-Progress Cache Partitioning (FPCP), a low-overhead hardware-based cache partitioning approach that addresses system fairness. FPCP reduces the interference by allocating a cache partition to each application and adjusting the partition sizes at runtime. To adjust partitions, our approach estimates during multicore execution the time each application would have taken in isolation, which is challenging. The proposed approach has two main differences from existing approaches. First, FPCP distributes cache ways incrementally, which makes the proposal less prone to estimation errors. Second, the proposed algorithm is much less costly than the state-of-the-art ASM-Cache approach. Experimental results show that, compared to ASM-Cache, FPCP reduces unfairness by 48 percent in four-application workloads and by 28 percent in eight-application workloads, without harming performance.
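
The incremental aspect can be pictured with the sketch below: at the end of each interval, a single way is moved from the application suffering the least slowdown to the one suffering the most. The interval logic and slowdown inputs are assumptions for illustration, not the FPCP algorithm verbatim.

    # Schematic sketch of incremental, fairness-driven cache way reallocation.
    # ways[i]:     LLC ways currently assigned to application i.
    # slowdown[i]: estimated shared time / isolated time for application i (>= 1.0).

    def rebalance_one_way(ways, slowdown, min_ways=1):
        """Move a single way from the least-slowed application to the most-slowed one."""
        victim = min(range(len(ways)), key=lambda i: slowdown[i])       # least harmed
        beneficiary = max(range(len(ways)), key=lambda i: slowdown[i])  # most harmed
        if victim != beneficiary and ways[victim] > min_ways:
            ways[victim] -= 1
            ways[beneficiary] += 1
        return ways

    # Example: application 2 suffers the largest slowdown, application 0 the smallest.
    print(rebalance_one_way([4, 4, 4], [1.05, 1.30, 1.80]))   # -> [3, 4, 5]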


International Conference on Parallel Architectures and Compilation Techniques | 2016

Student Research Poster: A Low Complexity Cache Sharing Mechanism to Address System Fairness

Vicent Selfa; Julio Sahuquillo; Salvador Petit; María Engracia Gómez

Shared caches have become, de facto, the common design choice in current multi-cores, ranging from embedded devices to high-performance processors. In these systems, requests from multiple applications compete for cache resources, degrading their progress (quantified as the performance of each application relative to isolated execution) to different extents. The differences among the progress of the running applications lead to unpredictable system behavior and cause a fairness problem. This problem can be addressed by carefully partitioning cache resources among the contending applications, but to be effective, a partitioning approach needs to estimate per-application progress.


International Parallel and Distributed Processing Symposium | 2015

A Research-Oriented Course on Advanced Multicore Architecture

Julio Sahuquillo; Salvador Petit; Vicent Selfa; María Engracia Gómez

Multicore processors have become ubiquitous in everyday devices such as smartphones and tablets. In fact, they are present in almost all segments of the computing market, from supercomputers to embedded devices. Intense market competition has led industry and academia to develop rapid technological and architectural advances. The fast evolution that current multicores are still experiencing makes it difficult for instructors to offer computer architecture courses with updated contents, preferably showing the industry and academia research trends. To deal with this shortcoming, the authors consider that a research-oriented course is the most appropriate solution. This paper presents an advanced computer architecture course called Advanced Multicore Architectures, offered in 2015. The course covers the basic topics of multicore architecture and has been organized into four main modules covering multicore basics, performance evaluation, advanced caching, and main memory organization. The course follows a research-oriented approach in which theoretical concepts are covered in lectures where recent research papers are analyzed to give students a broad view of current trends. Moreover, additional teaching methods, such as lab sessions with a state-of-the-art multicore simulator and research-oriented exercises, have been used with the aim of introducing students to research in these topics. To achieve this fully research-oriented methodology, about 40% of the time is devoted to labs and exercises.


International Conference on Parallel Architectures and Compilation Techniques | 2017

Application Clustering Policies to Address System Fairness with Intel’s Cache Allocation Technology

Vicent Selfa; Julio Sahuquillo; Lieven Eeckhout; Salvador Petit; María Engracia Gómez

Collaboration


Dive into Vicent Selfa's collaborations.

Top Co-Authors

Julio Sahuquillo
Polytechnic University of Valencia

María Engracia Gómez
Polytechnic University of Valencia

Salvador Petit
Polytechnic University of Valencia

Crispín Gómez
Polytechnic University of Valencia

Julio Pons
Polytechnic University of Valencia

Lucia Pons
Polytechnic University of Valencia

Jan G. Cornelis
Free University of Brussels

Jan Lemeire
Free University of Brussels

Peter Schelkens
Katholieke Universiteit Leuven