
Publications


Featured research published by Pablo Prieto.


International Symposium on Computer Architecture | 2007

Rotary router: an efficient architecture for CMP interconnection networks

Pablo Abad; Valentin Puente; José A. Gregorio; Pablo Prieto

The trend towards increasing the number of processor cores and cache capacity in future Chip-Multiprocessors (CMPs) will require scalable packet-switched interconnection networks adapted to the restrictions imposed by the CMP environment. This paper presents an innovative router design which successfully addresses CMP cost/performance constraints. The router structure is based on two independent rings, which force packets to circulate either clockwise or anti-clockwise, traveling through every port of the router. It uses a completely decentralized scheduling scheme, which allows the design to: (1) take advantage of wide links, (2) reduce Head-of-Line blocking, (3) use adaptive routing, (4) be topology agnostic, (5) scale with network degree, and (6) have reasonable power consumption and implementation cost. A thorough comparative performance analysis against competitive conventional routers shows an advantage for our proposal of up to 50% in terms of raw performance and nearly 60% in terms of energy-delay product.
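
Purely as an illustration of the two-ring circulation idea described in the abstract (not the published Rotary Router microarchitecture), the following Python sketch picks the ring direction that reaches a packet's output port in fewer stops; the port list and the selection rule are assumptions made for this example.

# Toy model of a ring-based router: packets circulate clockwise or
# anti-clockwise over the router ports until they reach their output.
# The port list and the "fewest hops" selection rule are illustrative
# assumptions, not the published Rotary Router design.

PORTS = ["N", "E", "S", "W", "LOCAL"]  # ports visited by both internal rings

def hops(src: str, dst: str, direction: int) -> int:
    """Number of ring stops from src to dst moving +1 (clockwise) or -1."""
    i, j = PORTS.index(src), PORTS.index(dst)
    n = len(PORTS)
    return (j - i) % n if direction == 1 else (i - j) % n

def choose_ring(input_port: str, output_port: str) -> str:
    """Pick the ring (clockwise or anti-clockwise) with the shorter path."""
    cw = hops(input_port, output_port, +1)
    ccw = hops(input_port, output_port, -1)
    return "clockwise" if cw <= ccw else "anti-clockwise"

if __name__ == "__main__":
    print(choose_ring("N", "W"))   # anti-clockwise (2 hops vs 3)
    print(choose_ring("N", "E"))   # clockwise (1 hop)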


Networks-on-Chip | 2012

TOPAZ: An Open-Source Interconnection Network Simulator for Chip Multiprocessors and Supercomputers

Pablo Abad; Pablo Prieto; Lucia G. Menezo; Adrián Colaso; Valentin Puente; José-Ángel Gregorio

As in other computer architecture areas, interconnection network research relies most of the time on simulation tools. This paper announces the release of an open-source tool suitable for accurate modeling of interconnection networks ranging from small CMPs to large supercomputers. The cycle-accurate models of TOPAZ can be used standalone, driven by synthetic traffic patterns and application traces, or effortlessly within full-system evaluation environments such as GEMS or gem5. In fact, we provide an advanced interface that enables the replacement of the original lightweight but optimistic GEMS and gem5 network simulators with limited impact on simulation time. Our tests indicate that, in this context, an oversimplified network model could induce up to 50% error in the performance estimation of the simulated system. To minimize the impact of detailed network modeling on simulation time, we incorporate mechanisms that attenuate the extra computational effort, reducing the slowdown of full-system simulation while keeping accurate performance estimations. Additionally, in order to evaluate large-scale networks, we parallelize the simulator to exploit the growing number of cores available per chip in simulation farms while optimizing memory resources. This allows us to simulate networks exceeding one million routers with up to 70% efficiency in a multithreaded simulation running on twelve cores.


ACM SIGARCH Computer Architecture News | 2008

SP-NUCA: a cost effective dynamic non-uniform cache architecture

Javier Merino; Valentin Puente; Pablo Prieto; José A. Gregorio

This paper presents a simple but effective method to reduce on-chip access latency and improve core isolation in CMP Non-Uniform Cache Architectures (NUCA). The paper introduces a feasible way to allocate cache blocks according to their access pattern. Each L2 bank is dynamically partitioned at set level into private and shared content. Simply by adjusting the replacement algorithm, we can place private data closer to its owner processor. In contrast, shared data is always placed in the same position, independently of the accessing processor. This approach is capable of reducing on-chip latency without significantly sacrificing hit rates or increasing the implementation cost over a conventional static NUCA. Additionally, most of the unnecessary interference between cores in private accesses is removed. To support the architectural decisions adopted and provide a comparative study, a comprehensive evaluation framework is employed. The workbench is composed of a full-system simulator and a representative set of multithreaded and multiprogrammed workloads. With this infrastructure, different alternatives for the coherence protocol, replacement policies, and cache utilization are analyzed to find the optimal proposal. We conclude that the cost of a feasible implementation should be close to that of a conventional static NUCA and significantly lower than that of a dynamic NUCA. Finally, a comparison with static and dynamic NUCA is presented. The simulation results suggest that, on average, the proposed mechanism could improve system performance over a static NUCA and an idealized dynamic NUCA by 16% and 6%, respectively.
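
Loosely illustrating the private/shared placement described above (not the paper's actual SP-NUCA implementation), the sketch below classifies a block by its accessors and maps private blocks to a bank near the owning core, while shared blocks always map to the same address-interleaved bank; the bank count, block size and mapping functions are assumptions.

# Illustrative sketch of private/shared placement in a NUCA L2.
# The bank layout, owner-to-bank mapping and classification rule are
# assumptions for the example, not the published SP-NUCA mechanism.

NUM_BANKS = 16

def classify(block_accessors: set[int]) -> str:
    """A block touched by a single core is private, otherwise shared."""
    return "private" if len(block_accessors) == 1 else "shared"

def home_bank(block_addr: int, block_accessors: set[int]) -> int:
    if classify(block_accessors) == "private":
        # Private data: keep it in the bank closest to its only accessor
        # (here, trivially, the bank with the same index as the core).
        return next(iter(block_accessors)) % NUM_BANKS
    # Shared data: always the same address-interleaved bank, so every
    # core finds it in a predictable place regardless of who accesses it.
    return (block_addr >> 6) % NUM_BANKS  # 64-byte blocks assumed

if __name__ == "__main__":
    print(home_bank(0x12345, {3}))        # private -> bank 3
    print(home_bank(0x12345, {3, 7}))     # shared  -> interleaved bank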


IEEE Computer Architecture Letters | 2011

Multilevel Cache Modeling for Chip-Multiprocessor Systems

Pablo Prieto; Valentin Puente; José-Ángel Gregorio

This paper presents a simple analytical model for predicting on-chip cache hierarchy effectiveness in chip multiprocessors (CMPs) for a state-of-the-art architecture. Given the complexity of this type of system, we use rough approximations, such as the empirical observation that the re-reference timing pattern follows a power law and the assumption of a simplistic delay model for the cache, in order to provide a useful model of memory hierarchy responsiveness. This model enables the analytical determination of average access time, which is useful for pruning the vast design space of this class of systems before detailed exploration. The model is also useful for predicting cache hierarchy behavior in future systems. The fidelity of the model has been validated using a state-of-the-art, full-system simulation environment, on a system with up to sixteen out-of-order processors and coherent caches, using a broad spectrum of applications, including complex multithreaded workloads. This simple model can predict a near-to-optimal on-chip cache distribution while also estimating how future systems running future applications might behave.
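
The average-access-time idea can be sketched with a minimal, hedged example: assuming a power-law miss ratio per level and a simplistic serialized delay model, the average memory access time of a hierarchy follows directly. The constants, the exact form of the power law and the latency values below are assumptions, not the paper's calibrated model.

# Minimal, illustrative average memory access time (AMAT) model.
# Assumption: the miss ratio of a cache of capacity C follows a power law
# m(C) = m0 * (C / C0) ** (-alpha), in the spirit of the observation that
# re-reference distances follow a power law. Constants are made up.

def miss_ratio(capacity_kb: float, m0: float = 0.05,
               c0_kb: float = 32.0, alpha: float = 0.5) -> float:
    return min(1.0, m0 * (capacity_kb / c0_kb) ** (-alpha))

def amat(levels: list[tuple[float, float]], mem_latency: float) -> float:
    """levels: [(capacity_kb, hit_latency_cycles), ...] ordered L1..LN."""
    total, reach_prob = 0.0, 1.0
    for capacity_kb, hit_latency in levels:
        total += reach_prob * hit_latency      # accesses reaching this level pay its latency
        reach_prob *= miss_ratio(capacity_kb)  # fraction that continues to the next level
    return total + reach_prob * mem_latency    # remaining misses go off-chip

if __name__ == "__main__":
    # Hypothetical 3-level hierarchy: 32 KB L1, 256 KB L2, 8 MB shared L3.
    print(amat([(32, 2), (256, 10), (8192, 30)], mem_latency=200))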


International Conference on Supercomputing | 2013

CMP off-chip bandwidth scheduling guided by instruction criticality

Pablo Prieto; Valentin Puente; José A. Gregorio

This paper explores the benefits of scheduling off-chip memory operations in a Chip Multiprocessor (CMP) according to their execution relevance. In a CMP with many out-of-order execution cores, the importance, from the processor's perspective, of the instruction that triggers an access to off-chip memory may vary considerably. Consequently, it makes sense to consider this point of view at the memory controller level to reorder outgoing memory accesses. After exploring different processor-centric sorting criteria, we reach the conclusion that the simplest and most useful metric for scheduling a memory operation is the position in the reorder buffer of the instruction that triggers the on-chip miss. We propose a simple memory controller scheduling policy that employs this information as its main parameter. This proposal significantly improves system responsiveness, both in terms of throughput and fairness. The idea is analyzed through full-system simulation, running a broad set of workloads with diverse memory behavior. When compared with other scheduling algorithms of similar complexity, throughput can be improved by an average of 10% and fairness enhanced by an average of 15%, even in very adverse usage scenarios. Moreover, the idea supports the possibility of dynamically favoring throughput or fairness, according to the end-user requirements.
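
As a rough illustration of the scheduling criterion described above (not the authors' memory-controller implementation), the sketch below orders pending off-chip requests by the reorder-buffer position of the triggering instruction, so misses that block commit sooner are served first; the request fields and the tie-break rule are assumptions.

# Illustrative memory-request scheduler guided by instruction criticality.
# Assumption: each request carries the distance of its triggering
# instruction from the head of its core's reorder buffer (ROB); a smaller
# distance means the miss blocks commit sooner and is therefore more
# critical. Field names and tie-breaking are made up for the example.

from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class MemRequest:
    rob_distance: int              # primary key: position behind ROB head
    arrival: int                   # tie-break: older requests first
    address: int = field(compare=False)
    core: int = field(compare=False)

class CriticalityScheduler:
    def __init__(self):
        self._queue = []

    def enqueue(self, req: MemRequest) -> None:
        heapq.heappush(self._queue, req)

    def next_request(self):
        return heapq.heappop(self._queue) if self._queue else None

if __name__ == "__main__":
    sched = CriticalityScheduler()
    sched.enqueue(MemRequest(rob_distance=40, arrival=1, address=0x100, core=0))
    sched.enqueue(MemRequest(rob_distance=2, arrival=2, address=0x200, core=1))
    print(hex(sched.next_request().address))   # 0x200: closest to ROB head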


IEEE Transactions on Parallel and Distributed Systems | 2016

AC-WAR: Architecting the Cache Hierarchy to Improve the Lifetime of a Non-Volatile Endurance-Limited Main Memory

Pablo Abad; Pablo Prieto; Valentin Puente; José-Ángel Gregorio

This work shows how, by adapting the replacement policies of contemporary cache hierarchies, it is possible to extend the lifespan of a write-endurance-limited main memory by almost one order of magnitude. The idea stems from two observations about cache residency: (1) blocks are modified in a bimodal way, with either most of the block's content modified or most of it never changing, and (2) in most applications, the majority of blocks are only slightly modified. When the cache replacement algorithms take these facts into account, it is possible to significantly reduce the number of bit-flips per write-back to main memory. Our proposal favors the off-chip eviction of slightly modified blocks according to an adaptive replacement algorithm that operates in a coordinated way across L2 and L3. In this way, system memory lifetime improves significantly with negligible performance degradation. We found that a few bits per block are enough to track changes in cache blocks with respect to the main memory content. With a slightly modified sectored LRU and a simple cache performance predictor, it is possible to achieve a simple implementation with minimal area cost and no impact on cache access time. On average, our proposal increases the memory lifetime obtained with an LRU policy by up to 10 times (10×), and by 15 times (15×) when combined with other memory-centric techniques. In both cases, the performance degradation can be considered negligible.
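
A hedged sketch of the underlying idea (not the paper's coordinated L2/L3 algorithm): track with a few bits per block how many sectors differ from main memory, and bias victim selection toward blocks whose write-back would flip few bits; the sector bookkeeping, the scoring weight and the data structures are assumptions.

# Illustrative dirtiness-aware victim selection for a write-endurance-
# limited main memory. Assumption: each cached block keeps a small
# per-sector "modified" count; evicting blocks with few modified sectors
# writes back fewer changed bits. Weights and structure are made up.

from dataclasses import dataclass

@dataclass
class CacheBlock:
    tag: int
    lru_position: int        # 0 = most recently used, higher = older
    modified_sectors: int    # number of sectors that differ from memory

def victim(blocks: list, dirt_weight: float = 2.0) -> CacheBlock:
    """Prefer old blocks, but among similarly old blocks prefer the one
    whose eviction causes the fewest bit-flips in main memory."""
    return max(blocks,
               key=lambda b: b.lru_position - dirt_weight * b.modified_sectors)

if __name__ == "__main__":
    ways = [CacheBlock(0xA, lru_position=7, modified_sectors=7),
            CacheBlock(0xB, lru_position=6, modified_sectors=0),
            CacheBlock(0xC, lru_position=2, modified_sectors=1)]
    print(hex(victim(ways).tag))   # 0xB: nearly as old as 0xA but barely modified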


Parallel Computing | 2015

Improving last level shared cache performance through mobile insertion policies (MIP)

Pablo Abad; Pablo Prieto; Valentin Puente; José-Ángel Gregorio

Highlights: We show the high variability in last-level cache access patterns. When applications interact, current dynamic policies do not have enough flexibility. We present MIP, a replacement policy based on a mobile insertion position. MIP rapidly adapts to sudden changes in access patterns during application runtime. Our solution provides a 30% better hit rate than LRU and 10% better than DRRIP.

For those cache hierarchy levels where program locality is not as evident as in L1, LRU replacement does not seem to be the optimal solution to determine which blocks will be requested soon. The literature is prolific on alternative reuse-distance estimations at the last on-chip cache level, proving the difficulty of achieving an optimal hit rate. One of the key aspects for performance is knowing inter- and intra-application reuse-distance variability. Many solutions already do this, but most of them rely on a simple choice among a few alternative policies. The experiments performed to motivate the proposal confirm application variability, but also show that the behavior of applications is much more than bimodal. This means that there is a performance gap that current hybrid policies are not able to cover. In this paper we propose a mobile insertion position replacement policy (MIP), which combines well-known LRU ordering and promotion policies with a completely adaptive insertion mechanism. The dynamic insertion behavior captures hit-rate variability more accurately. Making use of set dueling and dynamic set sampling for prediction, our mechanism continuously estimates the insertion position that maximizes the cache hit rate. The hardware overhead compared to an LRU replacement algorithm is merely three 3-bit saturating counters per LLC bank. Our experiments show that, for a wide range of applications, MIP is able to improve the hit rate of LRU by 30% on average. MIP outperforms current state-of-the-art replacement policies of similar implementation cost by 10% on average, and by 20% in single-thread and multi-thread workloads.
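
The adaptive-insertion idea can be illustrated with a small, hedged sketch (not the published MIP hardware): an LRU stack whose insertion position for incoming blocks is moved by feedback that, in the real mechanism, would come from set dueling on sampled sets; the class layout, the counter handling and the update rule are assumptions.

# Illustrative mobile-insertion-position (MIP-like) replacement sketch.
# Blocks are kept in an LRU stack; new blocks are inserted at a position
# chosen by a value that set-dueling feedback would adjust at runtime.
# Thresholds and the update interface are assumed for the example.

class MobileInsertionSet:
    def __init__(self, ways: int = 8):
        self.ways = ways
        self.stack = []                     # index 0 = MRU, last = LRU
        self.insert_pos = ways - 1          # start with LRU-like insertion

    def access(self, tag: int) -> bool:
        if tag in self.stack:               # hit: promote to MRU
            self.stack.remove(tag)
            self.stack.insert(0, tag)
            return True
        if len(self.stack) == self.ways:    # miss: evict the true LRU block
            self.stack.pop()
        # Insert at the current (mobile) position instead of always at MRU.
        self.stack.insert(min(self.insert_pos, len(self.stack)), tag)
        return False

    def nudge_insertion(self, delta: int) -> None:
        """Saturating adjustment, driven by sampled-set hit feedback."""
        self.insert_pos = max(0, min(self.ways - 1, self.insert_pos + delta))

if __name__ == "__main__":
    s = MobileInsertionSet(ways=4)
    for t in [1, 2, 3, 4, 5, 1]:
        s.access(t)
    print(s.stack)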


IEEE Transactions on Parallel and Distributed Systems | 2018

Memory Hierarchy Characterization of NoSQL Applications through Full-System Simulation

Adrian Colaso; Pablo Prieto; Jose Angel Herrero; Pablo Abad; Lucia G. Menezo; Valentin Puente; José A. Gregorio

In this work, we conduct a detailed memory characterization of a representative set of modern data-management software (Cassandra, MongoDB, OrientDB and Redis) running an illustrative NoSQL benchmark suite (YCSB). These applications are widely popular NoSQL databases with different data models and features such as in-memory storage. We compare how these data-serving applications behave with respect to other well-known benchmarks, such as SPEC CPU2006, PARSEC and the NAS Parallel Benchmarks. The methodology employed for evaluation relies on state-of-the-art full-system simulation tools, such as gem5. This allows us to explore configurations unattainable with performance monitoring units on actual hardware and to characterize memory properties in detail. The results obtained suggest that NoSQL application behavior is not dissimilar to conventional workloads. Therefore, some of the optimizations present in state-of-the-art hardware might have a direct benefit. Nevertheless, there are some common aspects, distinct from conventional benchmarks, that might be sufficiently relevant to be considered in architectural design. Strikingly, we also found that most database engines, independently of aspects such as workload or database size, exhibit highly uniform behavior. Finally, we show that different database engines make highly distinctive demands on the memory hierarchy, some being more stringent than others.


International Conference on Computer Design | 2012

BIXBAR: A low cost solution to support dynamic link reconfiguration in networks on chip

Pablo Abad; Pablo Prieto; Valentin Puente; José-Ángel Gregorio

Improving link utilization is a key aspect of interconnection network design. Reconfigurable-direction inter-router links optimize network resource utilization, which substantially increases the maximum achievable throughput. In on-chip networks, the short distance between adjacent routers makes fast link arbitration feasible, which makes dynamic link reconfiguration an attractive solution. In this paper we propose a low-cost router microarchitecture that is able to handle reconfigurable links at a marginal cost over a conventional router. The key element of the proposal is a bidirectional crossbar, which enables link reconfiguration without significantly increasing router area and energy. The results obtained indicate that with this proposal, system performance can be improved by up to 25% for some selected workloads, while the energy-performance tradeoff is reduced by 20%, avoiding the additional costs entailed by other state-of-the-art routers capable of dynamic link reconfiguration.
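
To illustrate the reconfigurable-link idea (not the BIXBAR arbitration logic itself), the toy sketch below decides the direction of a bidirectional inter-router link each cycle from the pending traffic at both ends, with a small hysteresis to avoid flapping; the policy and its parameters are assumptions.

# Illustrative per-cycle arbitration of a bidirectional inter-router link.
# Assumption: each cycle the link is granted to whichever side has more
# flits waiting to cross it, with hysteresis to avoid rapid direction
# changes. This is a toy policy, not the published BIXBAR arbitration.

def arbitrate(pending_a_to_b: int, pending_b_to_a: int,
              current: str, hysteresis: int = 2) -> str:
    """Return 'A->B' or 'B->A' as the link direction for the next cycle."""
    if current == "A->B":
        # Only flip if the opposite direction is clearly more loaded.
        return "B->A" if pending_b_to_a >= pending_a_to_b + hysteresis else "A->B"
    return "A->B" if pending_a_to_b >= pending_b_to_a + hysteresis else "B->A"

if __name__ == "__main__":
    print(arbitrate(1, 5, current="A->B"))   # flips to B->A
    print(arbitrate(3, 4, current="A->B"))   # stays A->B (within hysteresis)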


Digital Systems Design | 2013

Interaction of NoC Design and Coherence Protocol in 3D-Stacked CMPs

Pablo Abad; Pablo Prieto; Lucia G. Menezo; Adrian Colaso; Valentin Puente; José A. Gregorio

Collaboration


Dive into Pablo Prieto's collaborations.

Top Co-Authors

Pablo Abad

University of Cantabria
