Vasileios Porpodas | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Vasileios Porpodas is active.

Explore More

Publication

Featured researches published by Vasileios Porpodas.

high-performance computer architecture | 2012

Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs

Karthik T. Sundararajan; Vasileios Porpodas; Timothy M. Jones; Nigel P. Topham; Björn Franke

Intelligently partitioning the last-level cache within a chip multiprocessor can bring significant performance improvements. Resources are given to the applications that can benefit most from them, restricting each core to a number of logical cache ways. However, although overall performance is increased, existing schemes fail to consider energy saving when making their partitioning decisions. This paper presents Cooperative Partitioning, a runtime partitioning scheme that reduces both dynamic and static energy while maintaining high performance. It works by enforcing cached data to be way-aligned, so that a way is owned by a single core at any time. Cores cooperate with each other to migrate ways between themselves after partitioning decisions have been made. Upon access to the cache, a core needs only to consult the ways that it owns to find its data, saving dynamic energy. Unused ways can be power-gated for static energy saving. We evaluate our approach on two-core and four-core systems, showing that we obtain average dynamic and static energy savings of 35% and 25% compared to a fixed partitioning scheme. In addition, Cooperative Partitioning maintains high performance while transferring ways five times faster than an existing state-of-the-art technique.

symposium on code generation and optimization | 2015

PSLP: padded SLP automatic vectorization

Vasileios Porpodas; Alberto Magni; Timothy M. Jones

The need to increase performance and power efficiency in modern processors has led to a wide adoption of SIMD vector units. All major vendors support vector instructions and the trend is pushing them to become wider and more powerful. However, writing code that makes efficient use of these units is hard and leads to platform-specific implementations. Compiler-based automatic vectorization is one solution for this problem. In particular the Superword-Level Parallelism (SLP) vectorization algorithm is the primary way to automatically generate vector code starting from straight-line scalar code. SLP is implemented in all major compilers, including GCC and LLVM. SLP relies on finding sequences of isomorphic instructions to pack together into vectors. However, this hinders the applicability of the algorithm as isomorphic code sequences are not common in practice. In this work we propose a solution to overcome this limitation. We introduce Padded SLP (PSLP), a novel vectorization algorithm that can vectorize code containing non-isomorphic instruction sequences. It injects a near-minimal number of redundant instructions into the code to transform non-isomorphic sequences into isomorphic ones. The padded instruction sequence can then be successfully vectorized. Our experiments show that PSLP improves vectorization coverage across a number of kernels and full benchmarks, decreasing execution time by up to 63%.

international conference on parallel architectures and compilation techniques | 2015

Throttling Automatic Vectorization: When Less is More

Vasileios Porpodas; Timothy M. Jones

SIMD vectors are widely adopted in modern general purpose processors as they can boost performance and energy efficiency for certain applications. Compiler-based automatic vectorization is one approach for generating codethat makes efficient use of the SIMD units, and has the benefit of avoiding hand development and platform-specific optimizations. The Superword-Level Parallelism (SLP) vectorization algorithm is the most well-known implementation of automatic vectorization when starting from straight-line scalar code, and is implemented in several major compilers. The existing SLP algorithm greedily packs scalar instructions into vectors starting from stores and traversing the data dependence graph upwards until it reaches loads or non-vectorizable instructions. Choosing whether to vectorize is a one-off decision for the whole graph that has been generated. This, however, is sub-optimal because the graph may contain code that is harmful to vectorization due to the need to move data from scalar registers into vectors. The decision does not consider the potential benefits of throttling the graph by removing this harmful code. In this work we propose asolution to overcome this limitation by introducing Throttled SLP (TSLP), a novel vectorization algorithm that finds the optimal graph to vectorize, forcing vectorization to stop earlier whenever this is beneficial. Our experiments show that TSLP improves performance across a number of kernels extractedfrom widely-used benchmark suites, decreasing execution time compared to SLP by 9% on average and up to 14% in the best case.

languages and compilers for parallel computing | 2013

DRIFT: Decoupled CompileR-Based Instruction-Level Fault-Tolerance

Konstantina Mitropoulou; Vasileios Porpodas; Marcelo Cintra

Compiler-based error detection methodologies replicate the instructions of the program and insert checks wherever it is needed. The checks evaluate code correctness and decide whether or not an error has occurred. The replicated instructions and the checks cause a large slowdown. In this work, we focus on reducing the error detection overhead and improving the system’s performance without degrading fault-coverage. DRIFT achieves this by decoupling the execution of the code (original and replicated) from the checks.

languages, compilers, and tools for embedded systems | 2013

LUCAS: latency-adaptive unified cluster assignment and instruction scheduling

Vasileios Porpodas; Marcelo Cintra

Clustered VLIW architectures are statically scheduled wide-issue architectures that combine the advantages of wide-issue processors along with the power and frequency scalability of clustered designs. Being statically scheduled, they require that the decision of mapping instructions to clusters be done by the compiler. State-of-the-art code generation for such architectures combines cluster-assignment and instruction scheduling in a single unified pass. The performance of the generated code, however, is very susceptible to the inter-cluster communication latency. This is due to the nature of the two clustering heuristics used. One is aggressive and works well for low inter-cluster latencies, while the other is more conservative and works well only for high latencies. In this paper we propose LUCAS, a novel unified cluster-assignment and instruction-scheduling algorithm that adapts to the inter-cluster latency better than the existing state-of-the-art schemes. LUCAS is a hybrid scheme that performs fine-grain switching between the two state-of-the art clustering heuristics, leading to better scheduling than either of them. It generates better performing code for a wide range of inter-cluster latency values.

international conference on supercomputing | 2016

Lynx: Using OS and Hardware Support for Fast Fine-Grained Inter-Core Communication

Konstantina Mitropoulou; Vasileios Porpodas; Xiaochun Zhang; Timothy M. Jones

Designing high-performance software queues for fast intercore communication is challenging, but critical for maximising software parallelism. State-of-the-art single-producer / single-consumer queues for streaming applications contain multiple sections, requiring the producer and consumer to operate independently on different sections from each other. While these queues perform well for coarse-grained data transfers, they perform poorly in the fine-grained case. This paper proposes Lynx, a novel SP/SC queue, specifically tuned for fine-grained communication. Lynx is built from the ground up, reducing the generated code on the critical-path to just two operations per enqueue and dequeue. To achieve this it relies on existing commodity processor hardware and operating system exception handling support to deal with infrequent queue maintenance operations. Lynx outperforms the state-of-the art by up to 1.57x in total 64-bit throughput reaching a peak throughput of 15.7GB/s on a common desktop system. Real applications using Lynx get a performance improvement of up to 1.4x.

compilers architecture and synthesis for embedded systems | 2013

CAeSaR: unified cluster-assignment scheduling and communication reuse for clustered VLIW processors

Vasileios Porpodas; Marcelo Cintra

Clustered architectures have been proposed as a solution to the scalability problem of wide ILP processors. VLIW architectures, being wide-issue by design, benefit significantly from clustering. Such architectures, being both statically scheduled and clustered, require specialized code generation techniques, as they require explicit Inter-Cluster Copy instructions (ICCs) be scheduled in the code stream. In this work we propose CAeSaR, a novel instruction scheduling algorithm that improves code generation for such architectures. It combines cluster assignment, instruction scheduling and inter-cluster communication reuse all in one single unified algorithm. The proposed algorithm improves performance by any phase-ordering issues among these three code generation and optimization steps. We evaluate CAeSaR on the MediabenchII and SPEC CINT2000 benchmarks and compare it against the state-of-the-art instruction scheduling algorithm. Our results show an improvement in execution time of up to 20.3%, and 13.8% on average, over the current state-of-the-art across the benchmarks.

design, automation, and test in europe | 2007

A Decoupled Architecture of Processors with Scratch-Pad Memory Hierarchy

Athanasios Milidonis; Nikolaos Alachiotis; Vasileios Porpodas; Harris E. Michail; Athanasios P. Kakarountas; Costas E. Goutis

This paper present a decoupled architecture of processors with a memory hierarchy of only scratch-pad memories, and a main memory. The decoupled architecture also exploits the parallelism between address computation and processing the application data. The application code is split in two programs the first for computing the addresses of the data in the memory hierarchy and the second for processing the application data. The first program is executed by one of the decoupled processors called Access which uses compiler methods for placing data in the memory hierarchy. In parallel, the second program is executed by the other processor called Execute. The synchronization of the memory hierarchy and the Execute processor is achieved through simple handshake protocol. The Access processor requires strong communication with the memory hierarchy which strongly differentiates it from traditional uniprocessors. The architecture is compared in performance with the MIPS IV architecture of SimpleScalar and with the existing decoupled architectures showing its higher normalized performance. Experimental results show that the performance is increased up to 3.7 times. Compared with MIPS IV the proposed architecture achieves the above performance with insignificant overheads in terms of area

symposium on code generation and optimization | 2018

Look-ahead SLP: auto-vectorization in the presence of commutative operations

Vasileios Porpodas; Rodrigo C. O. Rocha; Luís Fabrício Wanderley Góes

Auto-vectorizing compilers automatically generate vector (SIMD) instructions out of scalar code. The state-of-the-art algorithm for straight-line code vectorization is Superword-Level Parallelism (SLP). In this work we identify a major limitation at the core of the SLP algorithm, in the performance-critical step of collecting the vectorization candidate instructions that form the SLP-graph data structure. SLP lacks global knowledge when building its vectorization graph, which negatively affects its local decisions when it encounters commutative instructions. We propose LSLP, an improved algorithm that can plug-in to existing SLP implementations, and can effectively vectorize code with arbitrarily long chains of commutative operations. LSLP relies on short-depth look-ahead for better-informed local decisions. Our evaluation on a real machine shows that LSLP can significantly improve the performance of real-world code with little compilation-time overhead.

international conference on parallel architectures and compilation techniques | 2017

SuperGraph-SLP Auto-Vectorization

Vasileios Porpodas

SIMD vectors help improve the performance of certain applications. The code gets vectorized into SIMD form either by hand, or automatically with auto-vectorizing compilers. The Superword-Level Parallelism (SLP) vectorization algorithm is a widely used algorithm for vectorizing straight-line code and is part of most industrial compilers. The algorithm attempts to pack scalar instructions into vectors starting from specific seed instructions in a bottom-up way. This approach, however, suffers from two main problems: (i) the algorithm may not reach instructions that could have been vectorized, and (ii) atomically operating on individual SLP graphs suffers from cost overestimation when consecutive SLP graphs share data. Both issues lead to missed vectorization opportunities even in simple code.In this work we propose SuperGraph-SLP (SG-SLP), an improved vectorization algorithm that overcomes these limitations of the existing algorithm. SG-SLP operates on a larger region, called the SuperGraph. This allows it to reach and successfully vectorize code that was previously unreachable. Moreover, the new region helps eliminate the inaccuracies in the cost-calculation as it allows for a more holistic view of the code. Our experiments show that SG-SLP improves the vectorization coverage and outperforms the state-of-the-art SLP across a number kernels by 36% on average, without affecting the compilation time.

Explore More