Pedro Trancoso
Chalmers University of Technology
Publications
Featured research published by Pedro Trancoso.
computing frontiers | 2017
Mwaffaq Otoom; Aamer Jaleel; Pedro Trancoso
The trend of increasing the number of cores in a processor leads to certain challenges, among them the fact that more cores issue more memory requests, which in turn increases the competition, or interference, for shared resources such as the Last-Level Cache (LLC). In this work we focus on cache interference while executing Decision Support System queries, a common case in a Data Center scenario. We study the co-execution of different queries from the TPC-H benchmark using the PostgreSQL DBMS on a multicore with up to 16 cores and different LLC configurations. In addition to the working-set metric, to better understand the effects of co-execution, we develop two new personality metrics that classify the behavior of queries in co-execution: the social and sensitive metrics. These metrics can be used to manage cache interference and thus improve the co-execution performance of the queries.
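The two personality metrics can be illustrated with a minimal sketch. The formulas and threshold below are assumptions for illustration, not the paper's exact definitions: a query is "sensitive" if it slows down noticeably when co-run, and "social" if it inflicts little slowdown on its co-runner.

```python
def sensitivity(solo_time, corun_time):
    """Slowdown the query itself suffers under co-execution."""
    return corun_time / solo_time

def inflicted_slowdown(partner_solo, partner_corun):
    """Slowdown the query inflicts on its co-runner."""
    return partner_corun / partner_solo

def classify(sens, infl, threshold=1.1):
    """Tag a query's co-execution personality (threshold is an assumption)."""
    tags = []
    tags.append("sensitive" if sens > threshold else "insensitive")
    tags.append("social" if infl <= threshold else "antisocial")
    return tags

# Example: query Q1 runs in 10 s alone, 14 s co-run with Q5;
# Q5 alone takes 8 s, 8.4 s alongside Q1.
print(classify(sensitivity(10, 14), inflicted_slowdown(8, 8.4)))
# → ['sensitive', 'social']
```

A cache manager could then, for instance, avoid co-scheduling two sensitive queries on cores sharing the same LLC.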
Proceedings of the 1st Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems | 2017
Andreas Diavastos; Pedro Trancoso
Scheduling task-based parallel applications on many-core processors is becoming more challenging and has received much attention recently. The main challenge is to efficiently map the tasks to the underlying hardware topology using application characteristics, such as the dependences between tasks, in order to satisfy the requirements. To achieve this, each application must be studied exhaustively to determine how the different tasks use the data, which provides the knowledge needed to map tasks that share the same data close to each other. In addition, different hardware topologies require different mappings of the same application to produce the best performance. In this work we use the synchronization graph of a task-based parallel application, produced during compilation, to automatically tune the scheduling policy for any underlying hardware using heuristic-based Genetic Algorithm techniques. This tool is integrated into an actual task-based parallel programming platform called SWITCHES and is evaluated using real applications from the SWITCHES benchmark suite. We compare our results with the execution time of predefined schedules within SWITCHES and observe that the tool converges close to an optimal solution with no effort from the user while using fewer resources.
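The core idea, evolving a task-to-core mapping against a cost derived from the synchronization graph, can be sketched in a few lines. The graph, cost model, and GA parameters below are illustrative assumptions, not SWITCHES internals: fitness simply counts dependences that cross cores.

```python
import random

def comm_cost(mapping, edges):
    # Each dependence between tasks placed on different cores pays a unit cost.
    return sum(1 for a, b in edges if mapping[a] != mapping[b])

def evolve(n_tasks, n_cores, edges, pop=30, gens=200, seed=0):
    """Evolve a task->core mapping minimizing cross-core dependences."""
    rnd = random.Random(seed)
    popn = [[rnd.randrange(n_cores) for _ in range(n_tasks)] for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=lambda m: comm_cost(m, edges))
        survivors = popn[: pop // 2]          # elitist selection
        children = []
        for _ in range(pop - len(survivors)):
            a, b = rnd.sample(survivors, 2)
            cut = rnd.randrange(1, n_tasks)
            child = a[:cut] + b[cut:]         # one-point crossover
            if rnd.random() < 0.2:            # mutation: move one task
                child[rnd.randrange(n_tasks)] = rnd.randrange(n_cores)
            children.append(child)
        popn = survivors + children
    return min(popn, key=lambda m: comm_cost(m, edges))

# Chain of 6 dependent tasks mapped onto 2 cores.
edges = [(i, i + 1) for i in range(5)]
best = evolve(6, 2, edges)
print(best, comm_cost(best, edges))
```

A real tuner would add load-balance and locality terms to the fitness function; this sketch only shows the search mechanism.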
ACM Transactions on Architecture and Code Optimization | 2017
Andreas Diavastos; Pedro Trancoso
SWITCHES is a task-based dataflow runtime that implements a lightweight distributed triggering system for runtime dependence resolution and uses static scheduling and compile-time assignment policies to reduce runtime overheads. Unlike other systems, the granularity of loop-tasks can be increased to favor data locality, even in the presence of dependences across different loops. SWITCHES introduces explicit task resource allocation mechanisms for efficient allocation of resources and adopts the latest OpenMP Application Programming Interface (API) so as to maintain high levels of programming productivity. It provides a source-to-source tool that automatically produces thread-based code. Performance on an Intel Xeon Phi shows good scalability and surpasses OpenMP by an average of 32%.
computing frontiers | 2018
Adrian Cristal; Osman S. Unsal; Xavier Martorell; Paul M. Carpenter; Raúl de la Cruz; Leonardo Bautista; Daniel A. Jiménez; Carlos Álvarez; Behzad Salami; Sergi Madonar; Miquel Pericàs; Pedro Trancoso; Micha vor dem Berge; Gunnar Billung-Meyer; Stefan Krupop; Wolfgang Christmann; Frank Klawonn; Amani Mihklafi; Tobias Becker; Georgi Gaydadjiev; Hans Salomonsson; Devdatt Dubhashi; Oron Port; Yoav Etsion; Vesna Nowack; Christof Fetzer; Jens Hagemeyer; Thorsten Jungeblut; Nils Kucza; Martin Kaiser
LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC.
international conference on parallel processing | 2017
Stavros Tzilis; Miquel Pericàs; Pedro Trancoso; Ioannis Sourdis
This paper explores the potential of utilizing approximate system load information to enhance work stealing for dynamic load balancing in hierarchical multicore systems. Maintaining information about the load of a system has not been extensively researched since it is assumed to introduce performance overheads. We propose SWAS, a lightweight approximate scheme for retrieving and using such information, based on compact bit vector structures and lightweight update operations. This approximate information is used to enhance the effectiveness of work stealing decisions. Evaluating SWAS for a number of representative scenarios on a multi-socket multi-core platform showed that work stealing guided by approximate system load information achieves considerable performance improvements: up to 18.5% for dynamic, severely imbalanced workloads; and up to 34.4% for workloads with complex task dependencies, when compared with random work stealing.
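The compact bit-vector idea can be sketched concretely. The encoding below is a hedged illustration in the spirit of SWAS, not the paper's exact scheme: each core's load is quantized to 2 bits (0..3) and packed into a single integer, so updates and lookups are cheap shifts and masks, and a thief steals from the apparently most loaded core instead of a random one.

```python
BITS = 2                    # quantized load levels 0..3 (an assumption)
MASK = (1 << BITS) - 1

def set_load(vec, core, level):
    """Record a core's quantized load level in the packed vector."""
    shift = core * BITS
    return (vec & ~(MASK << shift)) | ((level & MASK) << shift)

def get_load(vec, core):
    return (vec >> (core * BITS)) & MASK

def pick_victim(vec, cores):
    """Guide work stealing: target the apparently most loaded core."""
    return max(cores, key=lambda c: get_load(vec, c))

vec = 0
for core, level in [(0, 1), (1, 3), (2, 0), (3, 2)]:
    vec = set_load(vec, core, level)
print(pick_victim(vec, [0, 1, 2, 3]))   # core 1 holds the most work
```

Because the information is approximate and updated lazily, stale reads are tolerated; the payoff is that maintaining it stays cheap enough not to erase the stealing gains.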
acm international conference on systems and storage | 2017
Panayiotis Petrides; Pedro Trancoso
As the number of cores in a single-chip processor increases, several challenges arise: wire delays, contention for off-chip accesses, and core heterogeneity. To address these issues and the applications' demands, future large-scale many-core processors are expected to be organized as a collection of NUMA clusters of heterogeneous cores. In this work we propose a scheduler that takes into account the non-uniform memory latency, the heterogeneity of the cores, and the contention at the memory controller to find the core that best matches an application's memory and compute requirements. Scheduling decisions are based on an on-line classification process that determines an application's requirements as either memory- or compute-bound. We evaluate our proposed scheduler on the 48-core Intel SCC using applications from the SPEC CPU2006 benchmark suite. Our results show that even when all cores are busy, migrating processes to cores that better match the applications' requirements results in an overall performance improvement. In particular, we observed a reduction in execution time of 15% to 36% compared to a random static scheduling policy.
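The on-line classification step can be sketched as follows. The counter ratio, threshold, and topology here are invented for the example, not the paper's calibration: a process with a high last-level-cache miss rate is tagged memory-bound and steered toward cores close to a memory controller.

```python
MPKI_THRESHOLD = 5.0   # LLC misses per kilo-instruction (assumed cutoff)

def classify(llc_misses, instructions):
    """Tag a process as memory- or compute-bound from counter samples."""
    mpki = llc_misses / (instructions / 1000.0)
    return "memory-bound" if mpki > MPKI_THRESHOLD else "compute-bound"

def pick_core(kind, near_mc_cores, far_cores):
    # Memory-bound work goes to cores with low latency to the controller;
    # compute-bound work can run anywhere, freeing the near cores.
    return (near_mc_cores if kind == "memory-bound" else far_cores)[0]

kind = classify(llc_misses=80_000, instructions=10_000_000)
print(kind, pick_core(kind, near_mc_cores=[0, 1], far_cores=[4, 5]))
# → memory-bound 0
```

Re-sampling the counters periodically lets the scheduler react when an application changes phase.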
Proceedings of the International Symposium on Memory Systems | 2017
Mats Rimborg; Pedro Trancoso; Gunnar Carlstedt
Parallelism is inherent in most problems, but because current programming models and architectures have evolved from a sequential paradigm, the parallelism exploited is restricted. We believe that the most efficient parallel execution is achieved when applications are represented as graphs of operations and data, which can then be mapped for execution on a modular and scalable processing-in-memory architecture. In this paper, we present PHOENIX, a general-purpose architecture composed of many Processing Elements (PEs) with memory storage and efficient computational logic units, interconnected by a mesh network-on-chip. A preliminary design of PHOENIX shows it is possible to include 10,000 PEs with a storage capacity of 0.6 GByte on a 1.5 cm² chip using 14 nm technology. PHOENIX may achieve 6 TFLOPS with a power consumption of up to 42 W, which results in a peak energy efficiency of at least 143 GFLOPS/W. A simple estimate shows that for a 4K FFT, PHOENIX achieves 117 GFLOPS/W, more than double that achieved by state-of-the-art systems.
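The quoted peak efficiency follows directly from the peak throughput and power figures; a quick check:

```python
# Sanity check of the peak efficiency arithmetic quoted above.
peak_flops = 6e12          # 6 TFLOPS peak
power_w = 42               # maximum power consumption in watts
gflops_per_w = peak_flops / power_w / 1e9
print(round(gflops_per_w))  # ≈ 143 GFLOPS/W
```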
Proceedings of the International Symposium on Memory Systems | 2017
Alirad Malek; Evangelos Vasilakis; Vasileios Papaefstathiou; Pedro Trancoso; Ioannis Sourdis
An application may have different sensitivity to faults in different subsets of the data it uses, so some data regions may be more critical than others. Capitalizing on this observation, Odd-ECC provides a mechanism to dynamically select, on demand, the memory fault tolerance of each allocated page of a program depending on the criticality of the respective data. Odd-ECC error correcting codes (ECCs) are stored in separate physical pages and hidden by the OS as pages unavailable to the user. Still, these ECCs are physically aligned with the data they protect, so the memory controller can access them efficiently. Thereby, the capacity, performance, and energy overheads of memory fault tolerance are proportional to the criticality of the data stored. Odd-ECC is applied to memory systems that use conventional 2D DRAM DIMMs as well as to 3D-stacked DRAMs and is evaluated using various applications. Compared to flat memory protection schemes, Odd-ECC substantially reduces ECC capacity overheads while achieving the same Mean Time to Failure (MTTF), and in addition it slightly improves performance and energy costs. Under the same capacity constraints, Odd-ECC achieves substantially higher MTTF than a flat memory protection scheme. This comes at a performance and energy cost, which is however still a fraction of the cost introduced by a flat, equally strong scheme.
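The per-page selection idea can be sketched abstractly. The criticality levels, code choices, and per-page overheads below are illustrative assumptions, not the paper's exact scheme; the point is only that total ECC capacity scales with how much of the data is critical.

```python
ECC_SCHEMES = {
    # criticality: (scheme name, assumed ECC bytes per 4 KiB page)
    "none":   ("no protection", 0),
    "low":    ("parity",        8),
    "medium": ("SEC-DED",       512),
    "high":   ("chipkill-like", 1024),
}

class OddEccAllocator:
    """Toy allocator recording a per-page protection choice."""
    def __init__(self):
        self.pages = {}                  # page id -> criticality

    def alloc(self, page_id, criticality):
        self.pages[page_id] = criticality
        return ECC_SCHEMES[criticality][0]

    def ecc_overhead_bytes(self):
        return sum(ECC_SCHEMES[c][1] for c in self.pages.values())

a = OddEccAllocator()
a.alloc(0, "high")      # e.g. metadata that must survive faults
a.alloc(1, "none")      # e.g. easily recomputed scratch buffer
a.alloc(2, "medium")
print(a.ecc_overhead_bytes())   # 1024 + 0 + 512 = 1536
```

A flat scheme would instead pay the strongest overhead for every page; here only the pages that need it do.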
MARC Symposium | 2011
Panayiotis Petrides; Andreas Diavastos; Pedro Trancoso
design, automation, and test in europe | 2018
Evangelos Vasilakis; Vassilis Papaefstathiou; Pedro Trancoso; Ioannis Sourdis